Cloudera CDH 12: Run Spark 2 via Hue with Shell

The following setup is not supported by Cloudera on CDH 12 or earlier. The configuration is provided for informational purposes only; use it at your own risk and discretion. For now, the only way to run Spark 2 jobs from Hue is via a Shell action in the workflow. The latest Cloudera releases support Spark 2 natively.

CDH 12 is not configured to run Spark 2 out of the box; the default Spark version is 1.6.

It is possible, though tricky, to configure CDH to run Spark 2; it requires a number of environment variables, depending on how the cluster has been customized through Cloudera Manager (CM).
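
A quick way to confirm which Spark a gateway host launches by default is to ask each launcher for its version. This is a minimal sketch, assuming the Spark 2 parcel/CSD is already installed and provides the spark2-submit wrapper:

    # Spark launcher shipped with CDH (reports 1.6.x on a stock gateway)
    spark-submit --version 2>&1 | grep -i version

    # Spark 2 launcher added by the SPARK2 parcel/CSD, if installed
    spark2-submit --version 2>&1 | grep -i version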

Since Cloudera already documents how to upgrade the repo and install and enable Spark 2 in the cluster, we will concentrate on the Hue configuration and how to run the jobs. Spark 2 jobs do not run natively under Hue, whose native Spark action uses Spark 1.6, so they must be run from a shell script.

  • Set up an HDFS directory for the job (a staging sketch follows after this list)
    $ hdfs dfs -ls /user/admin/apps/db-2-hdfs
    Found 3 items
    -rw-r--r--   3 admin admin       1406 2018-07-05 15:39 /user/admin/apps/db-2-hdfs/db-2-hdfs.sh
    -rw-r--r--   3 admin admin      11016 2018-07-14 14:21 /user/admin/apps/db-2-hdfs/db-2-hdfs_2.11-0.1.0-SNAPSHOT.jar
    -rw-r--r--   3 admin admin       2202 2018-07-13 11:03 /user/admin/apps/db-2-hdfs/db.properties
  • Create a new Workflow in Hue

    • Query > Scheduler > Workflow > mySpark2Demo
  • Build a Spark 2 job with Eclipse (Scala 2.11 was the latest version I could use) or other development software.
  • Create a shell script to run the Spark 2 jar file and declare any other dependencies

    $ vi db-2-hdfs.sh

    #!/usr/bin/env bash
    
    # Author: Levi Hernandez
    # Description:
    # Connect to an Oracle DB and copy tables into HDFS via Spark 2.
    # I am building my own DB copy process because Sqoop 2 does not meet DIFF needs
    
    # Spark 2 jobs run by Hue Oozie workflows require unsetting HADOOP_CONF_DIR
    unset HADOOP_CONF_DIR
    
    # Location of the application home dir
    apphome=hdfs://nameservice1/user/admin/apps/db-2-hdfs
    
    # Location of the Oracle, MySQL, and PostgreSQL DB drivers in HDFS
    applib=hdfs://nameservice1/user/admin/drivers
    
    # First custom parameter declared in Hue
    frst=${1}
    
    # Second custom parameter declared in Hue
    scnd=${2}
    
    # I use custom files stored in HDFS to hold DB information for connectivity to an external DB
    # db.properties is a file containing information about tablename, database type, etc
    # ${frst} is the source table in the DB I want to copy
    # ${scnd} is the destination table in Hive
    # --files is the location of my job's files
    
    # Basically, the shell and Hue need the file locations because Spark 2 creates its own
    # temporary environment to host all the jars, properties, and other files.
    # Once the job completes, that temporary environment is removed.
    
    # Execute the Spark 2 JAR; spark-submit points to spark2-submit (changed via alternatives)
    spark-submit --verbose \
        --class edu.domain.Reports \
        --master yarn \
        --deploy-mode cluster \
        --jars ${applib}/ojdbc8.jar \
        --driver-class-path ${applib}/ojdbc8.jar \
        --files ${apphome}/db.properties \
        ${apphome}/db-2-hdfs_2.11-0.1.0-SNAPSHOT.jar \
        "filepath:db.properties" "table:${frst}" "table-name:${scnd}"
    

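For reference, staging the script, jar, and properties file into the HDFS application directory can look like the sketch below. The db.properties keys shown are only hypothetical examples of the connection metadata described in the script comments, not the actual file contents:

    # Example db.properties holding the connection metadata the job reads
    # (these keys are illustrative assumptions, not the real file)
    cat > db.properties <<'EOF'
    db.type=oracle
    db.url=jdbc:oracle:thin:@//dbhost.example.edu:1521/REPORTS
    db.user=report_reader
    EOF

    # Create the application directory in HDFS and upload the job artifacts
    hdfs dfs -mkdir -p /user/admin/apps/db-2-hdfs
    hdfs dfs -put -f db-2-hdfs.sh db-2-hdfs_2.11-0.1.0-SNAPSHOT.jar db.properties /user/admin/apps/db-2-hdfs/
    hdfs dfs -ls /user/admin/apps/db-2-hdfs
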
Configure Hue to run the shell script that launches our Spark 2 jar and its dependencies

  • Drag the “Shell Command” icon into the workflow
  • Enter the name of the shell script, “db-2-hdfs.sh”, in the job name field
  • In the “Arguments” area, declare the values of the ${frst} and ${scnd} variables

  • In the “Files” area, next to “Arguments”, enter the path of the file to run (here, the shell script)

    • Enter a variable named ${fileshell}, which is defined in the job’s global “Settings” area (gear icon on the top right of the job menu)
  • Click on the multi-gear icon to edit the environment variables

    • Set “Properties tab > ENVIRONMENT VARIABLES” input to “HADOOP_USER_NAME=${wf:user()}”
  • Click on the global settings (single gear icon at the top right of the workflow screen) and set up the following:

    • Variables Area Inputs:
    • oozie.use.system.libpath = true
    • fileshell = /user/admin/apps/db-2-hdfs/db-2-hdfs.sh

    • Save the workflow and run it

    • When failures occur, open the Spark2 service and check the YARN logs for the error (see the sketch after this list)
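
When a run fails, the YARN application logs usually contain the real error. A minimal sketch of pulling them from the command line, assuming a YARN gateway on the host; the application ID below is a placeholder (the real one is shown on the Hue/Oozie job page):

    # List recent failed or killed applications to find the run in question
    yarn application -list -appStates FAILED,KILLED

    # Pull the aggregated container logs for that application (placeholder ID)
    yarn logs -applicationId application_1530000000000_0042 | less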

My environment is configured for High Availability (hence the logical nameservice, nameservice1, in the HDFS URIs instead of an actual NameNode hostname), uses only the Spark 2 spark-submit re-pointed via alternatives, and has SSL enabled.
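
For reference, re-pointing spark-submit at the Spark 2 launcher can be done with the alternatives system. A minimal sketch, assuming the Spark 2 launcher lives at /usr/bin/spark2-submit; the path and priority are assumptions, so adjust them for your install:

    # Register spark2-submit as an alternative target for spark-submit (path and priority are assumptions)
    sudo alternatives --install /usr/bin/spark-submit spark-submit /usr/bin/spark2-submit 20

    # Verify which binary spark-submit now resolves to
    alternatives --display spark-submit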
