Some very basic notes on accessing Hive table data via Spark SQL. When I tried this, the SQL feature set supported by Spark SQL was not up to the level provided by Hive. I was using Hive 0.13 and Spark 1.2.
First, make sure that Spark is built with Hive support enabled. Building Spark is a separate issue that involves lots of considerations; see my post below on integrating native BLAS for MLlib.
Make sure that Spark has access to hive-site.xml. I copied mine to Spark's conf folder:
cp [/usr/lib | hadoop install]/hive/conf/hive-site.xml [spark]/conf
Set the YARN conf dir:
export YARN_CONF_DIR=/etc/hadoop/conf
Launch the Spark shell in YARN client mode:
./bin/spark-shell --verbose --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1
Use the SparkContext to create a HiveContext, then execute SQL like so (the older hql method is deprecated in favor of sql):
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("use spark_test")
hiveContext.sql("INSERT INTO TABLE ...")
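Reading data back out works the same way through the context. A minimal sketch, assuming it runs inside spark-shell (where `sc` is already defined) and that a table exists to query — `spark_test` and `my_table` below are hypothetical example names:

```scala
// Runs inside spark-shell, where `sc` (the SparkContext) is predefined.
// `spark_test` and `my_table` are hypothetical example names.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("use spark_test")

// On Spark 1.2 the result is a SchemaRDD (a DataFrame on 1.3+);
// collect() pulls the rows back to the driver.
val rows = hiveContext.sql("SELECT * FROM my_table LIMIT 10")
rows.collect().foreach(println)
```

Keep the LIMIT (or use take) while experimenting, since collect() materializes the whole result set on the driver.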
HDP Notes:
I’m using Spark 1.4 on Hortonworks 2.2 (Hadoop 2.6)
There is an additional step. Edit
$SPARK_HOME/conf/spark-defaults.conf
and add the following settings:
spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041
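The hdp.version value must match the HDP build actually installed on your cluster; 2.2.0.0-2041 is just the build on my machines. On HDP hosts, the hdp-select tool (assuming it is installed in the usual place) reports the installed build:

```shell
# Print the HDP build(s) installed on this host; use that value
# for -Dhdp.version in spark-defaults.conf.
hdp-select versions
```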
Lastly, there is a parsing problem with some of the strings in hive-site.xml. If you see this exception from Hive:
java.lang.RuntimeException: java.lang.NumberFormatException: For input string: "5s"
Then change these settings in hive-site.xml to drop the 's' suffix:
hive.metastore.client.connect.retry.delay (change 5s to 5)
hive.metastore.client.socket.timeout (change 1800s to 1800)
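The error happens because the older Hive client bundled with Spark reads these values with a plain integer parse, while newer Hive releases write them with a time-unit suffix. A minimal sketch of the symptom (not Hive's actual parsing code):

```scala
// Minimal illustration: a plain integer parse rejects a time-unit
// suffix, which is what produces the NumberFormatException above.
def parses(s: String): Boolean =
  try { s.toInt; true }
  catch { case _: NumberFormatException => false }

println(parses("5s"))  // false: "5s" is not a plain integer
println(parses("5"))   // true: the bare number parses fine
```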