Some very basic beta on accessing Hive table data via Spark SQL. When I tried this, the SQL feature set supported by Spark SQL did not match what Hive provides. I was using Hive 0.13 and Spark 1.2.

First, make sure that Spark is built with Hive enabled. Building Spark is a separate topic that involves lots of considerations; see my post below on integrating native BLAS for MLlib.

Make sure that Spark has access to hive-site.xml. I copied mine to the Spark conf folder:
cp [/usr/lib | hadoop install]/hive/conf/hive-site.xml [spark]/conf
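For example, on a typical HDP layout (these paths are assumptions; adjust to your install):

cp /etc/hive/conf/hive-site.xml $SPARK_HOME/conf/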

Set the YARN conf dir:

export YARN_CONF_DIR=/etc/hadoop/conf

Launch the Spark shell in YARN mode:

./bin/spark-shell --verbose --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1

Use the SparkContext to get a HiveContext and then execute SQL like so (hql still worked on Spark 1.2 but was deprecated in favor of sql and removed in later releases, so use sql):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("use spark_test")
hiveContext.sql("INSERT INTO TABLE ...")

HDP Notes:

I'm using Spark 1.4 on Hortonworks HDP 2.2 (Hadoop 2.6), and there is an additional step.

Edit $SPARK_HOME/conf/spark-defaults.conf and add the following settings:
spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041
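The 2.2.0.0-2041 value above is from my cluster and yours may differ; if hdp-select is available on your nodes (it ships with HDP 2.2+), it will report the exact version string:

hdp-select status hadoop-client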

Lastly, there is a parsing problem with some of the values in hive-site.xml. If you see this exception from Hive:

java.lang.RuntimeException: java.lang.NumberFormatException: For input string: "5s"

Then edit these entries in hive-site.xml to drop the 's' suffix. The offending defaults look like this:


<property>
  <name>hive.metastore.client.connect.retry.delay</name>
  <value>5s</value>
</property>
<property>
  <name>hive.metastore.client.socket.timeout</name>
  <value>1800s</value>
</property>
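After dropping the suffix, the values are plain numbers, which Hive treats as seconds for these properties:

<property>
  <name>hive.metastore.client.connect.retry.delay</name>
  <value>5</value>
</property>
<property>
  <name>hive.metastore.client.socket.timeout</name>
  <value>1800</value>
</property>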