Some very basic notes on accessing Hive table data via Spark SQL. When I tried this, the SQL feature set supported by Spark SQL was not up to the level provided by Hive. I was using Hive 0.13 and Spark 1.2.
First, make sure that Spark is built with Hive support enabled. Building Spark is a separate issue that involves lots of considerations; see my post below on integrating native BLAS for MLlib.
Make sure that Spark has access to hive-site.xml. I copied mine to Spark's conf folder:
cp [/usr/lib | hadoop install]/hive/conf/hive-site.xml [spark]/conf
Set the YARN conf dir:
export YARN_CONF_DIR=/etc/hadoop/conf
Launch the Spark shell in YARN client mode:
./bin/spark-shell --verbose --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1
Use the SparkContext to create a HiveContext, then execute SQL like so (the older hql method is deprecated in favor of sql):
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("use spark_test")
hiveContext.sql("INSERT INTO TABLE ...")
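Reading data back out works the same way through the context. A minimal sketch, assuming it runs inside spark-shell (where `sc` is already defined) and that a table exists to query — `spark_test` and `my_table` below are hypothetical example names:

```scala
// Runs inside spark-shell, where `sc` (the SparkContext) is predefined.
// `spark_test` and `my_table` are hypothetical example names.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("use spark_test")

// On Spark 1.2 the result is a SchemaRDD (a DataFrame on 1.3+);
// collect() pulls the rows back to the driver.
val rows = hiveContext.sql("SELECT * FROM my_table LIMIT 10")
rows.collect().foreach(println)
```

Keep the LIMIT (or use take) while experimenting, since collect() materializes the whole result set on the driver.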
HDP Notes:
I’m using Spark 1.4 on Hortonworks 2.2 (Hadoop 2.6)
There is an additional step. Edit
$SPARK_HOME/conf/spark-defaults.conf
and add the following settings:
spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041
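The hdp.version value must match the HDP build actually installed on your cluster; 2.2.0.0-2041 is just the build on my machines. On HDP hosts, the hdp-select tool (assuming it is installed in the usual place) reports the installed build:

```shell
# Print the HDP build(s) installed on this host; use that value
# for -Dhdp.version in spark-defaults.conf.
hdp-select versions
```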
Lastly, there is a parsing problem with some of the strings in hive-site.xml. If you see this exception from Hive:
java.lang.RuntimeException: java.lang.NumberFormatException: For input string: "5s"
Then change these settings in hive-site.xml to drop the 's' suffix:
hive.metastore.client.connect.retry.delay (change 5s to 5)
hive.metastore.client.socket.timeout (change 1800s to 1800)
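The error happens because the older Hive client bundled with Spark reads these values with a plain integer parse, while newer Hive releases write them with a time-unit suffix. A minimal sketch of the symptom (not Hive's actual parsing code):

```scala
// Minimal illustration: a plain integer parse rejects a time-unit
// suffix, which is what produces the NumberFormatException above.
def parses(s: String): Boolean =
  try { s.toInt; true }
  catch { case _: NumberFormatException => false }

println(parses("5s"))  // false: "5s" is not a plain integer
println(parses("5"))   // true: the bare number parses fine
```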