Zen of Modelling
from http://modernstatisticalworkflow.blogspot.com
Your model should have some theoretical basis.
Your model, when simulated, should produce outcomes with a similar density to the observed values. Similarly, your model should not place weight on the impossible (like negative quantities, or binary outcomes that aren’t binary). It should place non-zero weight on possible but unlikely outcomes.
Think deeply about what is a random variable and what is not. A good rule of thumb: random variables are those things we do not know for certain out of sample. Your model is a joint density over the random variables.
You never have enough observations to distinguish one possible data generating process from another process that has different implications. You should model both, giving both models weight in decision-making.
The point of estimating a model on a big dataset is to estimate a rich model (one with many parameters). Using millions of observations to estimate a model with dozens of parameters is a waste of electricity.
Unless you have run a very large, very well-designed experiment, your problem has unobserved confounding information. If this problem does not occupy a lot of your time, you are doing something wrong.
Fixed effects normally aren’t. Mean reversion applies to most things, including unobserved information. Don’t be afraid to shrink.
Relationships observed in one group can almost always help us form better understanding of relationships in another group. Learn and use partial pooling techniques to benefit from this.
For decision-making, your estimated standard deviations are too small, your estimated degrees of freedom are too big, or you have confused one for the other. Remember, the uncertainty produced by your model is the amount of uncertainty you should have if your model is correct and the process you are modeling does not change.
You always have more information than exists in your data. Be a Bayesian, and use this outside information in your priors.

Bayesian Analysis
http://www.columbia.edu/~cjd11/charles_dimaggio/DIRE/styled-4/styled-11/code-8/#fnref2

Random Effects and Pooling
http://modernstatisticalworkflow.blogspot.com/2016/11/random-effects-partial-pooling-and.html

PPD
http://modernstatisticalworkflow.blogspot.com/2016/08/why-you-should-be-posterior-predictive.html

Hierarchical Partial Pooling
https://docs.pymc.io/notebooks/hierarchical_partial_pooling.html

sudo ln -s /usr/lib64/libgfortran.so.3 /usr/lib64/libgfortran.so
sudo ln -s /usr/lib64/libquadmath.so.0.0.0 /usr/lib64/libquadmath.so

Install rmate on the remote

gem install rmate

Install rsub in Sublime Text 3 via Package Control and set up port forwarding (rmate's default port is 52698) if you're using PuTTY.
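If you are connecting with plain ssh rather than PuTTY, the equivalent remote port forward (assuming rmate's default port of 52698, with a placeholder user and host) is:

ssh -R 52698:localhost:52698 user@remote-host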


Occasionally RStudio hangs when using the SparkR context. I surmise some resource is not being managed well. The symptom is a frozen UI a little while after running a few Spark jobs; the jobs themselves run fine.

Here’s a gist with my reset mechanism.

sudo rstudio-server stop
cd ~
rm -rf .rstudio/
rm -rf .RData
rm -rf .Rhistory
sudo rstudio-server start

Hadoop stress testing can be run from the hadoop-mapreduce-client-jobclient-tests.jar.
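Running the tests jar with no program name should print the list of available test programs; the HDP 2.2 path below is the one used in the examples further down:

hadoop jar /usr/hdp/2.2.0.0-2041/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar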

Here are the options available
DFSCIOTest: Distributed i/o benchmark of libhdfs.
DistributedFSCheck: Distributed checkup of the file system consistency.
JHLogAnalyzer: Job History Log analyzer.
MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures
NNdataGenerator: Generate the data to be used by NNloadGenerator
NNloadGenerator: Generate load on Namenode
NNstructureGenerator: Generate the structure to be used by NNdatagenerator
SliveTest: HDFS Stress Test and Live Data Verification.
TestDFSIO: Distributed i/o benchmark.
fail: a job that always fails
filebench: Benchmark SequenceFile(Input|Output)Format (block,record compressed and uncompressed), Text(Input|Output)Format (compressed and uncompressed)
largesorter: Large-Sort tester
loadgen: Generic map/reduce load generator
mapredtest: A map/reduce test check.
minicluster: Single process HDFS and MR cluster.
mrbench: A map/reduce benchmark that can create many small jobs
nnbench: A benchmark that stresses the namenode.
sleep: A job that sleeps at each map and reduce task.
testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce
testfilesystem: A test for FileSystem read/write.
testmapredsort: A map/reduce program that validates the map-reduce framework’s sort.
testsequencefile: A test for flat files of binary key value pairs.
testsequencefileinputformat: A test for sequence file input format.
testtextinputformat: A test for text input format.
threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills over maps with 1 spill

Here is an example of running the hdfs test;

First the write test. Note that the results file is written to a local directory; the test data itself is written to /benchmarks in HDFS.

hadoop jar /usr/hdp/2.2.0.0-2041/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 4 -fileSize 1GB -resFile /home/hdfs/test/out

Then run the read test

hadoop jar /usr/hdp/2.2.0.0-2041/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 4 -fileSize 1GB -resFile /home/hdfs/test/out

Granted, we don’t get the benefits of hdfs or YARN, but I occasionally do my development on Windows and find it useful to have a version of Spark available for running under the debugger or trying some things out.

First install (unzip) java, maven, scala, sbt, and a spark distro

Put together a DOS env script (ends with .cmd) that looks like this;

SET PATH=%PATH%;C:\JavaDev\maven-3.2.5\bin;C:\JavaDev\scala-2.10.5\bin;C:\JavaDev\sbt\
SET JAVA_HOME=C:\JavaDev\jdk1.7.79
SET SCALA_HOME=C:\JavaDev\scala-2.10.5
SET MAVEN_HOME=C:\JavaDev\maven-3.2.5\bin
SET SPARK_HOME=C:\JavaDev\spark-1.4.0
REM SET MAVEN_OPTS="-Xmx1024M -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

Run the spark build

mvn -DskipTests clean package

If Maven runs out of memory (it will), un-comment the MAVEN_OPTS line in the environment script or set the Java settings on the build command line:

mvn -DskipTests clean package -DargLine="-Xmx1524m"

I like to have all the sources available for debugging – here is a recipe for getting that set up. It works on Windows and Linux.

There are 2 ways to get the sources for all dependencies in Maven project.

1)
Specifying -DdownloadSources=true -DdownloadJavadocs=true at the command line when building.

2)
Open your settings.xml file (~/.m2/settings.xml). Add a profile section with the downloadSources/downloadJavadocs properties, then make sure activeProfiles activates it; an example invocation follows the snippets below.

<profiles>
<profile>
<id>downloadSources</id>
<properties>
<downloadSources>true</downloadSources>
<downloadJavadocs>true</downloadJavadocs>
</properties>
</profile>
</profiles>

<activeProfiles>
<activeProfile>downloadSources</activeProfile>
</activeProfiles>
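With the profile active (or using the option 1 flags), regenerating the Eclipse project pulls the sources and javadocs down, e.g. with the maven-eclipse-plugin used elsewhere in these notes:

mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true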

Discardable Memory and Materialized Queries
http://hortonworks.com/blog/dmmq/

Materialized views with adapters for MongoDB, Apache Drill, and Spark

http://calcite.incubator.apache.org/

Nice article on probabilistic methods for aggregation

Probabilistic Data Structures for Web Analytics and Data Mining

Nice article on Sketches

sketches1.pdf

sudo yum update
sudo yum groupinstall 'Server with GUI'
sudo systemctl set-default graphical.target
sudo yum install tigervnc-server
sudo cp /lib/systemd/system/vncserver@.service /etc/systemd/system/vncserver@.service
sudo vi /etc/systemd/system/vncserver@.service

Put ec2-user where you find the <USER> placeholder in the service file.

sudo systemctl daemon-reload

Then as user

vncpasswd

sudo firewall-cmd --zone=public --add-port=5900/tcp
sudo firewall-cmd --zone=public --add-port=5901/tcp
sudo firewall-cmd --zone=public --add-port=5902/tcp
sudo firewall-cmd --zone=public --add-port=5903/tcp
sudo systemctl start vncserver@:1.service

1 in the last line corresponds to port 5901
To start another session use
systemctl start vncserver@:2.service

Then connect via a VNC client, putting :1 (for session number 1) after the server IP/name.

e.g. ec2-XX-XX-XX-XX.compute-1.amazonaws.com:1, where XX-XX-XX-XX is the external IP address.

Ambari error – “Host Role in Invalid State”.

If the Ambari agent gets in a bad state you might encounter this error. Try restarting the Ambari agent on the offending nodes;

ambari-agent restart

If this fails try to reinstall the clients from the Ambari hosts dashboard


If you’re upgrading Ambari from 1.6.0 to 1.7.0 there will be some properties that don’t match up. This is a bigger pain to fix up.

Here are the commands to install R on RHEL 6.5

wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
yum localinstall epel-release-6-8.noarch.rpm
yum install R

When installing R on RHEL 6.6 I encountered the following dependency problem

--> Finished Dependency Resolution
Error: Package: R-core-devel-3.2.1-1.el6.x86_64 (epel)
Requires: libicu-devel
Error: Package: R-core-devel-3.2.1-1.el6.x86_64 (epel)
Requires: blas-devel >= 3.0
Error: Package: R-core-devel-3.2.1-1.el6.x86_64 (epel)
Requires: lapack-devel
Error: Package: R-core-devel-3.2.1-1.el6.x86_64 (epel)
Requires: texinfo-tex

Execute these commands in order as su

wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
yum localinstall epel-release-6-8.noarch.rpm
wget http://mirror.centos.org/centos/6/os/x86_64/Packages/lapack-devel-3.2.1-4.el6.x86_64.rpm
wget http://mirror.centos.org/centos/6/os/x86_64/Packages/blas-devel-3.2.1-4.el6.x86_64.rpm
wget http://mirror.centos.org/centos/6/os/x86_64/Packages/texinfo-tex-4.13a-8.el6.x86_64.rpm
wget http://mirror.centos.org/centos/6/os/x86_64/Packages/libicu-devel-4.2.1-9.1.el6_2.x86_64.rpm
sudo yum localinstall *.rpm
yum install R

Rscript

require(multicore)  # on current R versions, mclapply lives in the 'parallel' package instead
args <- commandArgs(trailingOnly = TRUE)
cores <- if (length(args) > 0) as.integer(args[1]) else 1
cat(sprintf("Multicore functions running on maximum %d cores\n", cores))

## Multicore functions running on maximum 1 cores
By default cores is set to 1, but when run as an Rscript I can specify how many cores I want to use.

Add mc.cores = cores to mclapply calls:

# Example processor hungry multicore operation
mats <- mclapply(1:500, function(x) matrix(rnorm(x*x),ncol = x) %*% matrix(rnorm(x*x), ncol = x), mc.cores = cores)

$ Rscript --vanilla R/myscript.R 12 &> logs/myscript.log &
Rscript runs the .R file as a standalone script, without going into the R environment. The --vanilla flag means that you run the script without calling in your .Rprofile (which is typically set up for interactive use) and without prompting you to save a workspace image.

## example #! script for a Unix-alike
#! /path/to/Rscript --vanilla --default-packages=utils
args <- commandArgs(TRUE)
res <- try(install.packages(args))
if(inherits(res, "try-error")) q(status=1) else q()

SparkR needs R and rJava.
install.packages("rJava")
You can check if rJava is installed correctly by running
library(rJava)
If you get no output after the above command, it means that rJava has been installed successfully. If you get an error message while installing rJava, you might need to do the following to configure Java with R: exit R, run the following command in a shell, and relaunch R to install rJava.
$ R CMD javareconf -e

The Spark API for launching R worker

public static org.apache.spark.api.r.BufferedStreamThread createRWorker(String rLibDir, int port)

Apache ZooKeeper is a distributed service providing application infrastructure for common distributed coordination tasks such as configuration management, distributed synchronization objects such as locks and barriers, and leader election. ZooKeeper can be used for cluster membership operations such as leader election and the adding and removing of nodes. In the Hadoop ecosystem, ZooKeeper is used for high-availability provisioning of the YARN ResourceManager, HDFS NameNode failover, the HBase Master, and the Spark Master.

The ZooKeeper service provides the abstraction of a set of data nodes, called znodes, organized into a hierarchical namespace. The hierarchy of znodes in the namespace provides the objects used to keep state information. Access to znodes is provided by paths in the hierarchy.

ZooKeeper uses the Zab distributed consensus algorithm, which is similar to classical Paxos. Zab and Paxos both follow a protocol in which a leader proposes values to the followers and then waits for acknowledgements from a quorum of followers before considering a proposal committed. Proposals include epoch numbers (ballot numbers in Paxos), which are unique version numbers.

In addition to configuration management, distributed locks, and group membership algorithms, ZooKeeper primitives can be used to implement:

Double barriers enable clients to synchronize the beginning and the end of a computation.
When enough processes, defined by the barrier threshold, have joined the barrier, processes start their computation and leave the barrier once they have finished.

Sometimes in distributed systems it is not always clear a priori what the final system configuration will look like. For example, a client may want to start a master process and several worker processes, but the starting of the processes is done by a scheduler, so the client does not know ahead of time information such as the addresses and ports that it can give the worker processes to connect to the master. We handle this scenario with ZooKeeper using a rendezvous znode.

Zookeeper Zab protocol messages are encapsulated in a QuorumPacket;
class QuorumPacket {
    int type; // Request, Ack, Commit, Ping, etc.
    long zxid;
    buffer data;
    vector authinfo; // only used for requests
}

The basic API for manipulating nodes.

create(path, data, flags): Creates a znode with path name path, stores data[] in it, and returns the name of the new znode. flags enables a client to select the type of znode (regular or ephemeral) and to set the sequential flag

delete(path, version): Deletes the znode path if that znode is at the expected version

getData(path, watch): Returns the data and meta-data, such as version information, associated with the znode

setData(path, data, version): Writes data[] to znode path if the version number is the current version of the znode

getChildren(path, watch): Returns the set of names of the children of a znode

sync(path): Waits for all updates pending at the start of the operation to propagate to the server that the client is connected to

Watches are used to monitor state changes. ZooKeeper watches are one-time triggers and due to the latency involved between getting a watch event and resetting of the watch, it’s possible that a client might lose changes done to a znode during this interval. In a distributed application in which a znode changes multiple times between the dispatch of an event and resetting the watch for events, developers must be careful to handle such situations in the application logic.
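To make the API above concrete, here is a minimal sketch using the standard ZooKeeper Java client. It assumes a server on localhost:2181; the class name and the /demo path are made up for the example, and the create call will fail if /demo already exists. It also marks where a one-time watch would be re-registered, per the note above.

import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZkApiSketch {
    public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);
        // The Watcher receives session events plus any znode watches we register.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
                // Watches are one-time triggers: to keep watching a znode you must
                // re-register it (another getData/exists/getChildren call) right here.
                System.out.println("event: " + event.getType() + " " + event.getPath());
            }
        });
        connected.await();

        // create(path, data, acl, createMode): a regular (persistent) znode.
        zk.create("/demo", "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // getData(path, watch, stat): true registers a one-time watch on /demo.
        Stat stat = new Stat();
        byte[] data = zk.getData("/demo", true, stat);

        // setData(path, data, version): conditional write against the version just read.
        zk.setData("/demo", "v2".getBytes(), stat.getVersion());

        // getChildren(path, watch) and delete(path, version); -1 skips the version check.
        List<String> children = zk.getChildren("/", false);
        System.out.println(new String(data) + " | children of /: " + children);
        zk.delete("/demo", -1);
        zk.close();
    }
}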

Spark settings can be set in a number of ways:

  • $SPARK_HOME/conf/spark-defaults.conf – make a copy of the template and edit this if the file does not exist.
  • On the command line when submitting jobs or starting up the shell
  • Directly on the SparkConf used to construct the SparkContext (see the sketch after this list).
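A minimal sketch of the third option with the Java API; the app name, local[*] master, and the 0.4 value are illustrative, and the two property names are the ones discussed in the sections below.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkConfSketch {
    public static void main(String[] args) {
        // Set properties on the SparkConf used to build the context.
        SparkConf conf = new SparkConf()
                .setAppName("conf-sketch")       // arbitrary name for this sketch
                .setMaster("local[*]")           // local mode, just for trying things out
                .set("spark.shuffle.spill", "false")
                .set("spark.storage.memoryFraction", "0.4");
        JavaSparkContext sc = new JavaSparkContext(conf);
        System.out.println("spark.shuffle.spill = " + conf.get("spark.shuffle.spill"));
        sc.stop();
    }
}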

Shuffling
Spark stores intermediate data on disk from a shuffle operation as part of its “under-the-hood” optimization. When Spark has to recompute a portion of an RDD graph, it may be able to truncate the lineage of the graph if the shuffle output is already on disk as a side effect of an earlier shuffle. This can happen even if the RDD is not cached or explicitly persisted. Set spark.shuffle.spill=false to turn this off if it is not needed.

Caching
The caching mechanism reserves a percentage of memory from each executor. This is specified by spark.storage.memoryFraction.

Partitioning
Use more partitions as data size increases

Broadcast Variables
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

Memory Leaks
Closing over objects in lambdas can cause memory leaks. Check the size of the serialized Spark task to make sure there are no leaks.

Some very basic beta on accessing Hive table data via Spark SQL. When I tried this, the SQL features supported by Spark SQL were not up to those provided by Hive. I was using Hive 0.13 and Spark 1.2.

First make sure that Spark is built with Hive enabled. Building Spark is a separate issue that involves lots of considerations; see my post below on integrating native BLAS for MLlib.

Make sure that Spark has access to hive-site.xml, I copied mine to Spark conf folder;
cp [/usr/lib | hadoop install]/hive/conf/hive-site.xml [spark]/conf

Set the YARN conf dir;

export YARN_CONF_DIR=/etc/hadoop/conf

Launch the spark shell in YARN mode

./bin/spark-shell --verbose --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1

Use the SparkContext to get a sql context and then execute sql as so;

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.hql("use spark_test")
hiveContext.hql("INSERT INTO TABLE ...")

HDP Notes;

I’m using Spark 1.4 on Hortonworks 2.2 (Hadoop 2.6)
There is an additional step

Edit
$SPARK_HOME/conf/spark-defaults.conf and add the following settings:
spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041

Lastly, there is a parsing problem with some of the strings in hive-site.xml
If you see this exception from Hive;
java.lang.RuntimeException: java.lang.NumberFormatException: For input string: “5s”

Then change these settings in hive-site.xml to not have the ‘s’:

<property>
  <name>hive.metastore.client.connect.retry.delay</name>
  <value>5s</value>
</property>

<property>
  <name>hive.metastore.client.socket.timeout</name>
  <value>1800s</value>
</property>

I’m setting up a Java / Scala / Spark development environment in Windows and encountered this annoying error where I could not get IntelliJ to interact with GitHub. There are two steps to achieving integration: one is to set up account credentials and the other is to enable git.

This describes the registration process

https://www.jetbrains.com/idea/help/registering-github-account-in-intellij-idea.html

If you run into this error

Cannot run program “git.exe”: CreateProcess error=2, The system cannot find the file specified

  • Download Github For Windows client and install it.
  • Don’t try to find the path to git.exe by right clicking on the GitHub desktop icon – it is a manifest file that gives no hint where the exe resides.
  • Add git.exe location to your “Path Variable”. The location you should add will probably be something like : C:\Users\Your_Username\AppData\Local\GitHub\PortableGit_ca477551eeb4aea0e4ae9fcd3358bd96720bb5c8\bin
  • OR Set the path to git in the git section of the IntellJ VCS setup

Sometimes the logging output from spark-shell can get annoying, or your results can get lost in the content.

Use the instructions below to set the logging level

  • In the conf folder find log4j.properties.template
  • make a copy of this file and rename it to log4j.properties
  • Edit the file, replacing INFO with WARN or ERROR (see the example line below)
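In the Spark 1.x template the line to change is the root category; set to WARN it reads roughly:

log4j.rootCategory=WARN, console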

I’m using Hadoop 2.6.0 – There is no 2.6 profile so use 2.4.

mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests clean package

Install gfortran

Link BLAS to /usr/lib/libblas.so.3 and LAPACK to /usr/lib/liblapack.so.3. If building your own BLAS / LAPACK, the instructions are to build without multithreading. I do not know what this translates to for use with Intel MKL.

Set up Java Scala Spark Hadoop environment variables – here’s my example

export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_72
export PATH=$JAVA_HOME/bin:$PATH
export SCALA_HOME=/home/bcampbell/spark/scala-2.10.4
export SCALA_BIN=$SCALA_HOME/bin
export PATH=$SCALA_BIN:$PATH
export SPARK_HOME=/home/bcampbell/spark/spark-1.2.1
export SPARK_BIN=$SPARK_HOME/bin
export PATH=$SPARK_BIN:$PATH

source /etc/hadoop/conf/hadoop-env.sh
source /etc/hive/conf/hive-env.sh
export HIVE_HOME=/usr/hdp/2.2.0.0-2041/hive
export HADOOP_HOME=/usr/hdp/2.2.0.0-2041/hadoop
export HCAT_HOME=/usr/hdp/2.2.0.0-2041/hive-hcatalog
export PATH=$PATH:$HIVE_HOME/bin
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/hdp/2.2.0.0-2041/hive/lib:/usr/hdp/2.2.0.0-2041/hive-hcatalog/share/hcatalog
export HADOOP_CONF_DIR=/usr/hdp/2.2.0.0-2041/hadoop/conf
#export YARN_CONF_DIR

Some further notes from developing a sequence file writer designed to store a large number of data files in the SequenceFile format.

There is a bug (in 2.6.0) for large values: the writer will not accept values greater than 2 GB and throws a java.lang.NegativeArraySizeException.

If your implementation returns keys and values as BytesWritable instances containing raw bytes, the first four bytes of the returned buffer will contain the run length, as defined by Writable in its java.io.DataOutput implementation. This 4-byte prefix must be stripped if you are after only the bytes that the original BytesWritable instance contained.

Do not call getBytes on the Writable if you want to avoid the issue above; call copyBytes on the BytesWritable value object.
Also, don't call write on the value in this case or you'll get the 4 extra bytes mentioned above.

Snippet :


// Assumes sequenceFileReader is an open SequenceFile.Reader, conf is a Hadoop
// Configuration, and log is the class logger.
Writable key = (Writable) ReflectionUtils.newInstance(
        sequenceFileReader.getKeyClass(), conf);
Writable value = (Writable) ReflectionUtils.newInstance(
        sequenceFileReader.getValueClass(), conf);
boolean next = true;
while (next) {
    try {
        if (sequenceFileReader.next(key, value)) {
            next = true;
        } else {
            next = false;
            continue;
        }
        // Write each value out to a local file named after its key.
        DataOutputStream output = new DataOutputStream(
                new FileOutputStream(key.toString()));
        BytesWritable bw = (BytesWritable) value;
        // copyBytes() returns exactly the stored bytes, with no length prefix.
        byte[] bytes = bw.copyBytes();
        int length = bytes.length;
        // Only one of these:
        output.write(bytes);
        // Can't do this -- value.write(output) would prepend the 4-byte length:
        // value.write(output);
        output.flush();
        output.close();
    } catch (Exception ex) {
        log.error(ex.toString());
    }
}


DistCp is a distributed copy of data from one cluster to another cluster.
It also supports the movement of data from S3 to HDFS and vice versa,
and it can also be used to copy data between directories inside the same cluster.

DistCp can move:
hdfs file
HBase table files
physical blocks

DistCp runs an internal MapReduce job to copy the files.

If you want to use S3 standalone, use s3n. s3n (the S3 native protocol) has a 5 GB file size limitation imposed by Amazon.

The NameNode must be set up to access S3:

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>MY-ID</value>
</property>

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>MY-SECRET</value>
</property>

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>MY-ID</value>
</property>

<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>MY-SECRET</value>
</property>

You can test the upload
hadoop fs -ls s3://bucket-name/

Example;
su hdfs
$hadoop distcp s3n://AccessKey:SecretAccessKey@bucketname/input hdfs://[cluster name ]:8020/use

There are two key parameters to set when compressing values in a sequence file; the compression codec and the compression type.

Compression type indicates whether records are compressed and, if so, whether they are record compressed or block compressed. Block compression may be more performant than record compression if there is similarity between records. Additionally, a compressed record may span multiple blocks.

The codecs are classified as splittable or not. Most compression algorithms work on a stream and cannot start decompressing mid-way, which is sub-optimal for HDFS, where data is stored in blocks.

Splittable means that hdfs blocks can be decompressed in parallel and blocks do not need to be co-located for sequencefile decompression.

Bzip2 is a splittable format. The other codecs available are gzip, zlib, and lz4. LZO is available but has to be plugged into your Hadoop cluster.

Generally, Bzip2 is more compute intensive and yields better space savings than the other formats. Obviously this will be data dependent.

You may encounter this error if you do not have the zlib native libraries accessible:
java.lang.IllegalArgumentException: SequenceFile doesn't work with GzipCodec without native-hadoop code!
Install zlib / zlib-devel and make sure the native Hadoop libraries are set up.

Here is a gist with code snippets to get started


// Other options for CompressionType are RECORD and BLOCK. I'm not sure that
// BLOCK will work with any codec other than bzip2.
CompressionType compressionType = CompressionType.NONE;

enum compressionCodecEnum { gzip, bzip2, none }
compressionCodecEnum compressionCodecType = compressionCodecEnum.bzip2;

// filePath, keyClass and valueClass are SequenceFile.Writer options set up elsewhere.
if (compressionCodecType == compressionCodecEnum.bzip2) {
    CompressionCodec codec = new BZip2Codec();
    SequenceFile.Writer.Option optCom =
            SequenceFile.Writer.compression(CompressionType.BLOCK, codec);
    sequenceFileWriter = SequenceFile.createWriter(conf, filePath, keyClass, valueClass, optCom);
}
if (compressionCodecType == compressionCodecEnum.gzip) {
    CompressionCodec codec = new GzipCodec();
    SequenceFile.Writer.Option optCom =
            SequenceFile.Writer.compression(CompressionType.RECORD, codec);
    sequenceFileWriter = SequenceFile.createWriter(conf, filePath, keyClass, valueClass, optCom);
}
if (compressionCodecType == compressionCodecEnum.none) {
    sequenceFileWriter = SequenceFile.createWriter(conf, filePath, keyClass, valueClass);
}


This error was encountered starting up the NodeManager (YARN) after a Hadoop 2.4 re-install using Ambari 1.6.

It turns out that the old ResourceManager gets listed in a YARN exclude conf file. This entry needs to be removed from the yarn.exclude file in order to get the NodeManager to connect to the ResourceManager.

I.e. on a pseudo-distributed cluster (HDP 2.1):

cd /etc/hadoop/conf/
more yarn.exclude
localhost.localdomain

Then remove the line localhost.localdomain

and restart the YARN service in Ambari.

Occasionally the yum metadata store will get corrupted. You may see an error like this;

[Errno -1] Metadata file does not match checksum
Trying other mirror.
Error: failure: [Errno 256] No more mirrors to try.You could try using --skip-broken to work around the problem

1) Try this first
yum clean metadata

2) If that does not ameliorate the issue then clean all
yum clean all

Generally it is not advisable to use --skip-broken.

1) wget http://mirror.sdunix.com/apache/maven/maven-3/3.2.5/binaries/apache-maven-3.2.5-bin.tar.gz

2) tar xvf apache-maven-3.2.5-bin.tar.gz

3) mv apache-maven-3.2.5 /usr/local/apache-maven

4) Add env variables to your ~/.bashrc file

export M2_HOME=/usr/local/apache-maven
export M2=$M2_HOME/bin
export PATH=$M2:$PATH

Then run source ~/.bashrc

Some notes on building Spark for Hive. My ultimate goal is to run Hive 0.14 on Spark 1.12

Notes from Spark 1.1;
How to make spark sql run on apache-hive 0.13.0

1. build org.spark-project.hive 0.13.0 related jars from apache-hive
a. checkout out code from apache hive release-0.13.0
b. change protobuf.version in pom.xml from 2.5.0 to 2.4.1 since spark 1.0 uses the 2.4.1 version of protobuf
c. use protobuf-java-2.4.1-shaded to generate a new ql/src/gen/protobuf/gen-java/org/apache/hadoop/hive/ql/io/orc/OrcProto.java
from ql/src/protobuf/org/apache/hadoop/hive/ql/io/orc/orc_proto.proto
d. build your own 0.13.0 jars to be used in spark sql
org.spark-project.hive:hive-metastore:0.13.0
org.spark-project.hive:hive-exec:0.13.0
org.spark-project.hive:hive-serde:0.13.0

2. make it compilable with hive 0.13 in spark sql
you can apply the patch in attachment to solve the incompatibility issues

————–
The Eclipse “Scala IDE” 4.0 RC1 can now support projects using different versions of Scala, which is convenient for Spark’s current 2.10.4 support and emerging 2.11 support.

Step 1)
wget -nv http://public-repo-1.hortonworks.com/HDP/centos6/2.x/GA/2.2.0.0/hdp.repo -O /etc/yum.repos.d/hdp.repo

Step 2)Find the source package
yum search hadoop | grep source
hadoop-source.noarch : hadoop-source HDP virtual package
hadoop-yarn-resourcemanager.noarch : hadoop-yarn-resourcemanager HDP virtual
hadoop_2_2_0_0_2041-source.x86_64 : Source code for Hadoop
hadoop_2_2_0_0_2041-yarn-resourcemanager.x86_64 : YARN Resource Manager

Step 3) Install
yum install hadoop_2_2_0_0_2041-source.x86_64

The code ends up at
/usr/hdp/2.2.0.0-2041/hadoop/src/

For development you may find yourself in the situation where you’re manipulating hdfs files and the replication factor is greater than the number of nodes. I’ve previously noted some of the errors that one can encounter in this situation. To remedy this we set dfs.replication to 1 in our hdfs-site.xml config file. Recently I encountered a strange error when trying to append to an existing hdfs file;

java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[10.22.7.81:50010], original=[10.22.7.81:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:960)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1026)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1175)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:531)

If you encounter this error – read here;

https://issues.apache.org/jira/browse/HDFS-4600

And then set
dfs.client.block.write.replace-datanode-on-failure.enable to false
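In hdfs-site.xml that is the following property (only worth doing on a very small development cluster like the one described here):

<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>false</value>
</property>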

Here’s what the documentation says about this parameter;

dfs.client.block.write.replace-datanode-on-failure.enable (default: true)
If there is a datanode/network failure in the write pipeline, DFSClient will try to remove the failed datanode from the pipeline and then continue writing with the remaining datanodes. As a result, the number of datanodes in the pipeline is decreased. The feature is to add new datanodes to the pipeline. This is a site-wide property to enable/disable the feature. When the cluster size is extremely small, e.g. 3 nodes or less, cluster administrators may want to set the policy to NEVER in the default configuration file or disable this feature. Otherwise, users may experience an unusually high rate of pipeline failures since it is impossible to find new datanodes for replacement. See also dfs.client.block.write.replace-datanode-on-failure.policy.

dfs.client.block.write.replace-datanode-on-failure.policy (default: DEFAULT)
This property is used only if the value of dfs.client.block.write.replace-datanode-on-failure.enable is true. ALWAYS: always add a new datanode when an existing datanode is removed. NEVER: never add a new datanode. DEFAULT: Let r be the replication number. Let n be the number of existing datanodes. Add a new datanode only if r is greater than or equal to 3 and either (1) floor(r/2) is greater than or equal to n; or (2) r is greater than n and the block is hflushed/appended.

I suspect that when appending a block replace is required and that this is the source of the problem.

I got this error when writing a Sequence File reader / writer class in Java. The code worked in Eclipse, but not on the command line. This causes me concern, because I was very careful to set up the POM correctly. In any event, initial research on the internet indicated that the problem was a classpath issue. I tried without success to resolve the issue.

This instructions in this post were able to get me past the issue;
http://stackoverflow.com/questions/17265002/hadoop-no-filesystem-for-scheme-file

The explanation made sense to me.

Configuration conf = new Configuration();
// Bind the hdfs:// and file:// schemes explicitly so that merged META-INF/services
// entries in a fat jar can't hide one of the FileSystem implementations.
conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());

Fix missing jar error in Hive after upgrade to HDP 2.2

I got this error after upgrading to HDP 2.2

java.io.FileNotFoundException: File file:/usr/lib/hcatalog/share/hcatalog/hcatalog-core.jar does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)

This jar is now located in a new folder and apparently the location was not set in the new hive-env.sh.

To fix this change the jar location in /etc/hive/conf/hive-env.sh

export HIVE_AUX_JARS_PATH=/usr/hdp/2.2.0.0-2041/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar

Note that you may or may not have to change the setting for HIVE_AUX_JARS_PATH. This is for Hive udf’s and custom SerDe’s.

Note there are three locations for hive-env.sh

/etc/hive/conf.dist/hive-env.sh
/etc/hive/conf.server/hive-env.sh
/etc/hive/conf/hive-env.sh

If you’re using Ambari – there is a template for hive-env – set the paths here otherwise you’ll find your changes stepped on when the service is restarted.

Upgrading an existing data node in a Hadoop cluster can be a difficult task. Edge nodes that do not have HDFS components and are purely for client access can be upgraded by decommissioning the node and redeploying with Ambari. Decommission the node first in Ambari; this does not remove the software. Then run yum remove on all components (this is overkill):
yum remove hcatalog\*
yum remove hive\*
yum remove hbase\*
yum remove zookeeper\*
yum remove oozie\*
yum remove pig\*
yum remove knox\*
yum remove snappy\*
yum remove hadoop-lzo\*
yum remove hadoop\*
yum remove extjs-2.2-1 mysql-connector-java-5.0.8-1\*
yum erase ambari-agent
yum erase ambari-server

In deploying the first time, you should have set up password-less ssh on the client node;

https://ambari.apache.org/1.2.1/installing-hadoop-using-ambari/content/ambari-chap1-5-2.html

Go get the same private key from the node running Ambari Server and use it to re-deploy via Ambari. You should get some warnings once the connection is made; use the python cleanup script below to clean things up in bulk and then re-run the host checks.

python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py --silent --skip=users


These users need to be removed;
userdel ambari-qa
userdel oozie
userdel hcat
userdel hive
userdel yarn
userdel hdfs
userdel nagios
userdel mapred
userdel zookeeper
userdel tez
userdel rrdcached
userdel falcon
userdel sqoop

When creating a Apache Spark Hive context
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

in spark1.1.1 I ran into the following exception;

java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.security.ProxyUserAuthenticator

It turns out that ProxyUserAuthenticator was introduced in Hive 0.13, but Spark 1.1.1 SQL is based on Hive 0.12. To fix the error, locate the hive.security.authenticator.manager setting in hive-site.xml and change the value from org.apache.hadoop.hive.ql.security.ProxyUserAuthenticator to org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator.
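The corrected entry in hive-site.xml then looks like this (names and values taken from the paragraph above):

<property>
  <name>hive.security.authenticator.manager</name>
  <value>org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator</value>
</property>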

To get the best performance out of Spark / MLlib you need to ensure you’re using the native BLAS / LAPACK.

In this post we review the caveats of Java-to-native interop as it applies to Spark / MLlib. Spark is built on Scala. MLlib is a Spark machine learning library that utilizes Breeze. Breeze is built on netlib-java and jblas, and netlib-java is a wrapper for low-level BLAS, LAPACK and ARPACK that performs as fast as the C / Fortran interfaces.

netlib-java uses BLAS/LAPACK/ARPACK from:
1) delegating builds that use machine optimized system libraries
2) self-contained native builds using the reference Fortran from [netlib.org](http://www.netlib.org)
3) Java [F2J](http://icl.cs.utk.edu/f2j/) to ensure full portability on the JVM

The [JNILoader](https://github.com/fommil/jniloader) will attempt to load the implementations in this order automatically. The last option should be avoided for performance reasons, with some caveats about interop to be discussed below.

1) yum install atlas-devel
This will install the native BLAS LAPACK so’s.

These are only generic pre-tuned builds. To get optimal performance for a specific machine, it is best to compile Atlas locally

http://sourceforge.net/projects/math-atlas/files/latest/download

Install the shared libraries into a folder that is seen by the runtime linker (e.g. add your install folder to `/etc/ld.so.conf` then run `ldconfig`) ensuring that `libblas.so.3` and `liblapack.so.3`
exist and point to your optimal builds.

The Intel MKL may also be used by creating symbolic links from `libblas.so.3` and `liblapack.so.3` to `libmkl_rt.so`.

netlib-java is deployed as arpack_combined_all-0.1.jar with the Java implementations. To use native .so’s we might [I have conflicting documentation on this right now] have to build this and the JNI library.

Get netlib-java
git clone https://github.com/fommil/netlib-java.git netlib-java

Get Breeze
https://github.com/scalanlp/breeze.git

Building Breeze requires sbt -a Scala build tool.

Some reading on the JNI overhead;

http://blog.mikiobraun.de/2008/10/matrices-jni-directbuffers-and-number.html

http://mikiobraun.blogspot.com/2008/08/benchmarking-javac-vs-ecj-on-array.html

Now to getting the right BLAS LAPACK installed and connected to netlib

Out of the box,
import com.github.fommil.netlib.BLAS;
..
System.out.println(BLAS.getInstance().getClass().getName());

will give you this

Dec 05, 2014 8:10:27 AM com.github.fommil.netlib.BLAS
WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
Dec 05, 2014 8:10:27 AM com.github.fommil.netlib.BLAS
WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
com.github.fommil.netlib.F2jBLAS

The first step is to realize that HDP 2.1 is Apache Hadoop 2.4. Get Spark here; https://spark.apache.org/downloads.html

Spark is built on Scala, so get Scala;
http://www.scala-lang.org/download/2.9.3.html

Set $SCALA_HOME environment variable i.e.;
export SCALA_HOME=/usr/share/scala-2.9.3

and run the Spark build from the root of the spark directory;
./sbt/sbt -Dhadoop.version=2.4.0 -Pyarn assembly
This step takes some time.

To use the netlib-java JNI wrapper to Native Blas
build with -Pnetlib-lgpl

Now Tell Spark where your yarn-site is;
export YARN_CONF_DIR=/etc/hadoop/conf

MLlib is an alternative to Mahout for developing distributed machine learning algorithms on Hadoop.

Some of the features;
• Sparse data
• Classification and regression tree (CART)
• linear models (SVMs, logistic regression, linear regression)
• decision trees
• SVD and PCA
• L-BFGS
• Model evaluation
• Discretization

Mahout has plans to integrate with Spark in the future via Scala & Spark Bindings, but for now Mahout uses the YARN/MapReduce engine.
The Scala & Spark Bindings for Mahout are a Scala DSL and algebraic optimizer bound to in-core and distributed computations.

MLlib uses the linear algebra package Breeze, which depends on netlib-java and jblas.

jblas in turn depends on native Fortran routines, so to run MLlib we’ll need the gfortran runtime library.

To use native libraries from netlib-java, build Spark with -Pnetlib-lgpl or include com.github.fommil.netlib:all:1.1.2 as a dependency of your project.
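As a Maven dependency that looks roughly like the following; the coordinates are taken from the sentence above, and the type element reflects that the all module is (as far as I can tell) an aggregating pom:

<dependency>
  <groupId>com.github.fommil.netlib</groupId>
  <artifactId>all</artifactId>
  <version>1.1.2</version>
  <type>pom</type>
</dependency>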

“If you want to use optimized BLAS/LAPACK libraries such as OpenBLAS, please link its shared libraries to /usr/lib/libblas.so.3 and /usr/lib/liblapack.so.3, respectively. BLAS/LAPACK libraries on worker nodes should be built without multithreading.”

MLlib can be used in Python via NumPy (use version 1.4 or newer).

JBLAS;
http://mikiobraun.github.io/jblas/

I’m curious whether the Intel MKL can be used for native BLAS / LAPACK.

If Ambari is running decommission all nodes, and follow these steps.

To completely remove all components of HW 2.1 Hadoop distribution;
yum remove hcatalog\*
yum remove hive\*
yum remove hbase\*
yum remove zookeeper\*
yum remove oozie\*
yum remove pig\*
yum remove knox\*
yum remove snappy\*
yum remove hadoop-lzo\*
yum remove hadoop\*
yum remove extjs-2.2-1 mysql-connector-java-5.0.8-1\*
yum erase ambari-agent
yum erase ambari-server

 
Then to reinstall we execute
wget http://public-repo-1.hortonworks.com/ambari/centos5/1.x/updates/1.6.1/ambari.repo
cp ambari.repo /etc/yum.repos.d
yum install ambari-server
ambari-server setup
ambari-server start
ambari-server status

If you get
SELinux status is ‘enabled’
SELinux mode is ‘permissive’
WARNING: SELinux is set to ‘permissive’ mode and temporarily disabled.
OK to continue [y/n] (y)?

To permanently turn off seLinux;
From the command line, you can edit the /etc/sysconfig/selinux file. This file is a symlink to /etc/selinux/config. The configuration file is self-explanatory. Changing the value of SELINUX or SELINUXTYPE changes the state of SELinux and the name of the policy to be used the next time the system boots.

There are Ambari config files left about after removal of the service, if you run into trouble with the installation procedure below, then grep for ambari config files and remove them.

Also, try this
ambari-server reset

If this error is encountered in a CentOS install of Hortonworks via Ambari, then you will need to clean out your yum metadata cache. During its normal use yum creates a cache of metadata and packages. This cache can take up a lot of space. The yum clean command allows you to clean up these files. All the files yum clean acts on are normally stored in /var/cache/yum.
yum clean all

To overcome the Samba mount error
mount error(12): Cannot allocate memory
Refer to the mount.cifs(8) manual page (e.g. man mount.cifs)

on CentOS connected to Windows 7 one can restart the Server service. This only provides a temporary fix. Also, the memory error can occur in the middle of a large file transfer. A more permanent fix is to make the following registry modifications.

Run RegEdit.exe;

Navigate to
HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management
Set LargeSystemCache key to 1

Navigate to
HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters
Set Size to 3

Restart the Server service.

These are notes from getting ports open and setup for Hadoop

Turn Off Firewall;
service iptables save
service iptables stop
chkconfig iptables off

Hive is supposed to be running on 10000, but I found this;

nmap localhost
...
10000/tcp open snet-sensor-mgmt

See what app is listening on port (8080 for example);
lsof -i :8080

Testing ports: use nc, the command which runs netcat.
Can you listen at 9000? Try nc -l 9000 and, in another console, nc localhost 9000; all input will then be echoed in the other console if the connection is good. Here is Ambari responding:
nc localhost 8080
Hi
HTTP/1.1 400 Bad Request
Content-Length: 0
Connection: close
Server: Jetty(7.6.7.v20120910)

wxHexEditor is a good Linux replacement for HxD.


sudo yum install libtool gcc-c++ wxGTK-devel
svn checkout svn://svn.code.sf.net/p/wxhexeditor/code/trunk wxHexEditor
cd wxHexEditor/
make OPTFLAGS="-fopenmp"

I've been struggling with accessing Hive table data from MR. There were several stumbling blocks;

  • Making sure that the right libraries were being used by Maven.
  • Getting the correct hive-site.xml picked up by the configuration mechanism.
  • Sorting out differences between the old JobConf api and the and new YARN job api.

JobConf and everything in the org.apache.hadoop.mapred package are part of the old API used to write Hadoop jobs; Job and everything in the org.apache.hadoop.mapreduce package are the new API for writing Hadoop jobs. The main points here are to change mapred packages to mapreduce and to use Configuration instead of JobConf.

Here's a good reference :

http://hadoopbeforestarting.blogspot.de/2012/12/difference-between-hadoop-old-api-and.html

Add the path to a hive-site.xml to the project java class path if you're developing in Eclipse and want to run in standalone / pseudo-distributed mode.

Initially, it was not clear how to set up the Maven dependencies for this project. I needed to pull in Hive, but the Hortonworks repo did not have the right version. I ended up using both Hortonworks and Maven Central, pulling the Hive dependencies from Maven Central.

I'll revisit this when it's all up and running.

 

Here are the Hive dependencies

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.4.0.2.1.5.0-695</version>
</dependency>

<dependency>
  <groupId>org.apache.hive.hcatalog</groupId>
  <artifactId>hive-hcatalog-core</artifactId>
  <version>0.13.1</version>
</dependency>

And the repos;

<repository>
  <id>central</id>
  <name>Maven Repository Switchboard</name>
  <url>http://repo1.maven.org/maven2</url>
</repository>

<repository>
  <id>HDPReleases</id>
  <name>HDP Releases</name>
  <url>http://repo.hortonworks.com/content/repositories/releases/</url>
</repository>

Update: I keep running into a compatibility problem with my Hive MapReduce read code;
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
From what I gather, the error comes up from mixing MR1 and YARN code, but it’s not clear if that’s the cause of this problem. I tried to run the HiveRead job outside of Eclipse to see if the error went away. Here are some steps I had to take to get the Hive variables set up.

source /etc/alternatives/hadoop-conf/hadoop-env.sh
source /etc/alternatives/hive-conf/hive-env.sh
export HIVE_HOME=/usr/lib/hive
export PATH=$PATH:$HIVE_HOME/bin
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/lib/hive/lib:/etc/alternatives/hive-conf

NOTES: I'm using maven /Eclipse

This error is encountered when building the jar without including dependencies
hadoop jar HiveRead.jar com.bcampbell.hadoopproject.App :
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf

I got this error after adding data to hdfs.  Additional space was allocated, but the region server did not recognize it.

Run this to check the files system;

hadoop fsck /user

The issue is that the hdfs filesystem was set up on /hadoop which was on a partition with very little space. When I tried to add a data directory in /home, Ambari complained. So, to keep it simple I added an additional disk and put the hadoop data there.
Here are commands to explore disks and partitions, create a new partition, and set the permissions. I’ll revisit later to comment ;

fdisk -l
pvs
vgs
lvs
/sbin/mkfs.ext3 -L /hadoopdata /dev/sdb
mkdir /hadoopdata
mount -t ext3 /dev/sdb /hadoopdata/
stat -c "%a" /hadoop
chmod 755 /hadoopdata/

Update:
For some reason Ambari would not pick up the additional data directory so I had to resort to an online expansion of the original LVM volume.
I re-purposed the extra drive as an LVM partition
(http://www.linuxuser.co.uk/features/resize-your-disks-on-the-fly-with-lvm)
fdisk /dev/sdb (here make a primary partition with the same filesystem, ext4, as the other LVM partitions)
pvcreate /dev/sdb1
vgextend VolGroup /dev/sdb1
lvextend -L +196G /dev/mapper/VolGroup-lv_root

At this point I thought I was all done, but the filesystem needs to be expanded to see the extra space. This is dangerous, especially since it was the root volume, but it went well.

resize2fs /dev/mapper/VolGroup-lv_root

Lastly, I want to point out that the hdfs-site.xml setting
dfs.datanode.du.reserved
is only used for checks. I was confused and thought that it would reserve space. Setting this to 0 removes the check.
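In hdfs-site.xml that setting is:

<property>
  <name>dfs.datanode.du.reserved</name>
  <value>0</value>
</property>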

When deploying a standalone Hadoop you may encounter the bad datanode error. I was able to ingest data through HDFS, but when inserting a file larger than one block I encountered this error. To resolve it, set the replication to 1. The default is 3, the minimum for true high-availability failover capability. I set up my environment for development only, so in hdfs-site.xml set:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

Note, if you’re using Ambari, do this through the config section of the hdfs service.

yum install hue

Follow the setups instructions at;
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap-hue.html

/etc/init.d/hue start

If there are problems with the port, diagnose with;

iptables -A INPUT -p tcp --dport 8000 -j ACCEPT
netstat -an | grep 8000
nc 127.0.0.1 8000 < /dev/null; echo $?

I had to change the port to 8087 to get it to work on CentOS 6.5. Edit /etc/hue/conf/hue.ini for this.

Follow the directions in the Hortonworks Ambari doc.

Use the fully qualified name from
hostname -f

Make sure it’s in
/etc/hosts
and
/etc/sysconfig/network

Execute
setenforce 0

Check
umask 0022

In
/etc/yum/pluginconf.d/refresh-packagekit.conf
set
enabled=0

There is an OpenSSL bug in CentOS 6.5.
If you cannot connect to the ambari-agent during setup, update openssl:
rpm -qa | grep openssl
yum upgrade openssl

For Manual Ambari Agent connection follow these steps

yum install epel-release
yum install ambari-agent

vi /etc/ambari-agent/conf/ambari-agent.ini

[server]
hostname={your.ambari.server.hostname}
url_port=4080
secured_url_port=8443

ambari-agent start

Once you connect, if there are issues Ambari will report them; try running the python script
python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py --silent --skip=users

NTPD Setup: Install and Configure NTP to Synchronize the System Clock
yum install ntp ntpdate ntp-doc
chkconfig ntpd on
ntpdate pool.ntp.org
/etc/init.d/ntpd start

Turn off iptables while Ambari runs; turn it back on later.
/etc/init.d/iptables stop

Here are some setup instructions for Mahout and R Studio Server on Hortonworks Sandbox.

Run the commands below on the Sandbox VM.


yum install mahout

rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
yum -y install git wget R
ln -s /etc/default/hadoop /etc/profile.d/hadoop.sh
cat /etc/profile.d/hadoop.sh | sed 's/export //g' > ~/.Renviron
wget http://download2.rstudio.org/rstudio-server-0.97.332-x86_64.rpm
sudo yum install --nogpgcheck rstudio-server-0.97.332-x86_64.rpm

The default port RStudio Server runs on is 8787; you can determine the IP address of the VM using ifconfig. Open a browser on the host OS and navigate to <vm-ip>:8787.

If you’re using Oracle VM VirtualBox and RStudio won’t open in the browser, then navigate to the network settings and do a port forward on 8787.

Check and run the server
sudo rstudio-server verify-installation

UPDATE – 1/2015: For installation on RHEL 6.6 you may need to install the following;

wget http://mirror.centos.org/centos/6/os/x86_64/Packages/lapack-devel-3.2.1-4.el6.x86_64.rpm http://mirror.centos.org/centos/6/os/x86_64/Packages/blas-devel-3.2.1-4.el6.x86_64.rpm http://mirror.centos.org/centos/6/os/x86_64/Packages/libicu-devel-4.2.1-9.1.el6_2.x86_64.rpm http://mirror.centos.org/centos/6/os/x86_64/Packages/texinfo-tex-4.13a-8.el6.x86_64.rpm
sudo yum localinstall *.rpm

First install Maven – the entire Apache ecosystem is built with Maven. Then we’ll be editing the pom.xml.

To set up a Maven project run this;
mvn archetype:generate \
-DarchetypeGroupId=org.apache.maven.archetypes \
-DarchetypeArtifactId=maven-archetype-quickstart \
-DgroupId=com.bcampbell.hadoopproject \
-DartifactId=wordcount

Then edit the pom file that’s generated. This will then in turn be used to generate an Eclipse workspace.

Here’s the pom file;


<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.bcampbell.hadoopproject</groupId>
<artifactId>wordcount</artifactId>
<version>0.0.1</version>
<packaging>jar</packaging>
<name>wordcount</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<hadoop.version>2.3.0-cdh5.1.2</hadoop.version>
</properties>
<build>
<pluginManagement>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>2.3.2</version>
<configuration>
<source>1.7</source>
<target>1.7</target>
</configuration>
</plugin>
</plugins>
</pluginManagement>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins </groupId>
<artifactId>maven-eclipse-plugin</artifactId>
<version>2.9</version>
<configuration>
<projectNameTemplate>
${project.artifactId}
</projectNameTemplate>
<buildOutputDirectory>
eclipse-classes
</buildOutputDirectory>
<downloadSources>true</downloadSources>
<downloadJavadocs>false</downloadJavadocs>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>1.7.1</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-eclipse-plugin</artifactId>
<version>2.9</version>
<configuration>
<buildOutputDirectory>eclipse-classes</buildOutputDirectory>
<downloadSources>true</downloadSources>
<downloadJavadocs>false</downloadJavadocs>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
<scope>provided</scope>
</dependency>
</dependencies>
<repositories>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
</repositories>
</project>


Check the config and generate a build;
mvn validate
mvn compile
mvn package

After that – generate the eclipse workspace;
mvn -Declipse.workspace=eclipse_workspace eclipse:configure-workspace eclipse:eclipse

And Let Eclipse know about Maven
Window -> Preferences
Java -> Build Path -> Classpath Variables -> New
name will be M2_REPO
path will be something like ~/.m2
Click the OK button twice

Use the command like to check the setup;
java -cp wordcount-0.0.1.jar com.bcampbell.hadoopproject.App

Run the jar with the wordcount
hadoop jar wordcount-0.0.1.jar com.bcampbell.hadoopproject.WordCount /user/bcampbell/input output44

Sadly I’m getting many missing libraries in the Eclipse workspace. These are specified in Maven but did not get downloaded; they are all the Hadoop jars. There was a step for this, but I think I got it wrong. I’ll update later when I get this working.

UPDATE – I changed the hadoop dependency to;

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.3.0-cdh5.1.2</version>
</dependency>

This resolved the missing jars in the .m2 repository directory.

To create directories in my install I had to take these steps;
[root@localhost bcampbell]# sudo -u hdfs hadoop fs -mkdir /user
[root@localhost bcampbell]# sudo -u hdfs hadoop fs -mkdir /user/bcampbell
[root@localhost bcampbell]# sudo -u hdfs hadoop fs -mkdir /user/bcampbell/input
[root@localhost bcampbell]# sudo -u hdfs hadoop fs -chown bcampbell /user/bcampbell
[root@localhost bcampbell]# sudo -u hdfs hadoop fs -chown bcampbell /user/bcampbell/input

First take a look to see what’s in the jar
[bcampbell@localhost input]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.1.2.jar
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.

Here’s the setup and output for running the grep example;


hadoop fs -put /etc/hadoop/conf/*.xml input
[bcampbell@localhost ~]$ hadoop fs -ls input
Found 7 items
-rw-r--r-- 1 bcampbell supergroup 507105 2014-09-07 15:55 input/Milton_ParadiseLost.txt
-rw-r--r-- 1 bcampbell supergroup 246679 2014-09-07 15:55 input/WilliamYeats.txt
-rw-r--r-- 1 bcampbell supergroup 2133 2014-09-07 15:58 input/core-site.xml
-rw-r--r-- 1 bcampbell supergroup 2324 2014-09-07 15:58 input/hdfs-site.xml
-rw-r--r-- 1 bcampbell supergroup 246679 2014-09-07 15:56 input/inputWC
-rw-r--r-- 1 bcampbell supergroup 1549 2014-09-07 15:58 input/mapred-site.xml
-rw-r--r-- 1 bcampbell supergroup 2375 2014-09-07 15:58 input/yarn-site.xml
[bcampbell@localhost ~]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'
14/09/07 16:00:07 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/09/07 16:00:07 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
14/09/07 16:00:07 INFO input.FileInputFormat: Total input paths to process : 7
14/09/07 16:00:08 INFO mapreduce.JobSubmitter: number of splits:7
14/09/07 16:00:09 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1410054700839_0002
14/09/07 16:00:09 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
14/09/07 16:00:09 INFO impl.YarnClientImpl: Submitted application application_1410054700839_0002
14/09/07 16:00:09 INFO mapreduce.Job: The url to track the job: http://localhost.localdomain:8088/proxy/application_1410054700839_0002/
14/09/07 16:00:09 INFO mapreduce.Job: Running job: job_1410054700839_0002
14/09/07 16:00:18 INFO mapreduce.Job: Job job_1410054700839_0002 running in uber mode : false
14/09/07 16:00:18 INFO mapreduce.Job: map 0% reduce 0%
14/09/07 16:00:23 INFO mapreduce.Job: map 29% reduce 0%
14/09/07 16:00:24 INFO mapreduce.Job: map 43% reduce 0%
14/09/07 16:00:25 INFO mapreduce.Job: map 57% reduce 0%
14/09/07 16:00:26 INFO mapreduce.Job: map 100% reduce 0%
14/09/07 16:00:30 INFO mapreduce.Job: map 100% reduce 100%
14/09/07 16:00:30 INFO mapreduce.Job: Job job_1410054700839_0002 completed successfully
14/09/07 16:00:30 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=330
FILE: Number of bytes written=740425
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1009700
HDFS: Number of bytes written=470
HDFS: Number of read operations=24
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=7
Launched reduce tasks=1
Data-local map tasks=7
Total time spent by all maps in occupied slots (ms)=20069
Total time spent by all reduces in occupied slots (ms)=3482
Total time spent by all map tasks (ms)=20069
Total time spent by all reduce tasks (ms)=3482
Total vcore-seconds taken by all map tasks=20069
Total vcore-seconds taken by all reduce tasks=3482
Total megabyte-seconds taken by all map tasks=20550656
Total megabyte-seconds taken by all reduce tasks=3565568
Map-Reduce Framework
Map input records=27113
Map output records=10
Map output bytes=304
Map output materialized bytes=366
Input split bytes=856
Combine input records=10
Combine output records=10
Reduce input groups=10
Reduce shuffle bytes=366
Reduce input records=10
Reduce output records=10
Spilled Records=20
Shuffled Maps =7
Failed Shuffles=0
Merged Map outputs=7
GC time elapsed (ms)=323
CPU time spent (ms)=6260
Physical memory (bytes) snapshot=2039488512
Virtual memory (bytes) snapshot=5680246784
Total committed heap usage (bytes)=1610612736
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1008844
File Output Format Counters
Bytes Written=470
14/09/07 16:00:30 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/09/07 16:00:30 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
14/09/07 16:00:30 INFO input.FileInputFormat: Total input paths to process : 1
14/09/07 16:00:30 INFO mapreduce.JobSubmitter: number of splits:1
14/09/07 16:00:30 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1410054700839_0003
14/09/07 16:00:30 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
14/09/07 16:00:30 INFO impl.YarnClientImpl: Submitted application application_1410054700839_0003
14/09/07 16:00:30 INFO mapreduce.Job: The url to track the job: http://localhost.localdomain:8088/proxy/application_1410054700839_0003/
14/09/07 16:00:30 INFO mapreduce.Job: Running job: job_1410054700839_0003
14/09/07 16:00:37 INFO mapreduce.Job: Job job_1410054700839_0003 running in uber mode : false
14/09/07 16:00:37 INFO mapreduce.Job: map 0% reduce 0%
14/09/07 16:00:43 INFO mapreduce.Job: map 100% reduce 0%
14/09/07 16:00:49 INFO mapreduce.Job: map 100% reduce 100%
14/09/07 16:00:50 INFO mapreduce.Job: Job job_1410054700839_0003 completed successfully
14/09/07 16:00:50 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=330
FILE: Number of bytes written=184533
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=605
HDFS: Number of bytes written=244
HDFS: Number of read operations=7
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=3171
Total time spent by all reduces in occupied slots (ms)=3435
Total time spent by all map tasks (ms)=3171
Total time spent by all reduce tasks (ms)=3435
Total vcore-seconds taken by all map tasks=3171
Total vcore-seconds taken by all reduce tasks=3435
Total megabyte-seconds taken by all map tasks=3247104
Total megabyte-seconds taken by all reduce tasks=3517440
Map-Reduce Framework
Map input records=10
Map output records=10
Map output bytes=304
Map output materialized bytes=330
Input split bytes=135
Combine input records=0
Combine output records=0
Reduce input groups=1
Reduce shuffle bytes=330
Reduce input records=10
Reduce output records=10
Spilled Records=20
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=58
CPU time spent (ms)=2140
Physical memory (bytes) snapshot=431476736
Virtual memory (bytes) snapshot=1437347840
Total committed heap usage (bytes)=402653184
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=470
File Output Format Counters
Bytes Written=244
[bcampbell@localhost ~]$ hadoop fs -ls output23
Found 2 items
-rw-r--r-- 1 bcampbell supergroup 0 2014-09-07 16:00 output23/_SUCCESS
-rw-r--r-- 1 bcampbell supergroup 244 2014-09-07 16:00 output23/part-r-00000
[bcampbell@localhost ~]$ hadoop fs -cat output23/part-r-00000 | head
1 dfs.safemode.min.datanodes
1 dfs.safemode.extension
1 dfs.replication
1 dfs.namenode.name.dir
1 dfs.namenode.checkpoint.dir
1 dfs.domain.socket.path
1 dfs.datanode.hdfs
1 dfs.datanode.data.dir
1 dfs.client.read.shortcircuit
1 dfs.client.file
[bcampbell@localhost ~]$


First This


yum update
yum groupinstall "Books and Guides" "C Development Tools and Libraries" "Development Tools" "Fedora Eclipse" "System Tools" "Editors"
rpm -ivh jdk-7u67-linux-x64.rpm
#as su
alternatives --install /usr/bin/java java /usr/java/jdk1.7.0_67/jre/bin/java 2
alternatives --install /usr/bin/javaws javaws /usr/java/jdk1.7.0_67/jre/bin/javaws 2
alternatives --install /usr/bin/javac javac /usr/java/jdk1.7.0_67/bin/javac 2
alternatives --install /usr/bin/jar jar /usr/java/jdk1.7.0_67/bin/jar 2
#For the browser plugin
alternatives --install /usr/lib64/mozilla/plugins/libjavaplugin.so libjavaplugin.so.x86_64 /usr/java/jdk1.7.0_67/jre/lib/amd64/libnpjp2.so 2
alternatives --config java
alternatives --config javac
#Set JAVA_HOME in /etc/profile. Make sure the JAVA_HOME environment variable is set for the root user on each node. You can check with a command such as
$ sudo env | grep JAVA_HOME
export JAVA_HOME=/usr/java/jdk1.7.0_67
export PATH=$JAVA_HOME/bin:$PATH
#Lastly - if you want JAVA_HOME set for all users, put a shell script in /etc/profile.d that runs the two commands;
export JAVA_HOME=/usr/java/jdk1.7.0_67
export PATH=$JAVA_HOME/bin:$PATH


Then from here;
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Quick-Start/cdh5qs_yarn_pseudo.html

sudo yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm

Install the CDH – key

rpm --import http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera

then the business

sudo yum install hadoop-conf-pseudo

Check that things went well.
rpm -ql hadoop-conf-pseudo
/etc/hadoop/conf.pseudo
/etc/hadoop/conf.pseudo/README
/etc/hadoop/conf.pseudo/core-site.xml
/etc/hadoop/conf.pseudo/hadoop-env.sh
/etc/hadoop/conf.pseudo/hadoop-metrics.properties
/etc/hadoop/conf.pseudo/hdfs-site.xml
/etc/hadoop/conf.pseudo/log4j.properties
/etc/hadoop/conf.pseudo/mapred-site.xml
/etc/hadoop/conf.pseudo/yarn-site.xml

run these commands next

sudo -u hdfs hdfs namenode -format
for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
sudo -u hdfs hadoop fs -rm -r /tmp
sudo -u hdfs hadoop fs -mkdir -p /tmp/hadoop-yarn/staging/history/done_intermediate
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
sudo -u hdfs hadoop fs -mkdir -p /var/log/hadoop-yarn
sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn
sudo -u hdfs hadoop fs -ls -R /
sudo service hadoop-yarn-resourcemanager start
sudo service hadoop-yarn-nodemanager start
sudo service hadoop-mapreduce-historyserver start
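
To confirm the daemons actually came up, jps (part of the JDK) should list the NameNode, DataNode, ResourceManager, NodeManager and JobHistoryServer processes, and the NameNode and ResourceManager web UIs should respond on the usual ports (50070 and 8088, as seen elsewhere in these notes);
sudo jps
curl -s http://localhost:50070 | head
curl -s http://localhost:8088/cluster | head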

I’ve had a problem making a user directory that I must revisit.
I kept getting permission or no such file errors;

[root@localhost bcampbell]# sudo -u hdfs hadoop fs -mkdir "/home/bcampbell/input"
mkdir: `/home/bcampbell/input': No such file or directory

This works;
sudo -u hdfs hadoop fs -mkdir /tmp/input
Most likely because /tmp already exists in HDFS, so the new directory has a parent; plain -mkdir will not create missing parent directories, which is what the -p flag is for.

These chown commands worked once the directories had been created with -p (see below);
sudo -u hdfs hadoop fs -chown bcampbell /home/bcampbell/input/
[root@localhost bcampbell]# sudo -u hdfs hadoop fs -chown bcampbell /home/bcampbell/hadoop

And then I was back in business. These all worked without issue.

sudo -u hdfs hadoop fs -mkdir -p /home/bcampbell/input
sudo -u hdfs hadoop fs -mkdir /home/bcampbell/hadoop
sudo -u hdfs hadoop fs -chown bcampbell /home/bcampbell/input/
sudo -u hdfs hadoop fs -chown bcampbell /home/bcampbell/hadoop

If all is well, add a file and do a check;
[bcampbell@localhost ~]$ hadoop fs -copyFromLocal /home/bcampbell/Downloads/Milton_ParadiseLost.txt /home/bcampbell/input
[bcampbell@localhost ~]$ hdfs fsck /home/bcampbell/input/ -files -blocks
14/09/07 10:33:57 WARN ssl.FileBasedKeyStoresFactory: The property 'ssl.client.truststore.location' has not been set, no TrustStore will be loaded
Connecting to namenode via http://localhost:50070
FSCK started by bcampbell (auth:SIMPLE) from /127.0.0.1 for path /home/bcampbell/input/ at Sun Sep 07 10:33:58 EDT 2014
/home/bcampbell/input/
/home/bcampbell/input/Milton_ParadiseLost.txt 507105 bytes, 1 block(s): OK
0. BP-1947923207-127.0.0.1-1410054283948:blk_1073741826_1002 len=507105 repl=1

Status: HEALTHY
Total size: 507105 B
Total dirs: 1
Total files: 1
Total symlinks: 0
Total blocks (validated): 1 (avg. block size 507105 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 1.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Sun Sep 07 10:33:58 EDT 2014 in 14 milliseconds

The filesystem under path '/home/bcampbell/input/' is HEALTHY
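
Another quick sanity check at this point is the datanode report;
hdfs dfsadmin -report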

Next step is to run a word count example with YARN.
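
The invocation should look roughly like this; the output directory name is my own placeholder, and it must be a path the user can write to that does not already exist;
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /home/bcampbell/input /home/bcampbell/hadoop/outputWC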

I’ve had a hard time getting the Java setup correct for running CDH4 on Fedora / RHEL. Fedora comes with OpenJDK and CDH requires the Oracle JDK. My basic understanding is that one installs the Oracle JDK and then sets up links via the “alternatives” command.
Here are some notes; this is a work in progress until I get it all running.


————————–
Swap between OpenJDK and Sun/Oracle Java JDK/JRE
alternatives --config java
alternatives --config javaws
alternatives --config libjavaplugin.so
alternatives --config libjavaplugin.so.x86_64
alternatives --config javac
Post-Installation Setup
Add JAVA_HOME environment variable to /etc/profile file or $HOME/.bash_profile
Java JDK and JRE latest version (/usr/java/latest)
You can make a symbolic link from your Java JDK to that directory, which makes it easier to handle several versions of Java
# export JAVA_HOME JDK/JRE
export JAVA_HOME="/usr/java/latest"
Java JDK and JRE absolute version (/usr/java/jdk1.7.0_09) or other version
# export JAVA_HOME JDK
export JAVA_HOME="/usr/java/jdk1.7.0_09"
# export JAVA_HOME JRE #
export JAVA_HOME="/usr/java/jre1.7.0_09"
—————————————


Get the Oracle JDK; no need to get the JRE separately, since it’s included in the JDK.

http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html

Install the rpm;

rpm -ivh jdk-7u67-linux-x64.rpm

Then set up the links
[root@localhost jvm]# alternatives --install /usr/bin/java java /usr/java/jdk1.7.0_67/jre/bin/java 2
[root@localhost jvm]# alternatives --install /usr/bin/javaws javaws /usr/java/jdk1.7.0_67/jre/bin/javaws 2
[root@localhost jvm]# alternatives --install /usr/bin/javac javac /usr/java/jdk1.7.0_67/bin/javac 2
[root@localhost jvm]# alternatives --install /usr/bin/jar jar /usr/java/jdk1.7.0_67/bin/jar 2
[root@localhost jvm]# alternatives --install /usr/lib64/mozilla/plugins/libjavaplugin.so libjavaplugin.so.x86_64 /usr/java/jdk1.7.0_67/jre/lib/amd64/libnpjp2.so 2
[root@localhost jvm]# alternatives --config java

There are 2 programs which provide 'java'.

Selection Command
-----------------------------------------------
*+ 1 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65-2.5.1.3.fc20.x86_64/jre/bin/java
2 /usr/java/jdk1.7.0_67/jre/bin/java

Enter to keep the current selection[+], or type selection number:
[root@localhost jvm]# alternatives --config javac

There are 2 programs which provide 'javac'.

Selection Command
-----------------------------------------------
*+ 1 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65-2.5.1.3.fc20.x86_64/bin/javac
2 /usr/java/jdk1.7.0_67/bin/javac

Enter to keep the current selection[+], or type selection number:
[root@localhost jvm]# alternatives --config javaws

#Set JAVA_HOME in /etc/profile. Make sure the JAVA_HOME environment variable is set for the root user on each node. You can check by using a command such as
sudo env | grep JAVA_HOME

Lastly, if you want JAVA_HOME set for all users, make a shell script in /etc/profile.d that runs these two commands;

export JAVA_HOME=/usr/java/jdk1.7.0_67
export PATH=$JAVA_HOME/bin:$PATH

There is a warning not to change the /etc/profile script itself; follow it. All the shell scripts in /etc/profile.d get called when the profile script is run.
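
One way to drop that script into place (the file name java.sh here is just my choice);
sudo tee /etc/profile.d/java.sh > /dev/null <<'EOF'
export JAVA_HOME=/usr/java/jdk1.7.0_67
export PATH=$JAVA_HOME/bin:$PATH
EOF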

I’ve implemented a .NET performance counter in Managed C++. This post contains some details on the class and notes on installing the counter.

The source is in the klImageCore repository;

https://github.com/wavescholar/klImageCore

https://github.com/wavescholar/klImageCore/blob/master/klImageCoreV/inc/klPerformanceCounter.h

Doxygen documentation is at

http://wavescholar.github.io/klImageCore/classkl_counters_1_1kl_performance_counter.html

Installation of the counter requires registry access. Usually this can be achieved by running the application or Visual Studio as administrator. But if you don’t like these options, there are steps to take that will allow the assembly to edit the registry.

UPDATE – not related to the problem I was seeing – but there is no JobTracker when running with YARN.

When trying to run the Hadoop example WordCount I encountered the error below. If this happens to you, check whether anything is listening on the JobTracker port 8021 with this command; netstat -atn | grep LISTEN. There was nothing listening on that port, so I’ll investigate and update later. See my post below – the web API gave me a response, but it was this error;

Incompatible shuffle request version
[bruce@localhost hadoopTest]$ hadoop jar /usr/share/java/hadoop/hadoop-mapreduce-examples.jar grep input output 'dfs[a-z.]+'

14/09/03 20:50:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/09/03 20:50:51 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/09/03 20:50:51 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
14/09/03 20:50:51 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/bruce/.staging/job_1409694147925_0003
14/09/03 20:50:51 ERROR security.UserGroupInformation: PriviledgedActionException as:bruce (auth:SIMPLE) cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:8020/user/bruce/input
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:8020/user/bruce/input
...
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
input WordCount.java
[bruce@localhost hadoopTest]$ netstat -atn | grep LISTEN
tcp 0 0 127.0.0.1:8020 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:50070 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:631 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:50010 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:50075 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:50020 0.0.0.0:* LISTEN
tcp6 0 0 :::8042 :::* LISTEN
tcp6 0 0 ::1:631 :::* LISTEN
tcp6 0 0 :::8088 :::* LISTEN
tcp6 0 0 :::13562 :::* LISTEN
tcp6 0 0 :::8030 :::* LISTEN
tcp6 0 0 :::8031 :::* LISTEN
tcp6 0 0 :::8032 :::* LISTEN
tcp6 0 0 :::8033 :::* LISTEN
tcp6 0 0 :::38951 :::* LISTEN
tcp6 0 0 :::8040 :::* LISTEN
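
Separate from the JobTracker question, the InvalidInputException in that log just means the HDFS input path does not exist for this user; relative HDFS paths resolve under /user/<username>. Creating the HDFS home directory and copying the input in, following the same pattern used earlier, should clear that particular error (user name bruce as in the prompt);
sudo -u hdfs hadoop fs -mkdir -p /user/bruce
sudo -u hdfs hadoop fs -chown bruce /user/bruce
hadoop fs -put input input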
