Zen of Modelling
from http://modernstatisticalworkflow.blogspot.com
Your model should have some theoretical basis.
Your model, when simulated, should produce outcomes with a similar density to the observed values. Similarly, your model should not place weight on the impossible (like negative quantities, or binary outcomes that aren’t binary). It should place non-zero weight on possible but unlikely outcomes.
Think deeply about what is a random variable and what is not. A good rule of thumb: random variables are those things we do not know for certain out of sample. Your model is a joint density over the random variables.
You never have enough observations to distinguish one possible data generating process from another process that has different implications. You should model both, giving both models weight in decision-making.
The point of estimating a model on a big dataset is to estimate a rich model (one with many parameters). Using millions of observations to estimate a model with dozens of parameters is a waste of electricity.
Unless you have run a very large, very well-designed experiment, your problem has unobserved confounding information. If this problem does not occupy a lot of your time, you are doing something wrong.
Fixed effects normally aren’t. Mean reversion applies to most things, including unobserved information. Don’t be afraid to shrink.
Relationships observed in one group can almost always help us form better understanding of relationships in another group. Learn and use partial pooling techniques to benefit from this.
For decision-making, your estimated standard deviations are too small, your estimated degrees of freedom are too big, or you have confused one for the other. Remember, the uncertainty produced by your model is the amount of uncertainty you should have if your model is correct and the process you are modeling does not change.
You always have more information than exists in your data. Be a Bayesian, and use this outside information in your priors.

Bayesian Analysis
http://www.columbia.edu/~cjd11/charles_dimaggio/DIRE/styled-4/styled-11/code-8/#fnref2

Random Effects and Pooling
http://modernstatisticalworkflow.blogspot.com/2016/11/random-effects-partial-pooling-and.html

PPD
http://modernstatisticalworkflow.blogspot.com/2016/08/why-you-should-be-posterior-predictive.html

Hierarchical Partial Pooling
https://docs.pymc.io/notebooks/hierarchical_partial_pooling.html

sudo ln -s /usr/lib64/libgfortran.so.3 /usr/lib64/libgfortran.so
sudo ln -s /usr/lib64/libquadmath.so.0.0.0 /usr/lib64/libquadmath.so

Install rmate on the remote

gem install rmate

Install rsub in Sublime Text 3 via Package Control and set up port forwarding (rmate's default port is 52698) if you're using PuTTY.
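If you are connecting with plain ssh rather than PuTTY, the equivalent remote port forward (assuming rmate's default port of 52698, with a placeholder user and host) is:

ssh -R 52698:localhost:52698 user@remote-host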


Occasionally RStudio hangs when using the SparkR context. I surmise some resource is not being managed well. The symptom is a frozen UI a little while after running a few Spark jobs; the jobs themselves run fine.

Here’s a gist with my reset mechanism.

sudo rstudio-server stop
cd ~
rm -rf .rstudio/
rm -rf .RData
rm -rf .Rhistory
sudo rstudio-server start

Hadoop stress testing can be run from the hadoop-mapreduce-client-jobclient-tests.jar.
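Running the tests jar with no program name should print the list of available test programs; the HDP 2.2 path below is the one used in the examples further down:

hadoop jar /usr/hdp/2.2.0.0-2041/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar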

Here are the options available
DFSCIOTest: Distributed i/o benchmark of libhdfs.
DistributedFSCheck: Distributed checkup of the file system consistency.
JHLogAnalyzer: Job History Log analyzer.
MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures
NNdataGenerator: Generate the data to be used by NNloadGenerator
NNloadGenerator: Generate load on Namenode
NNstructureGenerator: Generate the structure to be used by NNdatagenerator
SliveTest: HDFS Stress Test and Live Data Verification.
TestDFSIO: Distributed i/o benchmark.
fail: a job that always fails
filebench: Benchmark SequenceFile(Input|Output)Format (block,record compressed and uncompressed), Text(Input|Output)Format (compressed and uncompressed)
largesorter: Large-Sort tester
loadgen: Generic map/reduce load generator
mapredtest: A map/reduce test check.
minicluster: Single process HDFS and MR cluster.
mrbench: A map/reduce benchmark that can create many small jobs
nnbench: A benchmark that stresses the namenode.
sleep: A job that sleeps at each map and reduce task.
testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce
testfilesystem: A test for FileSystem read/write.
testmapredsort: A map/reduce program that validates the map-reduce framework’s sort.
testsequencefile: A test for flat files of binary key value pairs.
testsequencefileinputformat: A test for sequence file input format.
testtextinputformat: A test for text input format.
threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills over maps with 1 spill

Here is an example of running the hdfs test;

First the write test. Note that the results file is written to a local directory; the test data itself is written to /benchmarks in HDFS.

hadoop jar /usr/hdp/2.2.0.0-2041/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 4 -fileSize 1GB -resFile /home/hdfs/test/out

Then run the read test

hadoop jar /usr/hdp/2.2.0.0-2041/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 4 -fileSize 1GB -resFile /home/hdfs/test/out

Granted, we don’t get the benefits of hdfs or YARN, but I occasionally do my development on Windows and find it useful to have a version of Spark available for running under the debugger or trying some things out.

First install (unzip) java, maven, scala, sbt, and a spark distro

Put together a DOS env script (ends with .cmd) that looks like this;

SET PATH=%PATH%;C:\JavaDev\maven-3.2.5\bin;C:\JavaDev\scala-2.10.5\bin;C:\JavaDev\sbt\
SET JAVA_HOME=C:\JavaDev\jdk1.7.79
SET SCALA_HOME=C:\JavaDev\scala-2.10.5
SET MAVEN_HOME=C:\JavaDev\maven-3.2.5\bin
SET SPARK_HOME=C:\JavaDev\spark-1.4.0
REM SET MAVEN_OPTS="-Xmx1024M -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

Run the spark build

mvn -DskipTests clean package

If Maven runs out of memory (it will), un-comment the MAVEN_OPTS line in the environment script or set the Java settings on the build command line:

mvn -DskipTests clean package -DargLine="-Xmx1524m"

I like to have all the sources available for debugging – here is a recipe for getting that set up. It works on Windows and Linux.

There are 2 ways to get the sources for all dependencies in Maven project.

1)
Specifying -DdownloadSources=true -DdownloadJavadocs=true at the command line when building.

2)
Open your settings.xml file (~/.m2/settings.xml). Add a profile section with the downloadSources/downloadJavadocs properties, then make sure activeProfiles activates it; an example invocation follows the snippets below.

<profiles>
<profile>
<id>downloadSources</id>
<properties>
<downloadSources>true</downloadSources>
<downloadJavadocs>true</downloadJavadocs>
</properties>
</profile>
</profiles>

<activeProfiles>
<activeProfile>downloadSources</activeProfile>
</activeProfiles>
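With the profile active (or using the option 1 flags), regenerating the Eclipse project pulls the sources and javadocs down, e.g. with the maven-eclipse-plugin used elsewhere in these notes:

mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true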

Discardable Memory and Materialized Queries
http://hortonworks.com/blog/dmmq/

Materialized views with adapters for MongoDB, Apache Drill, and Spark

http://calcite.incubator.apache.org/

Nice article on probabilistic methods for aggregation

Probabilistic Data Structures for Web Analytics and Data Mining

Nice article on Sketches

sketches1.pdf

sudo yum update
sudo yum groupinstall 'Server with GUI'
sudo systemctl set-default graphical.target
sudo yum install tigervnc-server
sudo cp /lib/systemd/system/vncserver@.service /etc/systemd/system/vncserver@.service
sudo vi /etc/systemd/system/vncserver@.service

Put ec2-user where you find the <USER> placeholder in the service file.

sudo systemctl daemon-reload

Then as user

vncpasswd

sudo firewall-cmd --zone=public --add-port=5900/tcp
sudo firewall-cmd --zone=public --add-port=5901/tcp
sudo firewall-cmd --zone=public --add-port=5902/tcp
sudo firewall-cmd --zone=public --add-port=5903/tcp
sudo systemctl start vncserver@:1.service

1 in the last line corresponds to port 5901
To start another session use
systemctl start vncserver@:2.service

Then connect via a VNC client, putting :1 (for session number 1) after the server IP/name.

e.g. ec2-XX-XX-XX-XX.compute-1.amazonaws.com:1, where XX-XX-XX-XX is the external IP address.

Ambari error – “Host Role in Invalid State”.

If the Ambari agent gets in a bad state you might encounter this error. Try restarting the Ambari agent on the offending nodes;

ambari-agent restart

If this fails try to reinstall the clients from the Ambari hosts dashboard


If you’re upgrading Ambari from 1.6.0 to 1.7.0 there will be some properties that don’t match up. This is a bigger pain to fix up.

Here are the commands to install R on RHEL 6.5

wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
yum localinstall epel-release-6-8.noarch.rpm
yum install R

When installing R on RHEL 6.6 I encountered the following dependency problem

--> Finished Dependency Resolution
Error: Package: R-core-devel-3.2.1-1.el6.x86_64 (epel)
Requires: libicu-devel
Error: Package: R-core-devel-3.2.1-1.el6.x86_64 (epel)
Requires: blas-devel >= 3.0
Error: Package: R-core-devel-3.2.1-1.el6.x86_64 (epel)
Requires: lapack-devel
Error: Package: R-core-devel-3.2.1-1.el6.x86_64 (epel)
Requires: texinfo-tex

Execute these commands in order as su

wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
yum localinstall epel-release-6-8.noarch.rpm
wget http://mirror.centos.org/centos/6/os/x86_64/Packages/lapack-devel-3.2.1-4.el6.x86_64.rpm
wget http://mirror.centos.org/centos/6/os/x86_64/Packages/blas-devel-3.2.1-4.el6.x86_64.rpm
wget http://mirror.centos.org/centos/6/os/x86_64/Packages/texinfo-tex-4.13a-8.el6.x86_64.rpm
wget http://mirror.centos.org/centos/6/os/x86_64/Packages/libicu-devel-4.2.1-9.1.el6_2.x86_64.rpm
sudo yum localinstall *.rpm
yum install R

Rscript

require(multicore)  # on current R versions, mclapply lives in the 'parallel' package instead
args <- commandArgs(trailingOnly = TRUE)
cores <- if (length(args) > 0) as.integer(args[1]) else 1
cat(sprintf("Multicore functions running on maximum %d cores\n", cores))

## Multicore functions running on maximum 1 cores
By default cores is set to 1, but when run as an Rscript I can specify how many cores I want to use.

Add mc.cores = cores to mclapply calls:

# Example processor hungry multicore operation
mats <- mclapply(1:500, function(x) matrix(rnorm(x*x),ncol = x) %*% matrix(rnorm(x*x), ncol = x), mc.cores = cores)

$ Rscript --vanilla R/myscript.R 12 &> logs/myscript.log &
Rscript runs the .R file as a standalone script, without going into the R environment. The --vanilla flag means that you run the script without calling in your .Rprofile (which is typically set up for interactive use) and without prompting you to save a workspace image.

## example #! script for a Unix-alike
#! /path/to/Rscript --vanilla --default-packages=utils
args <- commandArgs(TRUE)
res <- try(install.packages(args))
if(inherits(res, "try-error")) q(status=1) else q()

SparkR needs R and rJava.
install.packages("rJava")
You can check if rJava is installed correctly by running
library(rJava)
If you get no output after the above command, it means that rJava has been installed successfully. If you get an error message while installing rJava, you might need to do the following to configure Java with R: exit R, run the following command in a shell, and relaunch R to install rJava.
$ R CMD javareconf -e

The Spark API for launching R worker

public static org.apache.spark.api.r.BufferedStreamThread createRWorker(String rLibDir, int port)

Apache ZooKeeper is a distributed service providing application infrastructure for common distributed coordination tasks such as configuration management, distributed synchronization objects such as locks and barriers, and leader election. ZooKeeper can be used for cluster membership operations such as leader election and the adding and removing of nodes. In the Hadoop ecosystem, ZooKeeper is used for high-availability provisioning of the YARN ResourceManager, HDFS NameNode failover, the HBase Master, and the Spark Master.

The ZooKeeper service provides the abstraction of a set of data nodes, called znodes, organized into a hierarchical namespace. The hierarchy of znodes in the namespace provides the objects used to keep state information. Access to znodes is provided by paths in the hierarchy.

ZooKeeper uses the Zab distributed consensus algorithm, which is similar to classical Paxos. Zab and Paxos both follow a protocol in which a leader proposes values to the followers and then waits for acknowledgements from a quorum of followers before considering a proposal committed. Proposals include epoch numbers (ballot numbers in Paxos), which are unique version numbers.

In addition to configuration management, distributed locks, and group membership algorithms, ZooKeeper primitives can be used to implement:

Double barriers enable clients to synchronize the beginning and the end of a computation.
When enough processes, defined by the barrier threshold, have joined the barrier, processes start their computation and leave the barrier once they have finished.

Sometimes in distributed systems it is not always clear a priori what the final system configuration will look like. For example, a client may want to start a master process and several worker processes, but the starting of the processes is done by a scheduler, so the client does not know ahead of time information such as the addresses and ports that it can give the worker processes to connect to the master. We handle this scenario with ZooKeeper using a rendezvous znode.

Zookeeper Zab protocol messages are encapsulated in a QuorumPacket;
class QuorumPacket {
    int type; // Request, Ack, Commit, Ping, etc.
    long zxid;
    buffer data;
    vector authinfo; // only used for requests
}

The basic API for manipulating nodes.

create(path, data, flags): Creates a znode with path name path, stores data[] in it, and returns the name of the new znode. flags enables a client to select the type of znode (regular or ephemeral) and to set the sequential flag

delete(path, version): Deletes the znode path if that znode is at the expected version

getData(path, watch): Returns the data and meta-data, such as version information, associated with the znode

setData(path, data, version): Writes data[] to znode path if the version number is the current version of the znode

getChildren(path, watch): Returns the set of names of the children of a znode

sync(path): Waits for all updates pending at the start of the operation to propagate to the server that the client is connected to

Watches are used to monitor state changes. ZooKeeper watches are one-time triggers and due to the latency involved between getting a watch event and resetting of the watch, it’s possible that a client might lose changes done to a znode during this interval. In a distributed application in which a znode changes multiple times between the dispatch of an event and resetting the watch for events, developers must be careful to handle such situations in the application logic.
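To make the API above concrete, here is a minimal sketch using the standard ZooKeeper Java client. It assumes a server on localhost:2181; the class name and the /demo path are made up for the example, and the create call will fail if /demo already exists. It also marks where a one-time watch would be re-registered, per the note above.

import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZkApiSketch {
    public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);
        // The Watcher receives session events plus any znode watches we register.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
                // Watches are one-time triggers: to keep watching a znode you must
                // re-register it (another getData/exists/getChildren call) right here.
                System.out.println("event: " + event.getType() + " " + event.getPath());
            }
        });
        connected.await();

        // create(path, data, acl, createMode): a regular (persistent) znode.
        zk.create("/demo", "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // getData(path, watch, stat): true registers a one-time watch on /demo.
        Stat stat = new Stat();
        byte[] data = zk.getData("/demo", true, stat);

        // setData(path, data, version): conditional write against the version just read.
        zk.setData("/demo", "v2".getBytes(), stat.getVersion());

        // getChildren(path, watch) and delete(path, version); -1 skips the version check.
        List<String> children = zk.getChildren("/", false);
        System.out.println(new String(data) + " | children of /: " + children);
        zk.delete("/demo", -1);
        zk.close();
    }
}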

Spark settings can be set in a number of ways:

  • $SPARK_HOME/conf/spark-defaults.conf – make a copy of the template and edit this if the file does not exist.
  • On the command line when submitting jobs or starting up the shell
  • Directly on the SparkConf used to construct the SparkContext (see the sketch after this list).
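A minimal sketch of the third option with the Java API; the app name, local[*] master, and the 0.4 value are illustrative, and the two property names are the ones discussed in the sections below.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkConfSketch {
    public static void main(String[] args) {
        // Set properties on the SparkConf used to build the context.
        SparkConf conf = new SparkConf()
                .setAppName("conf-sketch")       // arbitrary name for this sketch
                .setMaster("local[*]")           // local mode, just for trying things out
                .set("spark.shuffle.spill", "false")
                .set("spark.storage.memoryFraction", "0.4");
        JavaSparkContext sc = new JavaSparkContext(conf);
        System.out.println("spark.shuffle.spill = " + conf.get("spark.shuffle.spill"));
        sc.stop();
    }
}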

Shuffling
Spark stores intermediate data on disk from a shuffle operation as part of its “under-the-hood” optimization. When Spark has to recompute a portion of an RDD graph, it may be able to truncate the lineage of the graph if the shuffle output is already on disk as a side effect of an earlier shuffle. This can happen even if the RDD is not cached or explicitly persisted. Set spark.shuffle.spill=false to turn this off if it is not needed.

Caching
The caching mechanism reserves a percentage of memory from each executor. This is specified by spark.storage.memoryFraction.

Partitioning
Use more partitions as data size increases

Broadcast Variables
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

Memory Leaks
Closing over objects in lambdas can cause memory leaks. Check the size of the serialized Spark task to make sure there are no leaks.

Some very basic beta on accessing Hive table data via Spark SQL. When I tried this, the SQL features supported by Spark SQL were not up to those provided by Hive. I was using Hive 0.13 and Spark 1.2.

First make sure that Spark is built with Hive enabled. Building Spark is a separate issue that involves lots of considerations; see my post below on integrating native BLAS for MLlib.

Make sure that Spark has access to hive-site.xml, I copied mine to Spark conf folder;
cp [/usr/lib | hadoop install]/hive/conf/hive-site.xml [spark]/conf

Set the YARN conf dir;

export YARN_CONF_DIR=/etc/hadoop/conf

Launch the spark shell in YARN mode

./bin/spark-shell --verbose --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1

Use the SparkContext to get a sql context and then execute sql as so;

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.hql("use spark_test")
hiveContext.hql("INSERT INTO TABLE ...")

HDP Notes;

I’m using Spark 1.4 on Hortonworks 2.2 (Hadoop 2.6)
There is an additional step

Edit
$SPARK_HOME/conf/spark-defaults.conf and add the following settings:
spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041

Lastly, there is a parsing problem with some of the strings in hive-site.xml
If you see this exception from Hive;
java.lang.RuntimeException: java.lang.NumberFormatException: For input string: “5s”

Then change these settings in hive-site.xml to not have the ‘s’:

<property>
  <name>hive.metastore.client.connect.retry.delay</name>
  <value>5s</value>
</property>

<property>
  <name>hive.metastore.client.socket.timeout</name>
  <value>1800s</value>
</property>

I’m setting up a Java / Scala / Spark development environment in Windows and encountered this annoying error where I could not get IntelliJ to interact with GitHub. There are two steps to achieving integration: one is to set up account credentials and the other is to enable git.

This describes the registration process

https://www.jetbrains.com/idea/help/registering-github-account-in-intellij-idea.html

If you run into this error

Cannot run program “git.exe”: CreateProcess error=2, The system cannot find the file specified

  • Download Github For Windows client and install it.
  • Don’t try to find the path to git.exe by right clicking on the GitHub desktop icon – it is a manifest file that gives no hint where the exe resides.
  • Add git.exe location to your “Path Variable”. The location you should add will probably be something like : C:\Users\Your_Username\AppData\Local\GitHub\PortableGit_ca477551eeb4aea0e4ae9fcd3358bd96720bb5c8\bin
  • OR Set the path to git in the git section of the IntellJ VCS setup

Sometimes the logging output from spark-shell can get annoying, or your results can get lost in the content.

Use the instructions below to set the logging level

  • In the conf folder find log4j.properties.template
  • make a copy of this file and rename it to log4j.properties
  • Edit the file, replacing INFO with WARN or ERROR (see the example line below)
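In the Spark 1.x template the line to change is the root category; set to WARN it reads roughly:

log4j.rootCategory=WARN, console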

I’m using Hadoop 2.6.0 – There is no 2.6 profile so use 2.4.

mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests clean package

Install gfortran

Link BLAS to /usr/lib/libblas.so.3 and LAPACK to /usr/lib/liblapack.so.3. If building your own BLAS / LAPACK, the instructions are to build without multithreading. I do not know what this translates to for use with Intel MKL.

Set up Java Scala Spark Hadoop environment variables – here’s my example

export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_72
export PATH=$JAVA_HOME/bin:$PATH
export SCALA_HOME=/home/bcampbell/spark/scala-2.10.4
export SCALA_BIN=$SCALA_HOME/bin
export PATH=$SCALA_BIN:$PATH
export SPARK_HOME=/home/bcampbell/spark/spark-1.2.1
export SPARK_BIN=$SPARK_HOME/bin
export PATH=$SPARK_BIN:$PATH

source /etc/hadoop/conf/hadoop-env.sh
source /etc/hive/conf/hive-env.sh
export HIVE_HOME=/usr/hdp/2.2.0.0-2041/hive
export HADOOP_HOME=/usr/hdp/2.2.0.0-2041/hadoop
export HCAT_HOME=/usr/hdp/2.2.0.0-2041/hive-hcatalog
export PATH=$PATH:$HIVE_HOME/bin
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/hdp/2.2.0.0-2041/hive/lib:/usr/hdp/2.2.0.0-2041/hive-hcatalog/share/hcatalog
export HADOOP_CONF_DIR=/usr/hdp/2.2.0.0-2041/hadoop/conf
#export YARN_CONF_DIR

Some further notes from developing a sequence file writer designed to store a large number of data files in the SequenceFile format.

There is a bug (in 2.6.0) for large values: the writer will not accept values greater than 2 GB and throws a java.lang.NegativeArraySizeException.

If your implementation returns keys and values as BytesWritable instances containing raw bytes, the first four bytes of the returned buffer will contain the run length, as defined by Writable in its java.io.DataOutput implementation. This 4-byte prefix must be stripped if you are after only the bytes that the original BytesWritable instance contained.

Do not call getBytes on the Writable if you want to avoid the issue above; call copyBytes on the BytesWritable value object.
Also, don't call write on the value in this case or you'll get the 4 extra bytes mentioned above.

Snippet :


// Assumes sequenceFileReader is an open SequenceFile.Reader, conf is a Hadoop
// Configuration, and log is the class logger.
Writable key = (Writable) ReflectionUtils.newInstance(
        sequenceFileReader.getKeyClass(), conf);
Writable value = (Writable) ReflectionUtils.newInstance(
        sequenceFileReader.getValueClass(), conf);
boolean next = true;
while (next) {
    try {
        if (sequenceFileReader.next(key, value)) {
            next = true;
        } else {
            next = false;
            continue;
        }
        // Write each value out to a local file named after its key.
        DataOutputStream output = new DataOutputStream(
                new FileOutputStream(key.toString()));
        BytesWritable bw = (BytesWritable) value;
        // copyBytes() returns exactly the stored bytes, with no length prefix.
        byte[] bytes = bw.copyBytes();
        int length = bytes.length;
        // Only one of these:
        output.write(bytes);
        // Can't do this -- value.write(output) would prepend the 4-byte length:
        // value.write(output);
        output.flush();
        output.close();
    } catch (Exception ex) {
        log.error(ex.toString());
    }
}


DistCp is a distributed copy of data from one cluster to another cluster.
It also supports the movement of data from S3 to HDFS and vice versa,
and it can also be used to copy data between directories inside the same cluster.

DistCp can move:
hdfs file
HBase table files
physical blocks

DistCp runs an internal MapReduce job to copy the files.

If you want to use S3 standalone, use s3n. s3n (the S3 native protocol) has a 5 GB file size limitation imposed by Amazon.

The NameNode must be set up to access S3:

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>MY-ID</value>
</property>

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>MY-SECRET</value>
</property>

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>MY-ID</value>
</property>

<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>MY-SECRET</value>
</property>

You can test the upload
hadoop fs -ls s3://bucket-name/

Example;
su hdfs
$hadoop distcp s3n://AccessKey:SecretAccessKey@bucketname/input hdfs://[cluster name ]:8020/use

There are two key parameters to set when compressing values in a sequence file; the compression codec and the compression type.

Compression type indicates whether records are compressed and, if so, whether they are record compressed or block compressed. Block compression may be more performant than record compression if there is similarity between records. Additionally, a compressed record may span multiple blocks.

The codecs are classified as splittable or not. Most compression algorithms work on a stream and cannot start decompressing mid-way, which is sub-optimal for HDFS, where data is stored in blocks.

Splittable means that hdfs blocks can be decompressed in parallel and blocks do not need to be co-located for sequencefile decompression.

Bzip2 is a splittable format. The other codecs available are gzip, zlib, and lz4. LZO is available but has to be plugged into your Hadoop cluster.

Generally, Bzip2 is more compute intensive and yields better space savings than the other formats. Obviously this will be data dependent.

You may encounter this error if you do not have the zlib native libraries accessible:
java.lang.IllegalArgumentException: SequenceFile doesn't work with GzipCodec without native-hadoop code!
Install zlib / zlib-devel and make sure the native Hadoop libraries are set up.

Here is a gist with code snippets to get started


// Other options for CompressionType are RECORD and BLOCK. I'm not sure that
// BLOCK will work with any codec other than bzip2.
CompressionType compressionType = CompressionType.NONE;

enum compressionCodecEnum { gzip, bzip2, none }
compressionCodecEnum compressionCodecType = compressionCodecEnum.bzip2;

// filePath, keyClass and valueClass are SequenceFile.Writer options set up elsewhere.
if (compressionCodecType == compressionCodecEnum.bzip2) {
    CompressionCodec codec = new BZip2Codec();
    SequenceFile.Writer.Option optCom =
            SequenceFile.Writer.compression(CompressionType.BLOCK, codec);
    sequenceFileWriter = SequenceFile.createWriter(conf, filePath, keyClass, valueClass, optCom);
}
if (compressionCodecType == compressionCodecEnum.gzip) {
    CompressionCodec codec = new GzipCodec();
    SequenceFile.Writer.Option optCom =
            SequenceFile.Writer.compression(CompressionType.RECORD, codec);
    sequenceFileWriter = SequenceFile.createWriter(conf, filePath, keyClass, valueClass, optCom);
}
if (compressionCodecType == compressionCodecEnum.none) {
    sequenceFileWriter = SequenceFile.createWriter(conf, filePath, keyClass, valueClass);
}


This error was encountered starting up the NodeManager (YARN) after a Hadoop 2.4 re-install using Ambari 1.6.

It turns out that the old ResourceManager gets listed in a YARN exclude conf file. This entry needs to be removed from the yarn.exclude file in order to get the NodeManager to connect to the ResourceManager.

I.e. on a pseudo-distributed cluster (HDP 2.1):

cd /etc/hadoop/conf/
more yarn.exclude
localhost.localdomain

Then remove the line localhost.localdomain

and restart the YARN service in Ambari.

Occasionally the yum metadata store will get corrupted. You may see an error like this;

[Errno -1] Metadata file does not match checksum
Trying other mirror.
Error: failure: [Errno 256] No more mirrors to try.You could try using --skip-broken to work around the problem

1) Try this first
yum clean metadata

2) If that does not ameliorate the issue then clean all
yum clean all

Generally it is not advisable to use --skip-broken.

1) wget http://mirror.sdunix.com/apache/maven/maven-3/3.2.5/binaries/apache-maven-3.2.5-bin.tar.gz

2) tar xvf apache-maven-3.2.5-bin.tar.gz

3) mv apache-maven-3.2.5 /usr/local/apache-maven

4) Add env variables to your ~/.bashrc file

export M2_HOME=/usr/local/apache-maven
export M2=$M2_HOME/bin
export PATH=$M2:$PATH

Then run source ~/.bashrc

Some notes on building Spark for Hive. My ultimate goal is to run Hive 0.14 on Spark 1.12

Notes from Spark 1.1;
How to make spark sql run on apache-hive 0.13.0

1. build org.spark-project.hive 0.13.0 related jars from apache-hive
a. checkout out code from apache hive release-0.13.0
b. change protobuf.version in pom.xml from 2.5.0 to 2.4.1 since spark 1.0 uses the 2.4.1 version of protobuf
c. use protobuf-java-2.4.1-shaded to generate a new ql/src/gen/protobuf/gen-java/org/apache/hadoop/hive/ql/io/orc/OrcProto.java
from ql/src/protobuf/org/apache/hadoop/hive/ql/io/orc/orc_proto.proto
d. build your own 0.13.0 jars to be used in spark sql
org.spark-project.hive:hive-metastore:0.13.0
org.spark-project.hive:hive-exec:0.13.0
org.spark-project.hive:hive-serde:0.13.0

2. make it compilable with hive 0.13 in spark sql
you can apply the patch in attachment to solve the incompatibility issues

————–
The Eclipse “Scala IDE” 4.0 RC1 can now support projects using different versions of Scala, which is convenient for Spark’s current 2.10.4 support and emerging 2.11 support.

Step 1)
wget -nv http://public-repo-1.hortonworks.com/HDP/centos6/2.x/GA/2.2.0.0/hdp.repo -O /etc/yum.repos.d/hdp.repo

Step 2)Find the source package
yum search hadoop | grep source
hadoop-source.noarch : hadoop-source HDP virtual package
hadoop-yarn-resourcemanager.noarch : hadoop-yarn-resourcemanager HDP virtual
hadoop_2_2_0_0_2041-source.x86_64 : Source code for Hadoop
hadoop_2_2_0_0_2041-yarn-resourcemanager.x86_64 : YARN Resource Manager

Step 3) Install
yum install hadoop_2_2_0_0_2041-source.x86_64

The code ends up at
/usr/hdp/2.2.0.0-2041/hadoop/src/

For development you may find yourself in the situation where you’re manipulating hdfs files and the replication factor is greater than the number of nodes. I’ve previously noted some of the errors that one can encounter in this situation. To remedy this we set dfs.replication to 1 in our hdfs-site.xml config file. Recently I encountered a strange error when trying to append to an existing hdfs file;

java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[10.22.7.81:50010], original=[10.22.7.81:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:960)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1026)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1175)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:531)

If you encounter this error – read here;

https://issues.apache.org/jira/browse/HDFS-4600

And then set
dfs.client.block.write.replace-datanode-on-failure.enable to false
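In hdfs-site.xml that is the following property (only worth doing on a very small development cluster like the one described here):

<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>false</value>
</property>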

Here’s what the documentation says about this parameter;

dfs.client.block.write.replace-datanode-on-failure.enable (default: true)
If there is a datanode/network failure in the write pipeline, DFSClient will try to remove the failed datanode from the pipeline and then continue writing with the remaining datanodes. As a result, the number of datanodes in the pipeline is decreased. The feature is to add new datanodes to the pipeline. This is a site-wide property to enable/disable the feature. When the cluster size is extremely small, e.g. 3 nodes or less, cluster administrators may want to set the policy to NEVER in the default configuration file or disable this feature. Otherwise, users may experience an unusually high rate of pipeline failures since it is impossible to find new datanodes for replacement. See also dfs.client.block.write.replace-datanode-on-failure.policy.

dfs.client.block.write.replace-datanode-on-failure.policy (default: DEFAULT)
This property is used only if the value of dfs.client.block.write.replace-datanode-on-failure.enable is true. ALWAYS: always add a new datanode when an existing datanode is removed. NEVER: never add a new datanode. DEFAULT: Let r be the replication number. Let n be the number of existing datanodes. Add a new datanode only if r is greater than or equal to 3 and either (1) floor(r/2) is greater than or equal to n; or (2) r is greater than n and the block is hflushed/appended.

I suspect that when appending a block replace is required and that this is the source of the problem.

I got this error when writing a Sequence File reader / writer class in Java. The code worked in Eclipse, but not on the command line. This causes me concern, because I was very careful to set up the POM correctly. In any event, initial research on the internet indicated that the problem was a classpath issue. I tried without success to resolve the issue.

This instructions in this post were able to get me past the issue;
http://stackoverflow.com/questions/17265002/hadoop-no-filesystem-for-scheme-file

The explanation made sense to me.

Configuration conf = new Configuration();
// Bind the hdfs:// and file:// schemes explicitly so that merged META-INF/services
// entries in a fat jar can't hide one of the FileSystem implementations.
conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());

Fix missing jar error in Hive after upgrade to HDP 2.2

I got this error after upgrading to HDP 2.2

java.io.FileNotFoundException: File file:/usr/lib/hcatalog/share/hcatalog/hcatalog-core.jar does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)

This jar is now located in a new folder and apparently the location was not set in the new hive-env.sh.

To fix this change the jar location in /etc/hive/conf/hive-env.sh

export HIVE_AUX_JARS_PATH=/usr/hdp/2.2.0.0-2041/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar

Note that you may or may not have to change the setting for HIVE_AUX_JARS_PATH. This is for Hive udf’s and custom SerDe’s.

Note there are three locations for hive-env.sh

/etc/hive/conf.dist/hive-env.sh
/etc/hive/conf.server/hive-env.sh
/etc/hive/conf/hive-env.sh

If you’re using Ambari – there is a template for hive-env – set the paths here otherwise you’ll find your changes stepped on when the service is restarted.

Upgrading an existing data node in a Hadoop cluster can be a difficult task. Edge nodes that do not have HDFS components and are purely for client access can be upgraded by decommissioning the node and redeploying with Ambari. Decommission the node first in Ambari; this does not remove the software. Then run yum remove on all components (this is overkill):
yum remove hcatalog\*
yum remove hive\*
yum remove hbase\*
yum remove zookeeper\*
yum remove oozie\*
yum remove pig\*
yum remove knox\*
yum remove snappy\*
yum remove hadoop-lzo\*
yum remove hadoop\*
yum remove extjs-2.2-1 mysql-connector-java-5.0.8-1\*
yum erase ambari-agent
yum erase ambari-server

In deploying the first time, you should have set up password-less ssh on the client node;

https://ambari.apache.org/1.2.1/installing-hadoop-using-ambari/content/ambari-chap1-5-2.html

Go get the same private key from the node running Ambari Server and use it to re-deploy via Ambari. You should get some warnings once the connection is made; use the python cleanup script below to clean things up in bulk and then re-run the host checks.

python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py --silent --skip=users


These users need to be removed;
userdel ambari-qa
userdel oozie
userdel hcat
userdel hive
userdel yarn
userdel hdfs
userdel nagios
userdel mapred
userdel zookeeper
userdel tez
userdel rrdcached
userdel falcon
userdel sqoop

When creating a Apache Spark Hive context
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

in spark1.1.1 I ran into the following exception;

java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.security.ProxyUserAuthenticator

It turns out that ProxyUserAuthenticator was introduced in Hive 0.13, but Spark 1.1.1 SQL is based on Hive 0.12. To fix the error, locate the hive.security.authenticator.manager setting in hive-site.xml and change the value from org.apache.hadoop.hive.ql.security.ProxyUserAuthenticator to org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator.
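The corrected entry in hive-site.xml then looks like this (names and values taken from the paragraph above):

<property>
  <name>hive.security.authenticator.manager</name>
  <value>org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator</value>
</property>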

To get the best performance out of Spark / MLlib you need to ensure you’re using the native BLAS / LAPACK.

In this post we review the caveats of Java-to-native interop as it applies to Spark / MLlib. Spark is built on Scala. MLlib is a Spark machine learning library that utilizes Breeze. Breeze is built on netlib-java and jblas, and netlib-java is a wrapper for low-level BLAS, LAPACK and ARPACK that performs as fast as the C / Fortran interfaces.

netlib-java uses BLAS/LAPACK/ARPACK from:
1) delegating builds that use machine optimized system libraries
2) self-contained native builds using the reference Fortran from [netlib.org](http://www.netlib.org)
3) Java [F2J](http://icl.cs.utk.edu/f2j/) to ensure full portability on the JVM

The [JNILoader](https://github.com/fommil/jniloader) will attempt to load the implementations in this order automatically. The last option should be avoided for performance reasons, with some caveats about interop to be discussed below.

1) yum install atlas-devel
This will install the native BLAS LAPACK so’s.

These are only generic pre-tuned builds. To get optimal performance for a specific machine, it is best to compile Atlas locally

http://sourceforge.net/projects/math-atlas/files/latest/download

Install the shared libraries into a folder that is seen by the runtime linker (e.g. add your install folder to `/etc/ld.so.conf` then run `ldconfig`) ensuring that `libblas.so.3` and `liblapack.so.3`
exist and point to your optimal builds.

The Intel MKL may also be used by creating symbolic links from `libblas.so.3` and `liblapack.so.3` to `libmkl_rt.so`.

netlib-java is deployed as arpack_combined_all-0.1.jar with the Java implementations. To use native .so’s we might [I have conflicting documentation on this right now] have to build this and the JNI library.

Get netlib-java
git clone https://github.com/fommil/netlib-java.git netlib-java

Get Breeze
https://github.com/scalanlp/breeze.git

Building Breeze requires sbt -a Scala build tool.

Some reading on the JNI overhead;

http://blog.mikiobraun.de/2008/10/matrices-jni-directbuffers-and-number.html

http://mikiobraun.blogspot.com/2008/08/benchmarking-javac-vs-ecj-on-array.html

Now to getting the right BLAS LAPACK installed and connected to netlib

Out of the box,
import com.github.fommil.netlib.BLAS;
..
System.out.println(BLAS.getInstance().getClass().getName());

will give you this

Dec 05, 2014 8:10:27 AM com.github.fommil.netlib.BLAS
WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
Dec 05, 2014 8:10:27 AM com.github.fommil.netlib.BLAS
WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
com.github.fommil.netlib.F2jBLAS

The first step is to realize that HDP 2.1 is Apache Hadoop 2.4. Get Spark here; https://spark.apache.org/downloads.html

Spark is built on Scala, so get Scala;
http://www.scala-lang.org/download/2.9.3.html

Set $SCALA_HOME environment variable i.e.;
export SCALA_HOME=/usr/share/scala-2.9.3

and run the Spark build from the root of the spark directory;
./sbt/sbt -Dhadoop.version=2.4.0 -Pyarn assembly
This step takes some time.

To use the netlib-java JNI wrapper to Native Blas
build with -Pnetlib-lgpl

Now Tell Spark where your yarn-site is;
export YARN_CONF_DIR=/etc/hadoop/conf

MLlib is an alternative to Mahout for developing distributed machine learning algorithms on Hadoop.

Some of the features;
• Sparse data
• Classification and regression tree (CART)
• linear models (SVMs, logistic regression, linear regression)
• decision trees
• SVD and PCA
• L-BFGS
• Model evaluation
• Discretization

Mahout has plans to integrate with Spark in the future via Scala & Spark Bindings, but for now Mahout uses the YARN/MapReduce engine.
The Scala & Spark Bindings for Mahout are a Scala DSL and algebraic optimizer bound to in-core and distributed computations.

MLlib uses the linear algebra package Breeze, which depends on netlib-java and jblas.

jblas in turn depends on native Fortran routines, so to run MLlib we’ll need the gfortran runtime library.

To use native libraries from netlib-java, build Spark with -Pnetlib-lgpl or include com.github.fommil.netlib:all:1.1.2 as a dependency of your project.
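As a Maven dependency that looks roughly like the following; the coordinates are taken from the sentence above, and the type element reflects that the all module is (as far as I can tell) an aggregating pom:

<dependency>
  <groupId>com.github.fommil.netlib</groupId>
  <artifactId>all</artifactId>
  <version>1.1.2</version>
  <type>pom</type>
</dependency>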

“If you want to use optimized BLAS/LAPACK libraries such as OpenBLAS, please link its shared libraries to /usr/lib/libblas.so.3 and /usr/lib/liblapack.so.3, respectively. BLAS/LAPACK libraries on worker nodes should be built without multithreading.”

MLlib can be used in Python via NumPy (use version 1.4 or newer).

JBLAS;
http://mikiobraun.github.io/jblas/

I’m curious whether the Intel MKL can be used for native BLAS / LAPACK.

If Ambari is running decommission all nodes, and follow these steps.

To completely remove all components of HW 2.1 Hadoop distribution;
yum remove hcatalog\*
yum remove hive\*
yum remove hbase\*
yum remove zookeeper\*
yum remove oozie\*
yum remove pig\*
yum remove knox\*
yum remove snappy\*
yum remove hadoop-lzo\*
yum remove hadoop\*
yum remove extjs-2.2-1 mysql-connector-java-5.0.8-1\*
yum erase ambari-agent
yum erase ambari-server

 
Then to reinstall we execute
wget http://public-repo-1.hortonworks.com/ambari/centos5/1.x/updates/1.6.1/ambari.repo
cp ambari.repo /etc/yum.repos.d
yum install ambari-server
ambari-server setup
ambari-server start
ambari-server status

If you get
SELinux status is ‘enabled’
SELinux mode is ‘permissive’
WARNING: SELinux is set to ‘permissive’ mode and temporarily disabled.
OK to continue [y/n] (y)?

To permanently turn off seLinux;
From the command line, you can edit the /etc/sysconfig/selinux file. This file is a symlink to /etc/selinux/config. The configuration file is self-explanatory. Changing the value of SELINUX or SELINUXTYPE changes the state of SELinux and the name of the policy to be used the next time the system boots.

There are Ambari config files left about after removal of the service, if you run into trouble with the installation procedure below, then grep for ambari config files and remove them.

Also, try this
ambari-server reset

If this error is encountered in a CentOS install of Hortonworks via Ambari, then you will need to clean out your yum metadata cache. During its normal use yum creates a cache of metadata and packages. This cache can take up a lot of space. The yum clean command allows you to clean up these files. All the files yum clean acts on are normally stored in /var/cache/yum.
yum clean all

To overcome the Samba mount error
mount error(12): Cannot allocate memory
Refer to the mount.cifs(8) manual page (e.g. man mount.cifs)

on CentOS connected to Windows 7 one can restart the Server service. This only provides a temporary fix. Also, the memory error can occur in the middle of a large file transfer. A more permanent fix is to make the following registry modifications.

Run RegEdit.exe;

Navigate to
HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management
Set LargeSystemCache key to 1

Navigate to
HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters
Set Size to 3

Restart the Server service.

These are notes from getting ports open and setup for Hadoop

Turn Off Firewall;
service iptables save
service iptables stop
chkconfig iptables off

Hive is supposed to be running on 10000, but I found this;

nmap localhost
...
10000/tcp open snet-sensor-mgmt

See what app is listening on port (8080 for example);
lsof -i :8080

Testing ports: use nc, the command which runs netcat.
Can you listen at 9000? Try nc -l 9000 and, in another console, nc localhost 9000; all input will then be echoed in the other console if the connection is good. Here is Ambari responding:
nc localhost 8080
Hi
HTTP/1.1 400 Bad Request
Content-Length: 0
Connection: close
Server: Jetty(7.6.7.v20120910)

wxHexEditor is a good Linux replacement for HxD.


sudo yum install libtool gcc-c++ wxGTK-devel
svn checkout svn://svn.code.sf.net/p/wxhexeditor/code/trunk wxHexEditor
cd wxHexEditor/
make OPTFLAGS="-fopenmp"

I've been struggling with accessing Hive table data from MR. There were several stumbling blocks;

  • Making sure that the right libraries were being used by Maven.
  • Getting the correct hive-site.xml picked up by the configuration mechanism.
  • Sorting out differences between the old JobConf api and the and new YARN job api.

JobConf and everything in the org.apache.hadoop.mapred package are part of the old API used to write Hadoop jobs; Job and everything in the org.apache.hadoop.mapreduce package are the new API for writing Hadoop jobs. The main points here are to change mapred packages to mapreduce and to use Configuration instead of JobConf.

Here's a good reference :

http://hadoopbeforestarting.blogspot.de/2012/12/difference-between-hadoop-old-api-and.html

Add the path to a hive-site.xml to the project java class path if you're developing in Eclipse and want to run in standalone / pseudo-distributed mode.

Initially, it was not clear how to set up the Maven dependencies for this project. I needed to pull in Hive, but the Hortonworks repo did not have the right version. I ended up using both Hortonworks and Maven Central, pulling the Hive dependencies from Maven Central.

I'll revisit this when it's all up and running.

 

Here are the Hive dependencies

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.4.0.2.1.5.0-695</version>
</dependency>

<dependency>
  <groupId>org.apache.hive.hcatalog</groupId>
  <artifactId>hive-hcatalog-core</artifactId>
  <version>0.13.1</version>
</dependency>

And the repos;

<repository>
  <id>central</id>
  <name>Maven Repository Switchboard</name>
  <url>http://repo1.maven.org/maven2</url>
</repository>

<repository>
  <id>HDPReleases</id>
  <name>HDP Releases</name>
  <url>http://repo.hortonworks.com/content/repositories/releases/</url>
</repository>

Update: I keep running into a compatibility problem with my Hive MapReduce read code;
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
From what I gather, the error comes up from mixing MR1 and YARN code, but it’s not clear if that’s the cause of this problem. I tried to run the HiveRead job outside of Eclipse to see if the error went away. Here are some steps I had to take to get the Hive variables set up.

source /etc/alternatives/hadoop-conf/hadoop-env.sh
source /etc/alternatives/hive-conf/hive-env.sh
export HIVE_HOME=/usr/lib/hive
export PATH=$PATH:$HIVE_HOME/bin
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/lib/hive/lib:/etc/alternatives/hive-conf

NOTES: I'm using maven /Eclipse

This error is encountered when building the jar without including dependencies
hadoop jar HiveRead.jar com.bcampbell.hadoopproject.App :
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf

I got this error after adding data to hdfs.  Additional space was allocated, but the region server did not recognize it.

Run this to check the files system;

hadoop fsck /user

The issue is that the hdfs filesystem was set up on /hadoop which was on a partition with very little space. When I tried to add a data directory in /home, Ambari complained. So, to keep it simple I added an additional disk and put the hadoop data there.
Here are commands to explore disks and partitions, create a new partition, and set the permissions. I’ll revisit later to comment ;

fdisk -l
pvs
vgs
lvs
/sbin/mkfs.ext3 -L /hadoopdata /dev/sdb
mkdir /hadoopdata
mount -t ext3 /dev/sdb /hadoopdata/
stat -c "%a" /hadoop
chmod 755 /hadoopdata/

Update:
For some reason Ambari would not pick up the additional data directory so I had to resort to an online expansion of the original LVM volume.
I re-purposed the extra drive as an LVM partition
(http://www.linuxuser.co.uk/features/resize-your-disks-on-the-fly-with-lvm)
fdisk /dev/sdb (here make a primary partition with the same filesystem, ext4, as the other LVM partitions)
pvcreate /dev/sdb1
vgextend VolGroup /dev/sdb1
lvextend -L +196G /dev/mapper/VolGroup-lv_root

At this point I thought I was all done, but the filesystem needs to be expanded to see the extra space. This is dangerous, especially since it was the root volume, but it went well.

resize2fs /dev/mapper/VolGroup-lv_root

Lastly, I want to point out that the hdfs-site.xml setting
dfs.datanode.du.reserved
is only used for checks. I was confused and thought that it would reserve space. Setting this to 0 removes the check.
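In hdfs-site.xml that setting is:

<property>
  <name>dfs.datanode.du.reserved</name>
  <value>0</value>
</property>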

When deploying a standalone Hadoop you may encounter the bad datanode error. I was able to ingest data through HDFS, but when inserting a file larger than one block I encountered this error. To resolve it, set the replication to 1. The default is 3, the minimum for true high-availability failover capability. I set up my environment for development only, so in hdfs-site.xml set:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

Note, if you’re using Ambari, do this through the config section of the hdfs service.

yum install hue

Follow the setups instructions at;
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap-hue.html

/etc/init.d/hue start

If there are problems with the port, diagnose with;

iptables -A INPUT -p tcp --dport 8000 -j ACCEPT
netstat -an | grep 8000
nc 127.0.0.1 8000 < /dev/null; echo $?

I had to change the port to 8087 to get it to work on CentOS 6.5. Edit /etc/hue/conf/hue.ini for this.

Follow the directions in the Hortonworks Ambari doc.

Use the fully qualified name from
hostname -f

Make sure it’s in
/etc/hosts
and
/etc/sysconfig/network

Execute
setenforce 0

Check
umask 0022

In
/etc/yum/pluginconf.d/refresh-packagekit.conf
set
enabled=0

There is an OpenSSL bug in CentOS 6.5.
If you cannot connect to the ambari-agent during setup, update openssl:
rpm -qa | grep openssl
yum upgrade openssl

For Manual Ambari Agent connection follow these steps

yum install epel-release
yum install ambari-agent

vi /etc/ambari-agent/conf/ambari-agent.ini

[server]
hostname={your.ambari.server.hostname}
url_port=4080
secured_url_port=8443

ambari-agent start

Once you connect, if there are issues Ambari will report them; try running the python script
python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py --silent --skip=users

NTPD Setup: Install and Configure NTP to Synchronize the System Clock
yum install ntp ntpdate ntp-doc
chkconfig ntpd on
ntpdate pool.ntp.org
/etc/init.d/ntpd start

Turn off iptables while Ambari runs; turn it back on later.
/etc/init.d/iptables stop

Here are some setup instructions for Mahout and R Studio Server on Hortonworks Sandbox.

Run the commands below on the Sandbox VM.


yum install mahout

rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
yum -y install git wget R
ln -s /etc/default/hadoop /etc/profile.d/hadoop.sh
cat /etc/profile.d/hadoop.sh | sed 's/export //g' > ~/.Renviron
wget http://download2.rstudio.org/rstudio-server-0.97.332-x86_64.rpm
sudo yum install --nogpgcheck rstudio-server-0.97.332-x86_64.rpm

The default port RStudio Server runs on is 8787; you can determine the IP address of the VM using ifconfig. Open a browser on the host OS and navigate to <vm-ip>:8787.

If you’re using Oracle VM VirtualBox and RStudio won’t open in the browser, then navigate to the network settings and do a port forward on 8787.

Check and run the server
sudo rstudio-server verify-installation

UPDATE – 1/2015: For installation on RHEL 6.6 you may need to install the following;

wget http://mirror.centos.org/centos/6/os/x86_64/Packages/lapack-devel-3.2.1-4.el6.x86_64.rpm http://mirror.centos.org/centos/6/os/x86_64/Packages/blas-devel-3.2.1-4.el6.x86_64.rpm http://mirror.centos.org/centos/6/os/x86_64/Packages/libicu-devel-4.2.1-9.1.el6_2.x86_64.rpm http://mirror.centos.org/centos/6/os/x86_64/Packages/texinfo-tex-4.13a-8.el6.x86_64.rpm
sudo yum localinstall *.rpm

First install Maven – the entire Apache ecosystem is built with Maven. Then we’ll be editing the pom.xml.

To set up a Maven project run this;
mvn archetype:generate \
-DarchetypeGroupId=org.apache.maven.archetypes \
-DarchetypeArtifactId=maven-archetype-quickstart \
-DgroupId=com.bcampbell.hadoopproject \
-DartifactId=wordcount

Then edit the pom file that’s generated. This will then in turn be used to generate an Eclipse workspace.

Here’s the pom file;


<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.bcampbell.hadoopproject</groupId>
<artifactId>wordcount</artifactId>
<version>0.0.1</version>
<packaging>jar</packaging>
<name>wordcount</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<hadoop.version>2.3.0-cdh5.1.2</hadoop.version>
</properties>
<build>
<pluginManagement>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>2.3.2</version>
<configuration>
<source>1.7</source>
<target>1.7</target>
</configuration>
</plugin>
</plugins>
</pluginManagement>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins </groupId>
<artifactId>maven-eclipse-plugin</artifactId>
<version>2.9</version>
<configuration>
<projectNameTemplate>
${project.artifactId}
</projectNameTemplate>
<buildOutputDirectory>
eclipse-classes
</buildOutputDirectory>
<downloadSources>true</downloadSources>
<downloadJavadocs>false</downloadJavadocs>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>1.7.1</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-eclipse-plugin</artifactId>
<version>2.9</version>
<configuration>
<buildOutputDirectory>eclipse-classes</buildOutputDirectory>
<downloadSources>true</downloadSources>
<downloadJavadocs>false</downloadJavadocs>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
<scope>provided</scope>
</dependency>
</dependencies>
<repositories>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
</repositories>
</project>


Check the config and generate a build;
mvn validate
mvn compile
mvn package

After that – generate the eclipse workspace;
mvn -Declipse.workspace=eclipse_workspace eclipse:configure-workspace eclipse:eclipse

And Let Eclipse know about Maven
Window -> Preferences
Java -> Build Path -> Classpath Variables -> New
name will be M2_REPO
path will be something like ~/.m2
Click the OK button twice

Use the command like to check the setup;
java -cp wordcount-0.0.1.jar com.bcampbell.hadoopproject.App

Run the jar with the wordcount
hadoop jar wordcount-0.0.1.jar com.bcampbell.hadoopproject.WordCount /user/bcampbell/input output44

Sadly I’m getting many missing libraries in the Eclipse workspace. These are specified in Maven but did not get downloaded; they are all the Hadoop jars. There was a step for this, but I think I got it wrong. I’ll update later when I get this working.

UPDATE – I changed the hadoop dependency to;

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.3.0-cdh5.1.2</version>
</dependency>

This resolved the missing jars in the .m2 repository directory.

To create directories in my install I had to take these steps;
[root@localhost bcampbell]# sudo -u hdfs hadoop fs -mkdir /user
[root@localhost bcampbell]# sudo -u hdfs hadoop fs -mkdir /user/bcampbell
[root@localhost bcampbell]# sudo -u hdfs hadoop fs -mkdir /user/bcampbell/input
[root@localhost bcampbell]# sudo -u hdfs hadoop fs -chown bcampbell /user/bcampbell
[root@localhost bcampbell]# sudo -u hdfs hadoop fs -chown bcampbell /user/bcampbell/input

First take a look to see what’s in the jar
[bcampbell@localhost input]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.1.2.jar
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.

Here’s the setup and output for running the grep example;


hadoop fs -put /etc/hadoop/conf/*.xml input
[bcampbell@localhost ~]$ hadoop fs -ls input
Found 7 items
-rw-r--r-- 1 bcampbell supergroup 507105 2014-09-07 15:55 input/Milton_ParadiseLost.txt
-rw-r--r-- 1 bcampbell supergroup 246679 2014-09-07 15:55 input/WilliamYeats.txt
-rw-r--r-- 1 bcampbell supergroup 2133 2014-09-07 15:58 input/core-site.xml
-rw-r--r-- 1 bcampbell supergroup 2324 2014-09-07 15:58 input/hdfs-site.xml
-rw-r--r-- 1 bcampbell supergroup 246679 2014-09-07 15:56 input/inputWC
-rw-r--r-- 1 bcampbell supergroup 1549 2014-09-07 15:58 input/mapred-site.xml
-rw-r--r-- 1 bcampbell supergroup 2375 2014-09-07 15:58 input/yarn-site.xml
[bcampbell@localhost ~]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'
14/09/07 16:00:07 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/09/07 16:00:07 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
14/09/07 16:00:07 INFO input.FileInputFormat: Total input paths to process : 7
14/09/07 16:00:08 INFO mapreduce.JobSubmitter: number of splits:7
14/09/07 16:00:09 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1410054700839_0002
14/09/07 16:00:09 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
14/09/07 16:00:09 INFO impl.YarnClientImpl: Submitted application application_1410054700839_0002
14/09/07 16:00:09 INFO mapreduce.Job: The url to track the job: http://localhost.localdomain:8088/proxy/application_1410054700839_0002/
14/09/07 16:00:09 INFO mapreduce.Job: Running job: job_1410054700839_0002
14/09/07 16:00:18 INFO mapreduce.Job: Job job_1410054700839_0002 running in uber mode : false
14/09/07 16:00:18 INFO mapreduce.Job: map 0% reduce 0%
14/09/07 16:00:23 INFO mapreduce.Job: map 29% reduce 0%
14/09/07 16:00:24 INFO mapreduce.Job: map 43% reduce 0%
14/09/07 16:00:25 INFO mapreduce.Job: map 57% reduce 0%
14/09/07 16:00:26 INFO mapreduce.Job: map 100% reduce 0%
14/09/07 16:00:30 INFO mapreduce.Job: map 100% reduce 100%
14/09/07 16:00:30 INFO mapreduce.Job: Job job_1410054700839_0002 completed successfully
14/09/07 16:00:30 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=330
FILE: Number of bytes written=740425
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1009700
HDFS: Number of bytes written=470
HDFS: Number of read operations=24
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=7
Launched reduce tasks=1
Data-local map tasks=7
Total time spent by all maps in occupied slots (ms)=20069
Total time spent by all reduces in occupied slots (ms)=3482
Total time spent by all map tasks (ms)=20069
Total time spent by all reduce tasks (ms)=3482
Total vcore-seconds taken by all map tasks=20069
Total vcore-seconds taken by all reduce tasks=3482
Total megabyte-seconds taken by all map tasks=20550656
Total megabyte-seconds taken by all reduce tasks=3565568
Map-Reduce Framework
Map input records=27113
Map output records=10
Map output bytes=304
Map output materialized bytes=366
Input split bytes=856
Combine input records=10
Combine output records=10
Reduce input groups=10
Reduce shuffle bytes=366
Reduce input records=10
Reduce output records=10
Spilled Records=20
Shuffled Maps =7
Failed Shuffles=0
Merged Map outputs=7
GC time elapsed (ms)=323
CPU time spent (ms)=6260
Physical memory (bytes) snapshot=2039488512
Virtual memory (bytes) snapshot=5680246784
Total committed heap usage (bytes)=1610612736
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1008844
File Output Format Counters
Bytes Written=470
14/09/07 16:00:30 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/09/07 16:00:30 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
14/09/07 16:00:30 INFO input.FileInputFormat: Total input paths to process : 1
14/09/07 16:00:30 INFO mapreduce.JobSubmitter: number of splits:1
14/09/07 16:00:30 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1410054700839_0003
14/09/07 16:00:30 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
14/09/07 16:00:30 INFO impl.YarnClientImpl: Submitted application application_1410054700839_0003
14/09/07 16:00:30 INFO mapreduce.Job: The url to track the job: http://localhost.localdomain:8088/proxy/application_1410054700839_0003/
14/09/07 16:00:30 INFO mapreduce.Job: Running job: job_1410054700839_0003
14/09/07 16:00:37 INFO mapreduce.Job: Job job_1410054700839_0003 running in uber mode : false
14/09/07 16:00:37 INFO mapreduce.Job: map 0% reduce 0%
14/09/07 16:00:43 INFO mapreduce.Job: map 100% reduce 0%
14/09/07 16:00:49 INFO mapreduce.Job: map 100% reduce 100%
14/09/07 16:00:50 INFO mapreduce.Job: Job job_1410054700839_0003 completed successfully
14/09/07 16:00:50 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=330
FILE: Number of bytes written=184533
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=605
HDFS: Number of bytes written=244
HDFS: Number of read operations=7
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=3171
Total time spent by all reduces in occupied slots (ms)=3435
Total time spent by all map tasks (ms)=3171
Total time spent by all reduce tasks (ms)=3435
Total vcore-seconds taken by all map tasks=3171
Total vcore-seconds taken by all reduce tasks=3435
Total megabyte-seconds taken by all map tasks=3247104
Total megabyte-seconds taken by all reduce tasks=3517440
Map-Reduce Framework
Map input records=10
Map output records=10
Map output bytes=304
Map output materialized bytes=330
Input split bytes=135
Combine input records=0
Combine output records=0
Reduce input groups=1
Reduce shuffle bytes=330
Reduce input records=10
Reduce output records=10
Spilled Records=20
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=58
CPU time spent (ms)=2140
Physical memory (bytes) snapshot=431476736
Virtual memory (bytes) snapshot=1437347840
Total committed heap usage (bytes)=402653184
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=470
File Output Format Counters
Bytes Written=244
[bcampbell@localhost ~]$ hadoop fs -ls output23
Found 2 items
-rw-r--r-- 1 bcampbell supergroup 0 2014-09-07 16:00 output23/_SUCCESS
-rw-r--r-- 1 bcampbell supergroup 244 2014-09-07 16:00 output23/part-r-00000
[bcampbell@localhost ~]$ hadoop fs -cat output23/part-r-00000 | head
1 dfs.safemode.min.datanodes
1 dfs.safemode.extension
1 dfs.replication
1 dfs.namenode.name.dir
1 dfs.namenode.checkpoint.dir
1 dfs.domain.socket.path
1 dfs.datanode.hdfs
1 dfs.datanode.data.dir
1 dfs.client.read.shortcircuit
1 dfs.client.file
[bcampbell@localhost ~]$


First This


yum update
yum groupinstall "Books and Guides" "C Development Tools and Libraries" "Development Tools" "Fedora Eclipse" "System Tools" "Editors"
rpm -ivh jdk-7u67-linux-x64.rpm
#as su
alternatives --install /usr/bin/java java /usr/java/jdk1.7.0_67/jre/bin/java 2
alternatives --install /usr/bin/javaws javaws /usr/java/jdk1.7.0_67/jre/bin/javaws 2
alternatives --install /usr/bin/javac javac /usr/java/jdk1.7.0_67/bin/javac 2
alternatives --install /usr/bin/jar jar /usr/java/jdk1.7.0_67/bin/jar 2
#For the browser plugin
alternatives --install /usr/lib64/mozilla/plugins/libjavaplugin.so libjavaplugin.so.x86_64 /usr/java/jdk1.7.0_67/jre/lib/amd64/libnpjp2.so 2
alternatives --config java
alternatives --config javac
#Set JAVA_HOME in /etc/profile. Make sure the JAVA_HOME environment variable is set for the root user on each node. You can check with a command such as
$ sudo env | grep JAVA_HOME
export JAVA_HOME=/usr/java/jdk1.7.0_67
export PATH=$JAVA_HOME/bin:$PATH
#Lastly - if you want JAVA_HOME set for all users, put a shell script in /etc/profile.d that runs the two commands;
export JAVA_HOME=/usr/java/jdk1.7.0_67
export PATH=$JAVA_HOME/bin:$PATH


Then from here;
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Quick-Start/cdh5qs_yarn_pseudo.html

sudo yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm

Install the CDH – key

rpm --import http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera

then the business

sudo yum install hadoop-conf-pseudo

Check that things went well.
rpm -ql hadoop-conf-pseudo
/etc/hadoop/conf.pseudo
/etc/hadoop/conf.pseudo/README
/etc/hadoop/conf.pseudo/core-site.xml
/etc/hadoop/conf.pseudo/hadoop-env.sh
/etc/hadoop/conf.pseudo/hadoop-metrics.properties
/etc/hadoop/conf.pseudo/hdfs-site.xml
/etc/hadoop/conf.pseudo/log4j.properties
/etc/hadoop/conf.pseudo/mapred-site.xml
/etc/hadoop/conf.pseudo/yarn-site.xml

run these commands next

sudo -u hdfs hdfs namenode -format
for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
sudo -u hdfs hadoop fs -rm -r /tmp
sudo -u hdfs hadoop fs -mkdir -p /tmp/hadoop-yarn/staging/history/done_intermediate
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
sudo -u hdfs hadoop fs -mkdir -p /var/log/hadoop-yarn
sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn
sudo -u hdfs hadoop fs -ls -R /
sudo service hadoop-yarn-resourcemanager start
sudo service hadoop-yarn-nodemanager start
sudo service hadoop-mapreduce-historyserver start
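
To confirm the daemons actually came up, jps (part of the JDK) should list the NameNode, DataNode, ResourceManager, NodeManager and JobHistoryServer processes, and the NameNode and ResourceManager web UIs should respond on the usual ports (50070 and 8088, as seen elsewhere in these notes);
sudo jps
curl -s http://localhost:50070 | head
curl -s http://localhost:8088/cluster | head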

I’ve had a problem making a user directory that I must revisit.
I kept getting permission or no such file errors;

[root@localhost bcampbell]# sudo -u hdfs hadoop fs -mkdir "/home/bcampbell/input"
mkdir: `/home/bcampbell/input': No such file or directory

This works;
sudo -u hdfs hadoop fs -mkdir /tmp/input
Most likely because /tmp already exists in HDFS, so the new directory has a parent; plain -mkdir will not create missing parent directories, which is what the -p flag is for.

These chown commands worked once the directories had been created with -p (see below);
sudo -u hdfs hadoop fs -chown bcampbell /home/bcampbell/input/
[root@localhost bcampbell]# sudo -u hdfs hadoop fs -chown bcampbell /home/bcampbell/hadoop

And then I was back in business. These all worked without issue.

sudo -u hdfs hadoop fs -mkdir -p /home/bcampbell/input
sudo -u hdfs hadoop fs -mkdir /home/bcampbell/hadoop
sudo -u hdfs hadoop fs -chown bcampbell /home/bcampbell/input/
sudo -u hdfs hadoop fs -chown bcampbell /home/bcampbell/hadoop

If all is well, add a file and do a check;
[bcampbell@localhost ~]$ hadoop fs -copyFromLocal /home/bcampbell/Downloads/Milton_ParadiseLost.txt /home/bcampbell/input
[bcampbell@localhost ~]$ hdfs fsck /home/bcampbell/input/ -files -blocks
14/09/07 10:33:57 WARN ssl.FileBasedKeyStoresFactory: The property 'ssl.client.truststore.location' has not been set, no TrustStore will be loaded
Connecting to namenode via http://localhost:50070
FSCK started by bcampbell (auth:SIMPLE) from /127.0.0.1 for path /home/bcampbell/input/ at Sun Sep 07 10:33:58 EDT 2014
/home/bcampbell/input/
/home/bcampbell/input/Milton_ParadiseLost.txt 507105 bytes, 1 block(s): OK
0. BP-1947923207-127.0.0.1-1410054283948:blk_1073741826_1002 len=507105 repl=1

Status: HEALTHY
Total size: 507105 B
Total dirs: 1
Total files: 1
Total symlinks: 0
Total blocks (validated): 1 (avg. block size 507105 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 1.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Sun Sep 07 10:33:58 EDT 2014 in 14 milliseconds

The filesystem under path '/home/bcampbell/input/' is HEALTHY
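
Another quick sanity check at this point is the datanode report;
hdfs dfsadmin -report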

Next step is to run a word count example with YARN.
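
The invocation should look roughly like this; the output directory name is my own placeholder, and it must be a path the user can write to that does not already exist;
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /home/bcampbell/input /home/bcampbell/hadoop/outputWC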

I’ve had a hard time getting the Java setup correct for running CDH4 on Fedora / RHEL. Fedora comes with OpenJDK and CDH requires the Oracle JDK. My basic understanding is that one installs the Oracle JDK and then sets up links via the “alternatives” command.
Here are some notes; this is a work in progress until I get it all running.


————————–
Swap between OpenJDK and Sun/Oracle Java JDK/JRE
alternatives --config java
alternatives --config javaws
alternatives --config libjavaplugin.so
alternatives --config libjavaplugin.so.x86_64
alternatives --config javac
Post-Installation Setup
Add JAVA_HOME environment variable to /etc/profile file or $HOME/.bash_profile
Java JDK and JRE latest version (/usr/java/latest)
You can make a symbolic link from your Java JDK to that directory, which makes it easier to handle several versions of Java
# export JAVA_HOME JDK/JRE
export JAVA_HOME="/usr/java/latest"
Java JDK and JRE absolute version (/usr/java/jdk1.7.0_09) or other version
# export JAVA_HOME JDK
export JAVA_HOME="/usr/java/jdk1.7.0_09"
# export JAVA_HOME JRE #
export JAVA_HOME="/usr/java/jre1.7.0_09"
—————————————


Get the Oracle JDK; no need to get the JRE separately, since it’s included in the JDK.

http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html

Install the rpm;

rpm -ivh jdk-7u67-linux-x64.rpm

Then set up the links
[root@localhost jvm]# alternatives --install /usr/bin/java java /usr/java/jdk1.7.0_67/jre/bin/java 2
[root@localhost jvm]# alternatives --install /usr/bin/javaws javaws /usr/java/jdk1.7.0_67/jre/bin/javaws 2
[root@localhost jvm]# alternatives --install /usr/bin/javac javac /usr/java/jdk1.7.0_67/bin/javac 2
[root@localhost jvm]# alternatives --install /usr/bin/jar jar /usr/java/jdk1.7.0_67/bin/jar 2
[root@localhost jvm]# alternatives --install /usr/lib64/mozilla/plugins/libjavaplugin.so libjavaplugin.so.x86_64 /usr/java/jdk1.7.0_67/jre/lib/amd64/libnpjp2.so 2
[root@localhost jvm]# alternatives --config java

There are 2 programs which provide 'java'.

Selection Command
-----------------------------------------------
*+ 1 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65-2.5.1.3.fc20.x86_64/jre/bin/java
2 /usr/java/jdk1.7.0_67/jre/bin/java

Enter to keep the current selection[+], or type selection number:
[root@localhost jvm]# alternatives --config javac

There are 2 programs which provide 'javac'.

Selection Command
-----------------------------------------------
*+ 1 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65-2.5.1.3.fc20.x86_64/bin/javac
2 /usr/java/jdk1.7.0_67/bin/javac

Enter to keep the current selection[+], or type selection number:
[root@localhost jvm]# alternatives --config javaws

#Set JAVA_HOME in /etc/profile. Make sure the JAVA_HOME environment variable is set for the root user on each node. You can check by using a command such as
sudo env | grep JAVA_HOME

Lastly, if you want JAVA_HOME set for all users, make a shell script in /etc/profile.d that runs these two commands;

export JAVA_HOME=/usr/java/jdk1.7.0_67
export PATH=$JAVA_HOME/bin:$PATH

There is a warning not to change the /etc/profile script itself; follow it. All the shell scripts in /etc/profile.d get called when the profile script is run.
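
One way to drop that script into place (the file name java.sh here is just my choice);
sudo tee /etc/profile.d/java.sh > /dev/null <<'EOF'
export JAVA_HOME=/usr/java/jdk1.7.0_67
export PATH=$JAVA_HOME/bin:$PATH
EOF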

I’ve implemented a .NET performance counter in Managed C++. This post contains some details on the class and notes on installing the counter.

The source is in the klImageCore repository;

https://github.com/wavescholar/klImageCore

https://github.com/wavescholar/klImageCore/blob/master/klImageCoreV/inc/klPerformanceCounter.h

Doxygen documentation is at

http://wavescholar.github.io/klImageCore/classkl_counters_1_1kl_performance_counter.html

Installation of the counter requires registry access. Usually this can be achieved by running the application or Visual Studio as administrator. But if you don’t like these options, there are steps to take that will allow the assembly to edit the registry.

UPDATE – not related to the problem I was seeing – but there is no JobTracker when running with YARN.

When trying to run the Hadoop example WordCount I encountered the error below. If this happens to you, check whether anything is listening on the JobTracker port 8021 with this command; netstat -atn | grep LISTEN. There was nothing listening on that port, so I’ll investigate and update later. See my post below – the web API gave me a response, but it was this error;

Incompatible shuffle request version
[bruce@localhost hadoopTest]$ hadoop jar /usr/share/java/hadoop/hadoop-mapreduce-examples.jar grep input output 'dfs[a-z.]+'

14/09/03 20:50:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/09/03 20:50:51 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/09/03 20:50:51 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
14/09/03 20:50:51 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/bruce/.staging/job_1409694147925_0003
14/09/03 20:50:51 ERROR security.UserGroupInformation: PriviledgedActionException as:bruce (auth:SIMPLE) cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:8020/user/bruce/input
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:8020/user/bruce/input
...
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
input WordCount.java
[bruce@localhost hadoopTest]$ netstat -atn | grep LISTEN
tcp 0 0 127.0.0.1:8020 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:50070 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:631 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:50010 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:50075 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:50020 0.0.0.0:* LISTEN
tcp6 0 0 :::8042 :::* LISTEN
tcp6 0 0 ::1:631 :::* LISTEN
tcp6 0 0 :::8088 :::* LISTEN
tcp6 0 0 :::13562 :::* LISTEN
tcp6 0 0 :::8030 :::* LISTEN
tcp6 0 0 :::8031 :::* LISTEN
tcp6 0 0 :::8032 :::* LISTEN
tcp6 0 0 :::8033 :::* LISTEN
tcp6 0 0 :::38951 :::* LISTEN
tcp6 0 0 :::8040 :::* LISTEN
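
Separate from the JobTracker question, the InvalidInputException in that log just means the HDFS input path does not exist for this user; relative HDFS paths resolve under /user/<username>. Creating the HDFS home directory and copying the input in, following the same pattern used earlier, should clear that particular error (user name bruce as in the prompt);
sudo -u hdfs hadoop fs -mkdir -p /user/bruce
sudo -u hdfs hadoop fs -chown bruce /user/bruce
hadoop fs -put input input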
