Wednesday, 30 January 2013

Problems using ROME to read an RSS feed? It could be a byte order mark problem

Getting
com.sun.syndication.io.ParsingFeedException: Invalid XML at com.sun.syndication.io.WireFeedInput.build(WireFeedInput.java:210)
errors when using the ROME RSS Java library? It could be a byte order mark problem. Luckily Apache Commons IO can help: wrapping the input in its BOMInputStream strips the mark before ROME ever sees it.
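
A minimal sketch of the fix, assuming the feed is read from a local file; the class and method names here are illustrative rather than the original test case:

import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.SyndFeedInput;
import com.sun.syndication.io.XmlReader;
import org.apache.commons.io.input.BOMInputStream;

import java.io.FileInputStream;
import java.io.InputStream;

public class BomSafeFeedReader {

    // Without BOMInputStream a leading UTF-8 byte order mark reaches the XML
    // parser and ROME fails with ParsingFeedException: Invalid XML.
    public static SyndFeed read(String path) throws Exception {
        InputStream in = new BOMInputStream(new FileInputStream(path));
        try {
            return new SyndFeedInput().build(new XmlReader(in));
        } finally {
            in.close();
        }
    }
}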

Using Lucene 4 to calculate Cosine Similarity

Sujit Pal's example of how to calculate cosine similarity with Lucene is very useful. The Lucene 4 API has changed substantially, so here is a version rewritten for Lucene 4.
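
Cosine similarity here is just the dot product of the two documents' term-frequency vectors divided by the product of their norms. A minimal sketch of that calculation against the Lucene 4 API, assuming the field was indexed with term vectors (and term frequencies) enabled; the class name and the field argument are illustrative:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CosineSimilarity {

    // Pull the raw term frequencies for one document out of its term vector.
    private static Map<String, Long> termFrequencies(IndexReader reader, int docId, String field)
            throws IOException {
        Map<String, Long> frequencies = new HashMap<String, Long>();
        Terms vector = reader.getTermVector(docId, field); // requires term vectors for this field
        TermsEnum termsEnum = vector.iterator(null);        // Lucene 4.x signature takes a reuse argument
        BytesRef term;
        while ((term = termsEnum.next()) != null) {
            frequencies.put(term.utf8ToString(), termsEnum.totalTermFreq());
        }
        return frequencies;
    }

    // Cosine of the angle between the two documents' term-frequency vectors.
    public static double cosine(IndexReader reader, int docA, int docB, String field) throws IOException {
        Map<String, Long> a = termFrequencies(reader, docA, field);
        Map<String, Long> b = termFrequencies(reader, docB, field);
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Long> entry : a.entrySet()) {
            Long other = b.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
            normA += entry.getValue() * entry.getValue();
        }
        for (Long value : b.values()) {
            normB += value * value;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}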

Monday, 30 January 2012

Running Hive 0.7 with remote metastore using MySQL requires HDFS trash enabled

When Hive runs in multi-user mode with a remote metastore, and the client runs as a different user from the metastore, loading data into tables fails unless HDFS trash is enabled:

$: sudo -u hdtester hive -f HIVE-MULTI-06.hql
Hive history file=/tmp/hdtester/hive_job_log_hdtester_201201301103_941823740.txt
OK
Time taken: 0.345 seconds
OK
Time taken: 0.113 seconds
Copying data from file:/tmp/HIVE-MULTI-06.txt
Copying file: file:/tmp/HIVE-MULTI-06.txt
Loading data to table default.hivemulti06
rmr: org.apache.hadoop.security.AccessControlException:
Permission denied: user=hdtester, access=ALL, inode="hivemulti06":hive:hive:rwxr-xr-x
Failed with exception org.apache.hadoop.security.AccessControlException:
Permission denied: user=hdtester, access=ALL, inode="hivemulti06":hive:hive:rwxr-xr-x
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask

However, if you enable trash in hdfs-site.xml:

<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
  <final>true</final>
  <description>Number of minutes between trash checkpoints.
  If zero, the trash feature is disabled.
  </description>
</property>

then it works fine:

$: sudo -u hdtester hive -f HIVE-MULTI-06.hql
Hive history file=/tmp/hdtester/hive_job_log_hdtester_201201301117_543476780.txt
OK
Time taken: 0.362 seconds
OK
Time taken: 0.108 seconds
Copying data from file:/tmp/HIVE-MULTI-06.txt
Copying file: file:/tmp/HIVE-MULTI-06.txt
Loading data to table default.hivemulti06
Moved to trash: hdfs://localhost:54310/user/hive/warehouse/hivemulti06
OK

Monday, 9 January 2012

Building Apache Sqoop on RHEL 5

Unfortunately our cluster runs on a fairly old version of Linux: RHEL 5. Here are some tips for building Apache Sqoop 1.4 in this environment. First you need to make sure that xmlto and redhat-lsb are installed:

$ sudo yum install xmlto redhat-lsb

Then you need to check if asciidoc is installed:

$ sudo rpm -q asciidoc

If not, you can install it from RepoForge:

$ wget http://pkgs.repoforge.org/asciidoc/asciidoc-8.6.6-2.el5.rf.noarch.rpm
$ sudo yum install asciidoc-8.6.6-2.el5.rf.noarch.rpm

Finally, RHEL 5 only comes with DocBook 4.4, not 4.5. However, for Sqoop at least, you can get around this by telling the XML catalog to treat 4.5 as 4.4.

$ sudo vi /etc/xml/catalog

and add the line:

<rewriteSystem systemIdStartString="http://www.oasis-open.org/docbook/xml/4.5" rewritePrefix="/usr/share/sgml/docbook/xml-dtd-4.4-1.0-30.1"/>

Friday, 23 December 2011

Rolling your own Hadoop distribution with Jenkins

My current client maintains its own in-house versions of Hadoop, Pig and Hive. They apply additional patches to released versions, occasionally backporting from trunk. Software quality and reliability are very important to them, so they put these different components through a quality assurance process.

The Apache components already come with suites of unit and system tests, so if you are doing this too it is worth becoming familiar with the various test suites, re-using them wherever possible and only creating your own tests when really necessary. For example, see the various documents about testing Pig (overview, proposal, unit tests, system tests), Hadoop (unit tests, system tests), HBase (unit tests, system tests) and Hive (unit tests).

One area where we do additional testing is compression support. LZO and Hadoop-LZO improve both cluster throughput and cluster storage capacity, but because they are licensed under the GPL they are distributed separately from Hadoop, so building them is a little more complicated, and ideally they should be tested with each of the different Hadoop components.

I have been doing some work on automating this process. Let's consider how to build and test Hadoop using Jenkins. Here we are going to build Hadoop 0.20.203.0, so we will be using Ant; note that newer versions of Hadoop are switching to Maven.

To build Hadoop we first need to build LZO: because we are using a build cluster, we cannot guarantee that LZO will already be available on whichever worker picks up the job. Normally the final stage of building LZO is run as root in order to install it to /usr/local, but root privileges are not available on the continuous build machine. Instead you can use DESTDIR so that LZO is staged locally.
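
A minimal sketch of this step, assuming LZO's standard autotools build and a staging directory under the Jenkins workspace (the version number and paths are illustrative):

$ tar xzf lzo-2.06.tar.gz
$ cd lzo-2.06
$ ./configure --enable-shared --prefix=/usr/local
$ make
$ make install DESTDIR=${WORKSPACE}/build

This places the headers and native libraries under ${WORKSPACE}/build/usr/local without needing root.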


You can then configure Jenkins to archive the artifacts in the build directory as a post-build action, so that the LZO artifacts are available to other components. The next step is to create another job for Hadoop-LZO. This requires a few small changes to Hadoop-LZO so that it can publish its results to your local Maven repository. The job then retrieves the LZO artifacts created earlier and sets C_INCLUDE_PATH to point at the include files, and LIBRARY_PATH, LD_LIBRARY_PATH and JAVA_LIBRARY_PATH to point at the native libraries, before calling the package target to build it and the published target to push it to the local Maven repository.
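
A rough sketch of that build step, assuming the archived LZO artifacts have been copied into ${WORKSPACE}/lzo with the same layout as the DESTDIR staging above (the paths are illustrative):

LZO=${WORKSPACE}/lzo/usr/local
env C_INCLUDE_PATH=${LZO}/include \
    LIBRARY_PATH=${LZO}/lib \
    LD_LIBRARY_PATH=${LZO}/lib \
    JAVA_LIBRARY_PATH=${LZO}/lib \
    ant clean package published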


Then you need to create a third job to build Hadoop itself. Here I assume you are building Hadoop 0.20.203.0. As already mentioned, we want to run and pass the Apache test suites. With LZO and Hadoop-LZO this is not too difficult, but with the bigger components there are a number of things that can cause tests to fail.

It is important to verify that your CI machine runs builds with a umask of 022, because Hadoop (specifically DataNode.java) checks the permissions of any data directories it creates, and this check fails if the umask is wrong.

Next, it is necessary to use Ant 1.7.1: several unit tests have been labelled @Ignore, but under Ant 1.8.2 they still run and cause failures.

Also, use a reasonably powerful machine for CI. I have encountered test failures in Hadoop that disappeared simply by doubling the number of processors.

Finally, Hadoop needs Apache Forrest, specifically Apache Forrest 0.8, to build the documentation. Forrest in turn depends on JDK 1.5, although it is possible to get around this by modifying hadoop-0.20.203.0/src/docs/forrest.properties and hadoop-0.20.203.0/src/docs/cn/forrest.properties to turn off validation of the sitemap and the stylesheets:
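
The settings involved should look something like this in both files (the property names follow the stock forrest.properties template):

forrest.validate.sitemap=false
forrest.validate.stylesheets=false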


Now here is an example of a build script for Hadoop itself. By paying attention to the points above, it is possible to build and test Hadoop and pass all of the Apache tests:
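
A rough sketch of the shape such a script can take, again assuming the LZO artifacts sit under ${WORKSPACE}/lzo; the Ant properties, targets and the Forrest location are indicative rather than a definitive recipe:

#!/bin/sh
# Point the build at the locally staged LZO headers and native libraries,
# then build the native code, the distribution tarball and run the test suite.
LZO=${WORKSPACE}/lzo/usr/local
env C_INCLUDE_PATH=${LZO}/include \
    LIBRARY_PATH=${LZO}/lib \
    LD_LIBRARY_PATH=${LZO}/lib \
    JAVA_LIBRARY_PATH=${LZO}/lib \
    ant -Dcompile.native=true -Dforrest.home=/opt/apache-forrest-0.8 clean tar test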


One key point about this script and the previous one is the way they use env to set LIBRARY_PATH, LD_LIBRARY_PATH and JAVA_LIBRARY_PATH so that the build can find the LZO native libraries, even though they are not installed in /usr/local. Build scripts for Pig, Hive and HBase can be written in a similar way. LIBRARY_PATH makes native libraries available at compile and link time, whereas LD_LIBRARY_PATH and JAVA_LIBRARY_PATH make them available at run time.

Personally, I prefer to put the environment variables required for the build into scripts and call those scripts from Jenkins, rather than configuring everything directly in Jenkins, because the scripts can then be checked into source control. Also, our desktops out-perform our CI machine, so it is nice to be able to run exactly the same build scripts locally and on the build server, although there are a few differences between the environments that the scripts need to detect.

Configuring HBase using Spring

Deploying a webapp that uses HBase to multiple servers that talk to different ZooKeeper ensembles? Then you will need to configure hbase.zookeeper.quorum per server. Here's a little recipe for configuring HBase in a Spring application context. Hadoop's configuration class, org.apache.hadoop.conf.Configuration, does not follow bean conventions; in particular it uses addResource to specify an alternative configuration file. One way around this is to leverage Spring's map support. First create a wrapper around org.apache.hadoop.hbase.HBaseConfiguration:
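
A minimal sketch of such a wrapper; the class name and package are illustrative, and it assumes HBaseConfiguration.create() (HBase 0.90 and later) to build the underlying Configuration:

package com.example;

import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class HBaseConfigurationFactory {

    // Start from the defaults in hbase-default.xml/hbase-site.xml on the classpath.
    private final Configuration configuration = HBaseConfiguration.create();

    // Bean-style setter so Spring can inject overrides as a simple map,
    // avoiding the addResource-based API of Hadoop's Configuration class.
    public void setProperties(Map<String, String> properties) {
        for (Map.Entry<String, String> entry : properties.entrySet()) {
            configuration.set(entry.getKey(), entry.getValue());
        }
    }

    public Configuration getConfiguration() {
        return configuration;
    }
}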


Then you can easily add some server-specific configuration in Spring's applicationContext.xml:
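
Something along these lines, where the bean class refers to the wrapper above and the ZooKeeper host names are placeholders:

<bean id="hbaseConfigurationFactory" class="com.example.HBaseConfigurationFactory">
  <property name="properties">
    <map>
      <entry key="hbase.zookeeper.quorum" value="zk1.example.com,zk2.example.com,zk3.example.com"/>
      <entry key="hbase.zookeeper.property.clientPort" value="2181"/>
    </map>
  </property>
</bean>

<bean id="hbaseConfiguration" factory-bean="hbaseConfigurationFactory" factory-method="getConfiguration"/>

Each deployment environment then only needs to vary the map entries.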


SpringSource does have a project for using Spring with Hadoop.

Monday, 22 August 2011

Multiple cluster configurations in Hadoop

I am working on automating some integration tests for Hadoop, Hive and Pig at the moment. One of the requirements is that the tests run in different environments, e.g. on a single node and on a cluster. In his excellent book on Hadoop, Tom White recommends:

When developing Hadoop applications, it is common to switch between running the application locally and running on a cluster. One way to do this is to have Hadoop configuration files containing the connection settings for each cluster you run against, and specify which one you are using when you run Hadoop applications or tools. As a matter of best practice, it is recommended to keep these outside Hadoop's installation directory tree, as this makes it easy to switch between Hadoop versions without duplicating or losing settings.

Specifically this is done using the HADOOP_CONF_DIR environment variable or by starting Hadoop using the --config option, for example:

/app/hadoop/bin/start-all.sh --config /home/mbutler/work/hadoop/config/singleNode

However, if you are using Hadoop with Pig or Hive, then things are a little more involved. Pig and Hive also have a --config option, but it selects the configuration directory for Pig or Hive respectively, rather than for Hadoop. In Pig, one way you can select the correct Hadoop configuration is by setting the PIG_CLASSPATH environment variable, either in pig-env.sh in the Pig configuration directory or at the command line, for example:

PIG_CLASSPATH=/home/mbutler/work/hadoop/config/singleNode
export PIG_CLASSPATH
/app/pig/pig --config /home/mbutler/work/pig/config/pigConfig

In Hive, your only option is to set HADOOP_CONF_DIR in hive-env.sh in the Hive configuration directory. If you intend to put the Hadoop configuration files in the same directory as hive-env.sh, then you can use a line like this:

export HADOOP_CONF_DIR=${PWD}/$(dirname $0)