Elephant Enlightenment: Part 1

May 21, 2013

For some light vacation reading, I started reading Hadoop Beginner’s Guide. I made it through about half of the book, and I wanted to share some random facts that I found particularly enlightening.

Data Serialization: Compression and Splitting

Splitting refers to the ability of Hadoop to split input files into chunks for input into the map phase of a MapReduce job. Splitting is important for 2 reasons:

It allows the map phase to be parallelized. The more splits you can make of the data, the more map processes can run simultaneously.
It allows for data locality. Hadoop distributes stored data across the nodes of the cluster, so when a map task runs on the node that already holds its split, the data does not have to travel over the network and the map phase is more efficient.
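
To make the idea of splitting concrete, here is a minimal sketch of how a file can be carved into byte ranges, each of which could feed one map task. The function name compute_splits and the sizes used are made up for illustration. This is not Hadoop's actual split logic, which also has to deal with record boundaries and block placement.

```python
def compute_splits(file_size, split_size):
    """Return (offset, length) pairs that together cover file_size bytes."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    offset = 0
    while offset < file_size:
        # The last split may be shorter than split_size.
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


# Hypothetical example: a 200 MB file cut into 64 MB splits.
MB = 1024 * 1024
for offset, length in compute_splits(200 * MB, 64 * MB):
    print("offset %d MB, length %d MB" % (offset // MB, length // MB))
```

Each pair is an independent unit of work, which is what lets the map phase run in parallel.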

When choosing a “container format” (a.k.a. serialization format) for your data, you need to pick one that is splittable, compressible, and fast. There are a few container file formats, including Sequence File, RCFile, and Avro. These formats all support both splitting and compression. Of these, Avro seems the most promising, as it has good cross-language support. The main issue with using these formats is that you probably need a pre-processing phase to convert your data into the chosen format.

If you don’t want to use one of the container formats, but you still want your data to be both splittable and compressed, you have 2 options.

Use bz2 compression. This is the only compression format that supports splitting out of the box.
Manually split your data into chunks and compress each chunk.

You can find a great cheat sheet for compression & splitting in Table 4-1 of Hadoop: The Definitive Guide.
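
The second option, splitting manually and compressing each chunk, can be sketched roughly as follows. This is a minimal illustration using gzip from Python's standard library. The function names, the part-00000.gz naming scheme, and the chunk size are assumptions made for the example, not anything prescribed by Hadoop.

```python
import gzip
import os
import tempfile


def split_and_compress(lines, lines_per_chunk, out_dir):
    """Write lines into gzip files holding at most lines_per_chunk lines each."""
    if lines_per_chunk <= 0:
        raise ValueError("lines_per_chunk must be positive")
    paths = []
    for start in range(0, len(lines), lines_per_chunk):
        # Hypothetical naming scheme: part-00000.gz, part-00001.gz, ...
        name = "part-%05d.gz" % (start // lines_per_chunk)
        path = os.path.join(out_dir, name)
        # Each chunk is a complete gzip file, so it can be read on its own.
        with gzip.open(path, "wt", encoding="utf-8") as out:
            out.writelines(lines[start:start + lines_per_chunk])
        paths.append(path)
    return paths


def read_chunk(path):
    """Decompress one chunk without touching any of the others."""
    with gzip.open(path, "rt", encoding="utf-8") as src:
        return src.readlines()


records = ["record %d\n" % i for i in range(10)]
with tempfile.TemporaryDirectory() as tmp:
    for path in split_and_compress(records, 4, tmp):
        print(os.path.basename(path), len(read_chunk(path)))
```

Because every chunk is a standalone compressed file, each one can be handed to a separate map task even though gzip itself is not splittable.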

Data Loss

Data in Hadoop is replicated, but replication is not an alternative to backups. Here are just a few ways you can lose data in Hadoop.

Parallel node failure

“As the cluster size increases, so does the failure rate and having three node failures in a narrow window of time becomes less and less unlikely. Conversely, the impact also decreases but rapid multiple failures will always carry a risk of data loss.” [1]

Cascading failures

A failure on one node will cause under-replicated data to be replicated to other nodes, which could result in additional failures, cascading to other machines, and so on. While this scenario is unlikely, it can occur.

Human error

Data is not backed up or check-pointed in Hadoop. If someone accidentally deletes data, it’s gone.

High Availability

With Hadoop 1.0, there is a single point of failure: the NameNode. The NameNode contains the fsimage file, which tracks where all the data lives in the Hadoop cluster. If you lose your NameNode, you won’t be able to use your cluster, and if you don’t properly back up the fsimage, you will experience data loss. “Having to move NameNode due to a hardware failure is probably the worst crisis you can have with a Hadoop cluster.” [2] This issue has been addressed in Hadoop 2.0, where NameNode High Availability has been implemented.

Bloom Filters

Often in a MapReduce job you want to logically combine two data sources, or “join” them. There are a couple of methods for doing this. One is to join the data during the map portion of the job, which is more efficient than doing it in the reduce portion. To accomplish the join in the map portion of the job, you must be able to store one of your data sources in memory on every node in the cluster. But what if you can’t fit all of the data into memory? “In cases where we can accept some false positives while still guaranteeing no false negatives, a Bloom filter provides an extremely compact way of representing such information.” [3] “The use of Bloom filters is in fact a standard technique for joining in distributed databases, and it’s used in commercial products such as Oracle 11g.” [4] More information on Bloom filters & Hadoop can be found in the book Hadoop in Action in section 5.2. An example application of this can also be found here.
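
To show what a Bloom filter looks like in practice, here is a minimal sketch in plain Python. It is not the implementation from Hadoop or from either book. The class, the map_side_filter helper, the filter size, and the sample keys are all assumptions made for illustration. The idea is to build the filter from the keys of the smaller data source, then use it on the map side to discard records from the larger source that cannot possibly match.

```python
import hashlib


class BloomFilter:
    """A fixed-size bit array probed by num_hashes derived hash positions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive all positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def map_side_filter(records, bloom):
    """Keep (key, value) records whose key may exist in the smaller source."""
    return [(key, value) for key, value in records if bloom.might_contain(key)]


# Hypothetical data: keys from the small source, records from the large one.
bloom = BloomFilter(num_bits=1024, num_hashes=3)
for user_id in ["u1", "u2", "u3"]:
    bloom.add(user_id)

large_source = [("u1", "click"), ("u9", "view"), ("u3", "purchase")]
print(map_side_filter(large_source, bloom))
```

Records that survive the filter may still include a few false positives, so the actual join step has to check for a real match. What the filter guarantees is that no matching record is thrown away.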

That’s all for now. Check back in the future for further Elephant Enlightenments!