== Standalone Hadoop Java Programs

=== When to Use the Direct Hadoop Java API

Don’t.

=== Why You Shouldn’t Use the Direct Hadoop Java API

Instead, write UDFs against a high-level framework such as Pig, Hive or Cascading.
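To make “write a UDF” concrete, here is a minimal sketch of a Pig eval function in Java. The package, class name and behavior (lowercasing a string) are illustrative only, assuming Pig’s jars are on the classpath:

[source,java]
----
package myudfs;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Registered and called from a Pig script, e.g.:
//   REGISTER myudfs.jar;
//   cleaned = FOREACH records GENERATE myudfs.LowerCase(title);
public class LowerCase extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;                 // Pig convention: null in, null out
    }
    return ((String) input.get(0)).toLowerCase();
  }
}
----

That is the whole program: the framework handles job setup, data movement and serialization, leaving you only the interesting part.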

On the surface, directly coding against Hadoop’s Map/Reduce interface promises the fastest possible execution speed. But not only is that the wrong metric to consider, the practical performance advantage often lies strongly with the framework code.

As we’ll describe more in the performance and tuning chapters (REF), Hadoop jobs scale nearly linearly, and sometimes superlinearly (doubling the cluster size better than halves the execution time). If your concern is the start-to-finish runtime of a job, the surest and cheapest solution is to use more computers. If your concern is cost, you should consider what we call the cost of insight: not only cluster costs but also development time, maintenance costs, and agility of iteration. Even for projects so large and frequently run that cluster costs dominate, we recommend first prototyping in a high-level framework and then judiciously rewriting portions as needed. (These frameworks all allow you to call out to standalone map/reduce programs from within a job.)

The primary problem is that programs written against the direct API require a large surrounding mass of boilerplate code: configuration handling, data structures, serialization, error handling and so forth. This surrounding code is ugly and boring, and in practice takes more time, produces more bugs, and carries a higher maintenance burden than the important stuff. In effect, your codebase will asymptotically approach a crappy subset of Pig, Hive or Cascading.
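For a sense of how much that boilerplate weighs, here is a sketch of the canonical word-count job written against the direct API: on the order of fifty lines before any real business logic, error handling, or counters appear, where the equivalent Pig script is a handful of lines.

[source,java]
----
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);             // emit (word, 1) per token
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) { sum += val.get(); }
      result.set(sum);
      context.write(key, result);             // emit (word, total count)
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // safe: counting is associative
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
----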

More importantly, the high-level frameworks provide implementations far better than it’s worth your time to recreate. They evolved to solve the same important problems you will hit. Pig, Hive and Cascading all offer large-scale optimizations that avoid moving data to disk or over the network. Compared to the marginal gains of coding directly against the API, those optimizations often speed up the slowest stages of your job by several multiples.

Rolling your own code can also hold rude surprises in store. As an example, many people make the easy mistake of using Java collection classes to hold grouped records in memory. This seems to work well in development, but explodes when the code is run on production loads (or, even worse, several months later when the data size grows past some threshold). The Pig DataBag structure, on the other hand, keeps tabs on overall memory consumption and will intelligently spill to disk, providing the interface of an infinitely large data structure that works in concert with how Hadoop streams data.
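Here is a sketch of that contrast, using Pig’s real BagFactory and Tuple types; the method names and the reducer-style Iterable input are illustrative, not from any particular codebase:

[source,java]
----
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class GroupBuffering {
  // The fragile way: an in-memory list works fine in development,
  // then throws OutOfMemoryError once a single group outgrows the heap.
  static List<Tuple> bufferInMemory(Iterable<Tuple> groupedRecords) {
    List<Tuple> buffer = new ArrayList<>();
    for (Tuple t : groupedRecords) { buffer.add(t); }
    return buffer;
  }

  // The robust way: Pig's DataBag tracks its own memory footprint and
  // spills to disk under pressure, so a group of any size behaves like
  // an ordinary in-memory collection.
  static DataBag bufferWithSpill(Iterable<Tuple> groupedRecords) {
    DataBag bag = BagFactory.getInstance().newDefaultBag();
    for (Tuple t : groupedRecords) { bag.add(t); }
    return bag;
  }
}
----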

The high-level frameworks dramatically reduce the amount of code you must write and maintain, offer practical performance advantages at high scale, and do not significantly reduce your power to write and integrate arbitrary code. What matters is the scalability of your team and your workload, not raw execution time.