hadoop-common-user mailing list archives

From "Colin Freas" <colinfr...@gmail.com>
Subject Re: Input/Output Formaters and FileTypes
Date Fri, 20 Jun 2008 21:33:57 GMT
We'd been using text input and output exclusively, but eventually realized
some efficiency improvements by using slightly more sophisticated classes
specific to our application.

Our main use of Hadoop is processing activity logs from a fleet of servers.
We get about 6GB of compressed data per day.  We were running reports based
on different dimensions in our logs.  At first, we were making a pass
through the data for each dimension.  The thing is, if we included the
dimension as part of the key, we could actually do the first MR job we need
in one pass.  But this slightly improved version of our reports still uses
text for input, keys, values, and output.
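
A minimal sketch of the dimension-in-the-key idea, in plain Java with no
Hadoop dependencies.  The record layout ("host status"), the dimension
names, and the class and method names are assumptions for illustration,
not the actual job:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DimensionKeyedCount {
    // Hypothetical record layout: "<host> <status>", separated by one space.
    static final String[] DIMENSIONS = {"host", "status"};

    // "Map" step: emit one composite key (dimension label + value) per
    // dimension, so a single pass over the record covers every dimension.
    static List<String> mapRecord(String line) {
        String[] fields = line.split(" ");
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < DIMENSIONS.length && i < fields.length; i++) {
            keys.add(DIMENSIONS[i] + "\t" + fields[i]);
        }
        return keys;
    }

    // "Reduce" step: sum the occurrences of each composite key.
    static Map<String, Integer> countAll(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String key : mapRecord(line)) {
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("web01 200", "web02 404", "web01 200");
        countAll(lines).forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

Because the dimension label travels with the key, a later job can branch on
it without re-reading the raw logs.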

Where we use a custom class is when we process these intermediate results
into a final summary.  Our Summarizer class is the OutputValueClass for our
job, though the output format is still text (which calls the toString
method of our Summarizer).  Our final MR job operates on elements of
Summarizer, after deciding what to do based on the dimension label in the
key and on certain characteristics of the key and value from the
initial MR job.  This allows us to keep track of 4 independent tallies in
our summarizing MR job.
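
A rough sketch of what such a value class might look like.  The four tally
names are hypothetical, and so is copyOf, a helper added only to exercise
the round trip.  To stay runnable without Hadoop on the classpath, the
class only mirrors the write/readFields method pair of
org.apache.hadoop.io.Writable rather than implementing it; a real job
would declare the interface:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class Summarizer {
    // Four independent tallies; these names are assumptions.
    private long requests;
    private long errors;
    private long bytes;
    private long sessions;

    public void add(long requests, long errors, long bytes, long sessions) {
        this.requests += requests;
        this.errors += errors;
        this.bytes += bytes;
        this.sessions += sessions;
    }

    // Fold another partial summary into this one (what a reducer would do).
    public void merge(Summarizer other) {
        add(other.requests, other.errors, other.bytes, other.sessions);
    }

    // Same signatures as the Writable interface methods.
    public void write(DataOutput out) throws IOException {
        out.writeLong(requests);
        out.writeLong(errors);
        out.writeLong(bytes);
        out.writeLong(sessions);
    }

    public void readFields(DataInput in) throws IOException {
        requests = in.readLong();
        errors = in.readLong();
        bytes = in.readLong();
        sessions = in.readLong();
    }

    // A text output format would write this string as the value.
    @Override
    public String toString() {
        return requests + "\t" + errors + "\t" + bytes + "\t" + sessions;
    }

    // Serialize and deserialize in memory to check the round trip.
    static Summarizer copyOf(Summarizer source) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            source.write(new DataOutputStream(buffer));
            Summarizer copy = new Summarizer();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Summarizer a = new Summarizer();
        a.add(2, 1, 1024, 1);
        Summarizer b = new Summarizer();
        b.add(1, 0, 512, 1);
        a.merge(b);
        System.out.println(copyOf(a));
    }
}
```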

The OutputValueClass was easy to write, though our jobs are fairly
straightforward.  It's easy to see how it could be extended in really
interesting ways to do more.


On Fri, Jun 20, 2008 at 1:10 PM, Mathos Marcer <mathos.marcer@gmail.com>
wrote:

> Presumably, like most, I've started off with mainly using "Text" based
> input and output formatters and using key and values as Text or
> IntWritable.  I've been looking more into the other formatters and
> writable classes and wondering what they would do for me.  To help
> spur some best practices and lessons learned conversations:  What are
> the benefits of the other formatters?  And benefits of MapFiles and
> SequenceFiles?  What are people out there using or have found gave
> them the greatest benefits?
> ==
> MM
