hadoop-mapreduce-user mailing list archives

From Harsh J <qwertyman...@gmail.com>
Subject Re: MapReduce on binary data
Date Fri, 06 Aug 2010 07:41:31 GMT
Hello,

One can run MR jobs on compressed files stored on the Hadoop DFS
without needing to decompress chunks out of them manually, though
there are certain exceptions.

It'd help to know what kind of compression algorithm is being applied.
For instance, large GZip or DEFLATE files can't be 'split' across
many mappers; but BZip2 (and LZO, with block index preparation) can
be split into blocks, each of which may be assigned its own mapper,
thus parallelizing your execution.
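
To illustrate, here is a minimal sketch of a job that runs directly
over compressed text input; the paths and class names are just
placeholders I made up. The input format picks the codec from the
file extension and decompresses for you, so a .gz input ends up on a
single mapper while a .bz2 input can be split:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedTextJob {
  // Pass-through mapper: lines arrive already decompressed.
  public static class LineMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws java.io.IOException, InterruptedException {
      ctx.write(line, NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "compressed-input-example");
    job.setJarByClass(CompressedTextJob.class);
    job.setMapperClass(LineMapper.class);
    job.setNumReduceTasks(0);              // map-only pass-through
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    // Placeholder paths: a .bz2 input is split across mappers,
    // a .gz input is handled by a single mapper.
    FileInputFormat.addInputPath(job, new Path("/flows/2010-08-06.bz2"));
    FileOutputFormat.setOutputPath(job, new Path("/flows/out"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}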

IMO, a proper way to concatenate lots of 'small' files (say, hourly)
into one large file for efficient Hadoop MR execution on them would be
to use the provided SequenceFile format. It's a key-value storage
format and can hold all your data in one file as
<hour-filename-string (key), collected-data (value)> pairs, for example.
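
A minimal sketch of packing hourly files that way (the local directory
and HDFS paths below are just placeholders):

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackHourlyFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Placeholder locations: a local directory of hourly files in,
    // one packed SequenceFile on HDFS out.
    File hourlyDir = new File("/data/netflow/2010-08-06");
    Path packed = new Path("/user/paul/netflow/2010-08-06.seq");

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, packed, Text.class, BytesWritable.class);
    try {
      for (File hourly : hourlyDir.listFiles()) {
        byte[] data = Files.readAllBytes(hourly.toPath());
        // key = hour/filename, value = that hour's collected data
        writer.append(new Text(hourly.getName()), new BytesWritable(data));
      }
    } finally {
      writer.close();
    }
  }
}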

Then you could use SequenceFileInputFormat, break your large value
chunks back down into lines and so on, and emit each key-value pair
as you need. The advantage here is that you can also compress these
files per BLOCK (sequences of values are compressed together as a
block) or per RECORD (each value is compressed independently). You can
also specify which compression codec to use (GZip, BZip2 and LZO are
some of the available ones).
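
For instance, the writer from the earlier sketch can be created with
BLOCK compression and a codec of your choice (paths are placeholders
again):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class PackCompressed {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Placeholder output path.
    Path packed = new Path("/user/paul/netflow/2010-08-06-block.seq");

    // Instantiate the codec via ReflectionUtils so it gets the conf.
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

    // BLOCK compression: runs of records are compressed together,
    // which usually compresses better than RECORD for many small values.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, packed, Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK, codec);
    try {
      writer.append(new Text("hour-00"),
          new BytesWritable("...".getBytes()));
      // ... append the remaining hours ...
    } finally {
      writer.close();
    }
  }
}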

API reference: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html
More reading: http://www.cloudera.com/blog/2009/02/the-small-files-problem/

Doing it this way also gives you your 'raw ASCII data' in its original
form inside your mapper (as Text, LongWritable, etc.). The
decompression is performed by the input format and record readers
themselves, abstracted away from the user.
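
A small sketch of the reading side, assuming the <hour-filename, data>
layout from above (paths and class names are again placeholders); the
mapper receives the already-decompressed bytes and just splits them
back into lines:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class ReadPackedFlows {
  // The record reader hands the mapper the original bytes, decompressed;
  // the mapper splits them back into flow lines and counts duplicates.
  public static class FlowMapper
      extends Mapper<Text, BytesWritable, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(Text hour, BytesWritable data, Context ctx)
        throws java.io.IOException, InterruptedException {
      String text = new String(data.getBytes(), 0, data.getLength(), "US-ASCII");
      for (String line : text.split("\n")) {
        ctx.write(new Text(line), ONE);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "read-packed-flows");
    job.setJarByClass(ReadPackedFlows.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setMapperClass(FlowMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/user/paul/netflow/2010-08-06.seq"));
    FileOutputFormat.setOutputPath(job, new Path("/user/paul/netflow/out"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}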

On Fri, Aug 6, 2010 at 12:25 PM, fred smith <dopey483@gmail.com> wrote:
> Hi,
>
> I am playing with netflow data on my small hadoop cluster (20 nodes)
> just trying things out. I am a beginner on hadoop so please be gentle
> with me.
>
> I am currently running map-reduce jobs on text (e.g. formatted) netflow
> files. They are already processed with flow-tools
> (http://code.google.com/p/flow-tools/). I use streaming and Python,
> rather than coding in Java, and it all works OK.
>
> The issue I am facing is performance. I can concatenate one day's
> formatted logs into a single file, and this will be about 18GB in
> size. So, 18GB per day will be around 6.5TB of files per year.
> But it takes a long time to do this, and the result is slow to process afterwards.
>
> The original data is heavily compressed - flow-tools is extremely
> efficient at that! I am trying to sort out if I can do anything with
> the binary datasets and so save space and hopefully get better
> performance as well.
>
> Currently:
> flow-tool file ----> flow-cat & flow-print -----> formatted text file
> 3GB binary  ------------------------------------------> 18GB ASCII
>
> The problem is that I don't think I can get the binary files to be
> processed efficiently because of the compression, can I? I can't split
> the binary file and have it processed in parallel? Sorry for such a
> basic question, but I am having trouble imagining how map reduce
> will work with binary files.
>
> Should I forget about binary and stop worrying about it being 6.5TB, as
> that isn't a lot in Hadoop terms!
>
> Thanks
> Paul
>



-- 
Harsh J
www.harshj.com
