hadoop-mapreduce-user mailing list archives

From fred smith <dopey...@gmail.com>
Subject MapReduce on binary data
Date Fri, 06 Aug 2010 06:55:16 GMT

I am playing with netflow data on my small Hadoop cluster (20 nodes),
just trying things out. I am a beginner with Hadoop, so please be
gentle with me.

I am currently running MapReduce jobs on text (i.e. formatted) netflow
files. They have already been processed with flow-tools
(http://code.google.com/p/flow-tools/). I use streaming and Python,
rather than coding in Java, and it all works OK.
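For context, my streaming mapper is just a script that reads flow-print text lines from stdin and emits tab-separated key/value pairs. A minimal sketch of that kind of mapper is below; the column layout (source address in field 0, byte count in field 5) is an assumption, since flow-print's output depends on the format flag you give it:

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming mapper sketch: emit (srcaddr, bytes)
# per flow. The column positions are an assumption -- flow-print's
# layout varies with its -f format option, so adjust to your data.
import sys

def map_line(line):
    fields = line.split()
    if len(fields) < 6:
        return None            # skip short/blank lines
    try:
        num_bytes = int(fields[5])
    except ValueError:
        return None            # skip header or non-numeric rows
    return "%s\t%d" % (fields[0], num_bytes)

if __name__ == "__main__":
    for line in sys.stdin:
        out = map_line(line)
        if out is not None:
            print(out)
```

A reducer on the other side can then sum the byte counts per source address.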

The issue I am facing is performance. I can concatenate one day's
formatted logs into a single file of about 18GB, so 18GB per day works
out to around 6.5TB of files per year. But the concatenation takes a
long time, and the result is slow to process afterwards.

The original data is heavily compressed - flow-tools is extremely
efficient at that! I am trying to work out whether I can process the
binary datasets directly, to save space and hopefully get better
performance as well.

flow-tools file ----> flow-cat & flow-print -----> formatted text file
3GB binary  ------------------------------------------> 18GB ASCII

The problem is that I don't think the binary files can be processed
efficiently because of the compression, can they? I can't split a
binary file and have it processed in parallel? Sorry for such a basic
question, but I am having trouble imagining how MapReduce will work
with binary files.
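To make my worry concrete: decompression of a stream-compressed file has to start at the beginning of the stream, so a mapper handed a split from the middle of the file has nothing it can decode. Here is a small illustration using plain zlib as a stand-in (flow-tools uses zlib internally, but this is a generic demonstration, not the actual flow-tools file format):

```python
# Why stream-compressed files are not splittable: a split that
# starts mid-stream cannot be decompressed. zlib is used here as
# a stand-in for flow-tools' internal compression.
import zlib

data = b"flow record " * 10000
compressed = zlib.compress(data)

# Decompressing from byte 0 works fine.
assert zlib.decompress(compressed) == data

# Decompressing from an arbitrary split point does not.
try:
    zlib.decompress(compressed[len(compressed) // 2:])
    splittable = True
except zlib.error:
    splittable = False

print(splittable)  # a mid-file split cannot be decoded
```

So unless each file is small enough to go to a single mapper whole, the compression seems to defeat the parallelism.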

Should I just forget about binary and stop worrying about the 6.5TB,
as that isn't a lot in Hadoop terms?

