hadoop-common-dev mailing list archives

From "Anis Ahmed" <anis...@gmail.com>
Subject Question on static chunking.
Date Wed, 24 Jan 2007 14:20:28 GMT

I need to solve the following problem in the MAP REDUCE paradigm and am looking for suggestions.

I have a million entries in a file, one per line, which is my input.
I have a series of Hadoop jobs that work on these entries. Most of them
process one entry at a time, but one specific Hadoop job needs to look at 50
entries at a time, analyze them, and then apply some business logic. My
problem has been how to access exactly 50 entries in one go.

Options I am considering are:

1. Do the processing as part of REDUCE. I will make sure MAP emits the same
intermediate key for each batch of 50 entries (keep a static counter and
change the intermediate key after every 50 entries), so that REDUCE
gets an iterator over exactly 50 entries.
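The key assignment in option 1 can be sketched in plain Java (not the Hadoop Mapper API; names like groupIntoBatches are illustrative). The intermediate key is simply counter / 50, so consecutive runs of 50 entries share a key and the shuffle delivers them together:

```java
import java.util.*;

public class BatchKeyDemo {
    static final int BATCH_SIZE = 50;

    // Simulates the map side: every run of 50 consecutive entries
    // gets the same intermediate key (counter / BATCH_SIZE), so the
    // framework's grouping would hand each batch to one reduce() call.
    static Map<Long, List<String>> groupIntoBatches(List<String> entries) {
        Map<Long, List<String>> batches = new TreeMap<>();
        long counter = 0;
        for (String entry : entries) {
            long batchKey = counter / BATCH_SIZE;  // same key for 50 entries
            batches.computeIfAbsent(batchKey, k -> new ArrayList<>()).add(entry);
            counter++;
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> entries = new ArrayList<>();
        for (int i = 0; i < 120; i++) entries.add("entry-" + i);
        Map<Long, List<String>> batches = groupIntoBatches(entries);
        // 120 entries -> batches of 50, 50, 20
        System.out.println(batches.size());          // 3
        System.out.println(batches.get(0L).size());  // 50
        System.out.println(batches.get(2L).size());  // 20
    }
}
```

Note that the last batch can hold fewer than 50 entries unless the input size is an exact multiple of 50, and that in a real job the counter is per-mapper, so each map task produces its own batch boundaries.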

2. The option above involves a lot of I/O, sorting, etc. So instead:
inside MAP, keep an in-memory pool (initialized in configure()), and when it
reaches 50 entries, run the business logic and clear the pool.
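A minimal sketch of option 2 in plain Java (again not the real Hadoop Mapper API; PooledBatcher and the Consumer callback are illustrative). One detail worth handling: flush the pool in close() as well, so the final partial batch is not silently dropped:

```java
import java.util.*;
import java.util.function.Consumer;

// Sketch of the in-map pooling idea: accumulate entries in memory
// and run the business logic each time 50 have been collected.
public class PooledBatcher {
    static final int BATCH_SIZE = 50;
    private final List<String> pool = new ArrayList<>();  // set up as configure() would
    private final Consumer<List<String>> bizzLogic;       // per-batch analysis callback

    PooledBatcher(Consumer<List<String>> bizzLogic) {
        this.bizzLogic = bizzLogic;
    }

    // Called once per input record, as map() would be.
    void map(String entry) {
        pool.add(entry);
        if (pool.size() == BATCH_SIZE) flush();
    }

    // Call from close() so the leftover (< 50) entries are still processed.
    void close() {
        if (!pool.isEmpty()) flush();
    }

    private void flush() {
        bizzLogic.accept(new ArrayList<>(pool));  // hand over a copy of the batch
        pool.clear();
    }

    public static void main(String[] args) {
        List<Integer> batchSizes = new ArrayList<>();
        PooledBatcher b = new PooledBatcher(batch -> batchSizes.add(batch.size()));
        for (int i = 0; i < 120; i++) b.map("entry-" + i);
        b.close();
        System.out.println(batchSizes);  // [50, 50, 20]
    }
}
```

This avoids the extra shuffle of option 1, but the batches are then scoped to a single map task (each input split pools independently), which may or may not matter for the analysis.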

I was looking to see if there is a better way to statically group entries
by a pre-determined number and process them in Hadoop.

