hadoop-common-user mailing list archives

From Narinder Kumar <nku...@inphina.com>
Subject Map-Reduce Applicability With All-In Memory Data
Date Tue, 30 Nov 2010 11:06:40 GMT
Hi All,

We have a problem in hand which we would like to solve using Distributed and
Parallel Processing.

Brief context: We have a map of (Entity, Associated Value) pairs. An entity can have a parent, which in turn has its own parent, and so on until we reach the head of the tree. We have to traverse this tree and do some calculation at every step. As you can see, the tree can be quite deep, and we have a huge list of these maps to process before reaching the final result; processing them sequentially takes quite a long time. We were therefore thinking of using Map-Reduce to split the computation across multiple nodes in a Hadoop cluster and then aggregate the results to get the final output.
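For illustration, the per-entity walk might look something like the following sketch. The `Entity` names, the parent/value maps, and the use of summation as the "calculation at every step" are all assumptions for the example, not part of any real API, and the parent chain is assumed to be acyclic:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the per-entity computation: each entity has an
// optional parent, and we fold a value along the chain from the entity
// up to the head of its tree. Summation stands in for whatever
// calculation is actually done at each step.
public class TreeWalk {

    // parent links: child -> parent; a head has no entry
    static Map<String, String> parent = new HashMap<>();
    // associated value for each entity
    static Map<String, Long> value = new HashMap<>();

    // Walk from 'entity' to the head of its tree, accumulating values.
    static long valueToHead(String entity) {
        long acc = 0;
        String cur = entity;
        while (cur != null) {
            acc += value.getOrDefault(cur, 0L);
            cur = parent.get(cur);   // becomes null once past the head
        }
        return acc;
    }

    public static void main(String[] args) {
        parent.put("c", "b");
        parent.put("b", "a");        // "a" is the head
        value.put("a", 1L);
        value.put("b", 2L);
        value.put("c", 3L);
        System.out.println(valueToHead("c")); // prints 6
    }
}
```

Since each walk only reads shared state, the walks for different entities are independent of one another, which is what makes the problem look parallelizable in the first place.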

Having had a quick read of the documentation and the samples, I see that the Mapper and Reducer read and write through implementations of InputFormat and OutputFormat respectively. All of the implementations appear to be either file- or DB-based. Is there an input/output format which reads and updates data directly in memory?
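As far as I know, the stock formats (e.g. TextInputFormat, SequenceFileInputFormat) are indeed file-based; InputFormat/OutputFormat are extension points, so a custom in-memory format is possible in principle, but the usual workaround is simply to dump the in-memory map as one record per line and let TextInputFormat split it. A minimal sketch of that dump step (plain Java, no Hadoop classes; the tab-separated key/value layout is just one common convention):

```java
import java.util.Map;

// Minimal sketch: serialize an in-memory (entity, value) map as
// tab-separated lines, the shape TextInputFormat turns into one
// record per line for the mappers.
public class DumpToLines {
    static String dump(Map<String, Long> values) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : values.entrySet()) {
            sb.append(e.getKey()).append('\t')
              .append(e.getValue()).append('\n');
        }
        return sb.toString();
    }
}
```

The resulting string would then be written to HDFS (e.g. via FileSystem/FSDataOutputStream) before submitting the job.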

In order to be able to use Map-Reduce, my understanding is in terms of
following steps :

   - Put the starting list of (Entity, Value) maps into one or more input files
   - Use these files as input to the Mapper; traverse to the head of each
   tree and do the corresponding calculation there
   - Emit the Mapper's results to output files
   - Use the Reducer to aggregate/combine these results accordingly
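The steps above can be simulated in a single process to make the data flow concrete. This is only a local sketch with plain collections, not Hadoop's Mapper/Reducer API; it assumes the map step emits (head-of-tree, computed value) and the reduce step sums everything emitted under the same head:

```java
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local, single-process simulation of the four steps above (no Hadoop
// classes). The "map" step walks one entity to the head of its tree and
// emits (head, value accumulated along the chain); the "reduce" step
// aggregates all values emitted under the same head.
public class MapReduceSketch {

    static Map<String, String> parent = new HashMap<>();
    static Map<String, Long> value = new HashMap<>();

    // "Mapper": for one entity, emit (head, sum of values along its chain)
    static Map.Entry<String, Long> mapOne(String entity) {
        long acc = 0;
        String cur = entity;
        String head = entity;
        while (cur != null) {
            acc += value.getOrDefault(cur, 0L);
            head = cur;
            cur = parent.get(cur);
        }
        return new AbstractMap.SimpleEntry<>(head, acc);
    }

    // "Reducer": combine all mapper outputs that share the same head
    static Map<String, Long> reduce(List<Map.Entry<String, Long>> emitted) {
        Map<String, Long> out = new TreeMap<>();
        for (Map.Entry<String, Long> e : emitted) {
            out.merge(e.getKey(), e.getValue(), Long::sum);
        }
        return out;
    }
}
```

In a real job the grouping by key between the two steps is what Hadoop's shuffle phase does for you; here it is folded into the reduce method.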

The potential issue I see with this approach is the round-trip: taking the data out of memory into files and then back into memory again, which might cost us performance. Is this roughly the correct approach for using Map-Reduce in my context, or am I missing the point completely?

Further, I would like to know whether Map-Reduce is an appropriate platform for this kind of scenario, or whether we should think of it only for huge DB/file-based data.

Best Regards
