hadoop-common-dev mailing list archives

From oleksiy <gayduk.a.s...@mail.ru>
Subject Architectural question
Date Sun, 10 Apr 2011 21:11:58 GMT

Hi all,
I have an architectural question.
My app has 50 GB of persistent data stored in HDFS as a simple CSV file.
The app runs over this 50 GB of data and also takes 10 GB of input data,
likewise in CSV format.
The persistent data and the input data have no common keys.

My cluster has 5 data nodes.
The app simply matches every line of the input data against every line of
the persistent data.
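
To make the matching step concrete, here is a minimal Python sketch of what
each task would have to do. It is only an illustration, not my actual code:
the names match_all and matches are hypothetical, and the rule inside matches
is a placeholder for the real comparison. The sketch also assumes the input
lines handed to a task fit in memory, which is not a given for 10 GB.

```python
from typing import Callable, Iterable, Iterator, Tuple


def matches(input_line: str, persistent_line: str) -> bool:
    """Placeholder rule (an assumption): the first CSV field of the input
    line occurs somewhere in the persistent line."""
    return input_line.split(",")[0] in persistent_line


def match_all(
    persistent_lines: Iterable[str],
    input_lines: Iterable[str],
    rule: Callable[[str, str], bool] = matches,
) -> Iterator[Tuple[str, str]]:
    """Yield every (input_line, persistent_line) pair accepted by `rule`."""
    # The input side is kept in a list so it can be rescanned for every
    # persistent line; the persistent side is streamed once.
    inputs = [line.rstrip("\n") for line in input_lines]
    for persistent_line in persistent_lines:
        persistent_line = persistent_line.rstrip("\n")
        for input_line in inputs:
            if rule(input_line, persistent_line):
                yield (input_line, persistent_line)
```

Every persistent line is compared with every input line, so the number of
comparisons is the product of the two line counts.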

I see two different approaches to this task:
1. Distribute the input file to every node using the -files option and run
the job. But in this case every map will go through the full 10 GB of input
data.
2. Divide the input file (10 GB) into 5 parts (for instance) and run 5
independent jobs (one per data node, for instance), giving each job 2 GB of
input data. In this case every map has to go through only 2 GB of data. In
other words, I'll give every map node its own input data. The drawback of
this approach is the work I have to do before the job starts and after it
finishes.
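
The preparation step in the second approach could look roughly like the
Python sketch below. It is an illustration under assumptions, not a tested
part of my app: split_round_robin and total_pairs are hypothetical helper
names, and the split is done on an in-memory list rather than on a real
10 GB file.

```python
from typing import List, Sequence


def split_round_robin(lines: Sequence[str], parts: int) -> List[List[str]]:
    """Deal the input lines into `parts` chunks of nearly equal size."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    chunks: List[List[str]] = [[] for _ in range(parts)]
    for index, line in enumerate(lines):
        chunks[index % parts].append(line)
    return chunks


def total_pairs(persistent_count: int, chunk_sizes: Sequence[int]) -> int:
    """Count line pairs compared when each chunk is matched against all
    persistent lines."""
    return sum(persistent_count * size for size in chunk_sizes)
```

Note that total_pairs gives the same number for one chunk or for five: the
split shrinks what a single map has to scan, but the overall number of line
comparisons stays the same.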

Or maybe there is a more elegant way to do this in Hadoop?

View this message in context: http://old.nabble.com/Architectural-question-tp31365870p31365870.html
Sent from the Hadoop core-dev mailing list archive at Nabble.com.
