hadoop-common-user mailing list archives

From Mehmet Tepedelenlioglu <mehmets...@gmail.com>
Subject Re: Architectural question
Date Sun, 10 Apr 2011 23:29:08 GMT
My understanding is that you have two sets of strings, S1 and S2, and you want to mark all
strings that belong to both sets. If this is correct, then:

Mapper: for every string K in Si (i = 1 or 2), emit key K with value i.
Reducer: for key K, if the list of values contains both 1 and 2 you have a match, so emit
"K MATCH"; otherwise emit "K NO_MATCH" (or nothing).

I assume the load is not terribly unbalanced. The same logic works for the intersection of
any number of sets: mark each member with the set it came from, then reduce over the keys to
check whether each one belongs to every set.
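
For concreteness, here is a minimal sketch of that reduce-side join in Java. It is an
illustration, not code from this thread: it assumes a recent org.apache.hadoop.mapreduce
API, uses MultipleInputs to tag records coming from the two input directories, and the
class names and argument layout are made up.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IntersectionJoin {

  // Tag each record with the set it came from: 1 for S1, 2 for S2.
  public static class Set1Mapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable TAG = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(line, TAG);  // the whole CSV line serves as the join key K
    }
  }

  public static class Set2Mapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable TAG = new IntWritable(2);
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(line, TAG);
    }
  }

  // A key K is in the intersection iff its value list contains both tags.
  public static class IntersectReducer extends Reducer<Text, IntWritable, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> tags, Context ctx)
        throws IOException, InterruptedException {
      boolean in1 = false, in2 = false;
      for (IntWritable t : tags) {
        if (t.get() == 1) in1 = true; else in2 = true;
      }
      ctx.write(key, new Text(in1 && in2 ? "MATCH" : "NO_MATCH"));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "set intersection");
    job.setJarByClass(IntersectionJoin.class);
    // Route each input directory through its own tagging mapper.
    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, Set1Mapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, Set2Mapper.class);
    job.setReducerClass(IntersectReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

You would run it with something like: hadoop jar join.jar IntersectionJoin <s1-dir> <s2-dir>
<out-dir>. Using the whole line as the key gets the grouping from the shuffle for free; if
only some CSV columns form the key, extract those columns in the mappers instead.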

Good luck.


On Apr 10, 2011, at 2:10 PM, oleksiy wrote:

> 
> Hi all,
> I have an architectural question.
> For my app I have 50 GB of persistent data stored in HDFS as a simple
> CSV file.
> The app also takes 10 GB of input data, likewise in CSV format.
> The persistent data and the input data don't have common keys.
> 
> In my cluster I have 5 data nodes.
> The app simply matches every line of the input data against every line
> of the persistent data.
> 
> For solving this task I see two different approaches:
> 1. Distribute the input file to every node using the -files attribute, and
> run the job. But in this case every map will go through the whole 10 GB of
> input data.
> 2. Divide the input file (10 GB) into 5 parts (for instance), run 5
> independent jobs (one per data node, for instance), and give each job 2 GB
> of data. In this case every map goes through only 2 GB of data; in other
> words, every map node gets its own input data. But the drawback of this
> approach is the work I have to do before the job starts and after it
> finishes.
> 
> And maybe there is a more subtle way in Hadoop to do this work?
> 
> -- 
> View this message in context: http://old.nabble.com/Architectural-question-tp31365863p31365863.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> 

