hadoop-common-user mailing list archives

From Eugeny N Dzhurinsky <b...@redwerk.com>
Subject Re: map/reduce, large content files and distributed filesystem question
Date Tue, 20 Nov 2007 16:57:57 GMT
On Tue, Nov 20, 2007 at 05:42:28PM +0100, Andrzej Bialecki wrote:
> I work with an application that faces a somewhat similar problem. If you 
> keep using large keys, at some point tasks will start failing with 
> OutOfMemory exception, because Hadoop needs to keep many keys completely in 
> RAM ...
> The solution I found was to replace large keys with equivalent small keys, 
> ie. hashes. Large content (both keys and values) are stored in HDFS. For 
> large keys (which are still somewhat smaller than values) I use a MapFile 
> of <hash, largeKey>, and large values I store in regular files on HDFS, 
> named by the hash.
> This way, I can use <hash, largeKey> as the input to map-reduce, and 
> retrieve large values lazily from HDFS from a pre-defined path + "/" + 
> hash.
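The hash substitution described above can be sketched in plain Java. The `hashKey` helper and the choice of MD5 are illustrative assumptions, not necessarily what Andrzej's code uses; any stable, small digest would serve:

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashKey {
    // Replace a large key with a small fixed-size surrogate.
    // MD5 is an assumption here; any stable digest works.
    public static String hashKey(byte[] largeKey) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(largeKey);
            // Render the 16-byte digest as a 32-character hex string.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String h = hashKey("some very large key ...".getBytes());
        System.out.println(h + " (" + h.length() + " hex chars)");
    }
}
```

The MapFile then holds <hash, largeKey> pairs, while each large value sits in an HDFS file named by its hash.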

Okay, good point, thank you. So as far as I understand, in the map task I will
take the blob, store it in HDFS under a Path built in some way, and later
obtain an InputStream from that file? Or did I miss something?
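That is roughly the pattern. The sketch below illustrates it with `java.nio` and a local directory standing in for HDFS; on a real cluster the same shape would use Hadoop's `FileSystem.create(path)` and `FileSystem.open(path)`. The `BlobStore` name and directory layout are assumptions for illustration:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class BlobStore {
    private final Path baseDir;

    public BlobStore(Path baseDir) throws IOException {
        this.baseDir = Files.createDirectories(baseDir);
    }

    // Map phase: write the blob under its hash; emit only the hash downstream.
    public String put(String hash, byte[] blob) throws IOException {
        Files.write(baseDir.resolve(hash), blob); // HDFS: fs.create(new Path(base, hash))
        return hash;
    }

    // Later: lazily re-open the blob from the pre-defined path + "/" + hash.
    public InputStream open(String hash) throws IOException {
        return Files.newInputStream(baseDir.resolve(hash)); // HDFS: fs.open(...)
    }
}
```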

>> 2) Our application needs to get data from these saved BLOBs later, as well
>> as some meta-data associated with each BLOB. Moreover, there should be an
>> ability to find the BLOB using certain criteria. As far as I understand
>> Map/Reduce, it
> The question is whether you need this search to occur as a batch job, or in 
> real-time. The solution I outlined above allows you to quickly lookup the 
> largeKey and also locate a file with the BLOB.

Well, I was thinking about scheduling a job with "search conditions" as its
input; the job would perform the search and somehow return key/value pairs to
be used later. Those key/value pairs would refer to other blobs, if that
matters. And there could be several jobs of this kind running in the system at
the same time, so I would need to separate them somehow to avoid data
interference.
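One common way to keep concurrent jobs from interfering is to give each job a private output directory keyed by a unique job id. A minimal sketch, where the naming scheme and directory layout are assumptions:

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.UUID;

public class JobOutput {
    // Each submitted search job writes its key/value results under
    // a private directory, so concurrent jobs never mix their data.
    public static Path outputDirFor(Path base, String jobName) {
        String jobId = jobName + "-" + UUID.randomUUID();
        return base.resolve(jobId);
    }

    public static void main(String[] args) {
        Path base = Paths.get("/tmp/search-results");
        System.out.println(outputDirFor(base, "search"));
        System.out.println(outputDirFor(base, "search"));
    }
}
```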

>> should be possible to spawn a set of jobs, which would be executed on
>> different data nodes, and the results of searches returned by these jobs
>> will be collected and "reduced" later. The question here - does Hadoop take
>> care of executing as many jobs as there are data nodes in the system, and
>> passing the jobs to the remote nodes with the same input data used to
>> start the job?
> Input data is split into portions with the help of Partitioner, and a 
> single map task is assigned to each part. Then map tasks are allocated to 
> as many nodes as needed, taking into consideration the job priority, 
> cluster capacity, max. number of tasks per node, etc.
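The splitting described above can be illustrated with a toy splitter that cuts the input into per-task portions. This is a deliberate simplification of Hadoop's actual split machinery, not its real code:

```java
import java.util.ArrayList;
import java.util.List;

public class Splitter {
    // Cut `records` into contiguous splits of at most ceil(n / numTasks)
    // records each, one split per map task, as evenly as possible.
    public static <T> List<List<T>> split(List<T> records, int numTasks) {
        List<List<T>> splits = new ArrayList<>();
        int n = records.size();
        int per = (n + numTasks - 1) / numTasks; // ceiling division
        for (int i = 0; i < n; i += per) {
            splits.add(records.subList(i, Math.min(i + per, n)));
        }
        return splits;
    }
}
```

Each split then becomes the input of one map task, and the framework schedules those tasks across the available nodes.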

It looks like I might be missing a core concept of Hadoop. As far as I
understood, data is always kept on the same node that ran the job which
produced it. That means the blobs will be stored on the node where the job
was started, and other nodes will not know anything about them. So when doing
a search, jobs will be started on all nodes, some will return data and some
will not, and the end user will be presented with the merged results?

Eugene N Dzhurinsky
