hadoop-common-user mailing list archives

From	Andrzej Bialecki <...@getopt.org>
Subject Re: map/reduce, large content files and distributed filesystem question
Date Tue, 20 Nov 2007 16:42:28 GMT
Eugeny N Dzhurinsky wrote:
> Hello, gentlemen!
> We are trying to adapt Hadoop to suit our application (or, mostly, to adapt
> our application to fit Map/Reduce and Hadoop ;) ), and I have several questions:
> 1) When doing the map part of a job, our application creates a BLOB which
> we will need to save and then re-use in another part of the application.
> This BLOB has a unique but large key, which comes from the application's
> domain and is by nature a string. The content of the BLOB is used within
> the map job, so first of all we need to obtain the content (BLOB), pass a
> stream created on this BLOB to a legacy part of our application, which
> does not need to know about Hadoop (in general it just accepts an instance
> of java.io.InputStream), and then return a key/value pair to the reducer
> (where the key is the original unique key we adopted, and the value is the
> BLOB we got after processing in the map job).
> After looking at the Hadoop API documentation, I found that there are
> several implementations of OutputFormat available; however, I'm not sure
> which one I should use to store a large value (which could be several tens
> of megabytes in size). Also, how would we get the content of the BLOB to
> pass it back to the legacy application? We want to avoid keeping the BLOB
> in RAM because of its size.

Hi Eugeny,

I work with an application that faces a somewhat similar problem. If you 
keep using large keys, at some point tasks will start failing with an 
OutOfMemory exception, because Hadoop needs to keep many keys completely 
in RAM ...

The solution I found was to replace large keys with equivalent small 
keys, i.e. hashes. Large content (both keys and values) is stored in 
HDFS. For large keys (which are still somewhat smaller than the values) 
I use a MapFile of <hash, largeKey>, and I store large values in regular 
files on HDFS, named by the hash.

This way, I can use <hash, largeKey> as the input to map-reduce, and 
retrieve large values lazily from HDFS at a pre-defined path + "/" + hash.
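
To make that concrete, the write side looks roughly like this (untested
sketch, off the top of my head; the paths, the Text key/value types and the
choice of MD5 are just placeholders for whatever fits your data):

import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;
import java.util.SortedMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MD5Hash;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

/** Sketch: a MapFile of <hash, largeKey> plus one plain HDFS file per BLOB,
 *  named by the hash of its key. */
public class BlobStoreSketch {

  public static final String KEY_INDEX = "/data/blobs/key-index"; // MapFile dir
  public static final String BLOB_BASE = "/data/blobs/content";   // one file per hash

  /** Small, fixed-size stand-in for the original large key. */
  public static String hashOf(String largeKey) throws java.io.IOException {
    return MD5Hash.digest(largeKey.getBytes("UTF-8")).toString();
  }

  /** Build the <hash, largeKey> index. MapFile requires keys to be appended
   *  in sorted order, which a reduce phase gives you for free. */
  public static void writeIndex(Configuration conf, SortedMap<String, String> hashToKey)
      throws java.io.IOException {
    FileSystem fs = FileSystem.get(conf);
    MapFile.Writer index = new MapFile.Writer(conf, fs, KEY_INDEX, Text.class, Text.class);
    for (Map.Entry<String, String> e : hashToKey.entrySet()) {
      index.append(new Text(e.getKey()), new Text(e.getValue()));
    }
    index.close();
  }

  /** Stream one BLOB into its own file, without ever holding it in RAM. */
  public static void writeBlob(Configuration conf, String hash, InputStream blob)
      throws java.io.IOException {
    FileSystem fs = FileSystem.get(conf);
    OutputStream out = fs.create(new Path(BLOB_BASE, hash));
    byte[] buf = new byte[64 * 1024];
    int n;
    while ((n = blob.read(buf)) != -1) {
      out.write(buf, 0, n);
    }
    out.close();
    blob.close();
  }
}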

> 2) Our application needs to get data from these saved BLOBs later, as well as
> some metadata associated with each BLOB. Moreover, it should be possible
> to find a BLOB using certain criteria. As far as I understand Map/Reduce, it

The question is whether you need this search to run as a batch job or 
in real time. The solution I outlined above lets you quickly look up 
the largeKey and also locate the file with the BLOB.
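
The lookup side is then just a MapFile random access plus an open() of the
per-hash file, which also gives your legacy code its plain java.io.InputStream
without ever buffering the BLOB in RAM. Again a rough, untested sketch, with
paths matching the previous snippet:

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class BlobLookupSketch {

  /** Recover the original large key for a hash, or null if it is unknown. */
  public static String largeKeyFor(Configuration conf, String hash)
      throws java.io.IOException {
    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader index = new MapFile.Reader(fs, "/data/blobs/key-index", conf);
    try {
      Text largeKey = new Text();
      return index.get(new Text(hash), largeKey) != null ? largeKey.toString() : null;
    } finally {
      index.close();
    }
  }

  /** Plain InputStream over the BLOB, streamed from HDFS as it is read. */
  public static InputStream openBlob(Configuration conf, String hash)
      throws java.io.IOException {
    FileSystem fs = FileSystem.get(conf);
    return fs.open(new Path("/data/blobs/content", hash));
  }
}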

> should be possible to spawn a set of jobs, which would be executed on
> different data nodes, and the search results returned by these jobs would
> be collected and "reduced" later. The question here is: does Hadoop take
> care of executing as many jobs as there are data nodes available in the
> system, and of passing each job to the remote node that holds the input
> data it is supposed to work on?

Input data is split into portions (InputSplits) by the InputFormat, and a 
single map task is assigned to each split. Map tasks are then allocated 
to as many nodes as needed, taking into consideration data locality, the 
job priority, cluster capacity, the maximum number of tasks per node, etc.
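
For completeness, a driver for such a job could look roughly like this
(untested sketch; IdentityMapper/IdentityReducer stand in for your real
classes, a real mapper would fetch the BLOB lazily by hash as sketched
above, and the exact JobConf methods have moved around a bit between
releases):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class BlobJobSketch {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(BlobJobSketch.class);
    job.setJobName("blob-processing");

    // The <hash, largeKey> MapFile is the input; its data file is a
    // SequenceFile, so SequenceFileInputFormat should be able to read it.
    // The InputFormat computes the splits, and the framework runs one map
    // task per split, preferring nodes that already hold that data.
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setInputPath(new Path("/data/blobs/key-index"));
    job.setOutputPath(new Path("/data/blobs/search-output"));

    // A real mapper would open the per-hash BLOB file here instead of
    // passing records through unchanged.
    job.setMapperClass(IdentityMapper.class);
    job.setReducerClass(IdentityReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // You control the number of reduces; the number of maps follows the splits.
    job.setNumReduceTasks(4);

    JobClient.runJob(job);
  }
}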

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
