hadoop-common-user mailing list archives

From Andrzej Bialecki <...@getopt.org>
Subject Re: map/reduce, large content files and distributed filesystem question
Date Tue, 20 Nov 2007 17:17:50 GMT
Eugeny N Dzhurinsky wrote:
> On Tue, Nov 20, 2007 at 05:42:28PM +0100, Andrzej Bialecki wrote:
>> I work with an application that faces a somewhat similar problem. If you 
>> keep using large keys, at some point tasks will start failing with an 
>> OutOfMemoryError, because Hadoop needs to keep many keys completely in 
>> RAM ...
>> The solution I found was to replace large keys with equivalent small keys, 
>> i.e. hashes. Large content (both keys and values) is stored in HDFS. For 
>> large keys (which are still somewhat smaller than the values) I use a 
>> MapFile of <hash, largeKey>, and large values I store in regular files on 
>> HDFS, named by the hash.
>> This way, I can use <hash, largeKey> as the input to map-reduce, and 
>> retrieve large values lazily from HDFS from a pre-defined path + "/" + 
>> hash.
> Okay, good point, thank you. So as far as I understand, in the map job I
> will take the blob, store it into HDFS using a Path built in some way, and
> then obtain an InputStream from the file somehow? Or did I just miss something?

Yes, exactly like this.
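
A rough sketch of what I mean, in case it helps. The /blobs and 
/index/keys paths, and the class and method names, are arbitrary choices 
of mine; also note that MapFile requires keys to be appended in sorted 
order, hence the two-step approach:

import java.io.IOException;
import java.util.Map;
import java.util.SortedMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MD5Hash;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class BlobStore {

  // Write one large value to a plain HDFS file named by its content
  // hash, and collect the <hash, largeKey> pair for the index below.
  public static String storeBlob(FileSystem fs,
                                 SortedMap<String, String> keys,
                                 String largeKey, byte[] content)
      throws IOException {
    String hash = MD5Hash.digest(content).toString();
    FSDataOutputStream out = fs.create(new Path("/blobs/" + hash));
    out.write(content);
    out.close();
    keys.put(hash, largeKey);
    return hash;
  }

  // MapFile entries must be appended in key-sorted order, which the
  // SortedMap guarantees here.
  public static void writeIndex(Configuration conf, FileSystem fs,
                                SortedMap<String, String> keys)
      throws IOException {
    MapFile.Writer index =
        new MapFile.Writer(conf, fs, "/index/keys",
                           Text.class, Text.class);
    for (Map.Entry<String, String> e : keys.entrySet()) {
      index.append(new Text(e.getKey()), new Text(e.getValue()));
    }
    index.close();
  }
}

Looking up the largeKey for a given hash later is then a cheap 
MapFile.Reader.get(new Text(hash), new Text()) against the same index 
directory.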

>>> 2) Our application needs to get data from these saved BLOBs later, as
>>> well as some meta-data associated with each BLOB. Moreover, there should
>>> be an ability to find the BLOB using certain criteria. As far as I
>>> understand Map/Reduce, it
>> The question is whether you need this search to occur as a batch job, or in 
>> real-time. The solution I outlined above allows you to quickly look up the 
>> largeKey and also locate the file with the BLOB.
> Well, I was thinking about scheduling a job with "search conditions" as its
> input; the job would perform the search and return key/value pairs somehow,
> which would be used later. Those key/value pairs would refer to other
> blobs, if that matters. And there could be several jobs of this kind in the
> system at the same time, so I would need to separate them somehow to avoid
> data interference.

I probably don't understand your requirements - what you described looks 
like one of the example applications supplied with Hadoop, namely Grep. 
This application selects specific input records according to a pattern.
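
If that matches, you can try it directly from the distribution (the 
exact jar name depends on your release):

bin/hadoop jar hadoop-*-examples.jar grep <indir> <outdir> <regex>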

>> Input data is split into portions (InputSplits) by the InputFormat, and a 
>> single map task is assigned to each split. Then map tasks are allocated to 
>> as many nodes as needed, taking into consideration the job priority, 
>> cluster capacity, max. number of tasks per node, etc.
> Looks like I might be missing some core concept of Hadoop. As far as I
> understood, the data is always kept on the same node that executes the job
> which produces the data. This means the blobs will be stored on the same
> node where the job was started, and other nodes will not know anything about
> such blobs. So when doing a search, the jobs will be started on all nodes,
> and some of the jobs will return data and some will not, but the end user
> will be presented with the merged data?

If you use HDFS, and you run datanodes and tasktrackers on each node, 
then Hadoop will try to execute map tasks on the node that holds the 
part of input data allocated to the task. We are talking about 
individual data blocks here - a single file consists of many data blocks 
and they may be located on many datanodes. In other words, parts of your 
blob will be located on different machines across the cluster. HDFS 
hides this and provides a simple InputStream to retrieve the whole content.
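
So retrieving a blob by its hash, from a task or any other client, is 
just an ordinary stream read. A minimal sketch, assuming the 
/blobs/<hash> layout from my earlier mail:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlobReader {

  // Fetch a large value from HDFS given its small hash key. HDFS works
  // out which datanodes hold the individual blocks; the caller just
  // sees one continuous stream.
  public static byte[] readBlob(FileSystem fs, String hash)
      throws IOException {
    Path path = new Path("/blobs/" + hash);
    byte[] content = new byte[(int) fs.getFileStatus(path).getLen()];
    FSDataInputStream in = fs.open(path);
    try {
      in.readFully(0, content);    // positioned read of the whole file
    } finally {
      in.close();
    }
    return content;
  }
}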

Then, each reduce task will produce one part of the output, again as a 
file on HDFS - so you will possibly get multiple files on HDFS which 
together form the output of a job. These files are located on HDFS, so 
again individual data blocks may be located anywhere on the cluster, but 
using FileSystem.open(...) you can retrieve the complete content of 
such files as an InputStream (or wrap it in a more specialized reader, 
such as DataInputStream, SequenceFile.Reader, MapFile.Reader, etc.).
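
For example, here is a small sketch for dumping one output part, 
assuming the job wrote SequenceFile output with Text keys and values 
(the file names and classes would differ otherwise):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class OutputDump {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Reduce outputs are named part-00000, part-00001, ... in the
    // job's output directory; this reads just the first one.
    Path part = new Path(args[0], "part-00000");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
    Text key = new Text();
    Text value = new Text();
    while (reader.next(key, value)) {   // false at end of file
      System.out.println(key + "\t" + value);
    }
    reader.close();
  }
}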

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
