hadoop-common-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Advice wanted
Date Thu, 26 Oct 2006 00:40:59 GMT

> Grant Ingersoll wrote:
>> Hi,
>> I have two tasks (although one is kind of a special case of the  
>> other) I am looking for some advice on how best to take advantage  
>> of Hadoop with:
>> 1.  I have a list of tokens that need to be translated from the  
>> source language to the destination language.  My approach is to  
>> take the tokens, write them out to the FileSystem, one per line,  
>> and then distribute (map) them onto the cluster for translation as  
>> Text.   I am not sure, however, how best to pass along the  
>> metadata needed (source language and destination language).  My  
>> thoughts are to add the source and dest lang. to the JobConf (but  
>> I could also see encoding it into the name of the file on the file  
>> system and then into the key).  Then during the Map phase, I would  
>> need to either get the properties out of the JobConf or decode the  
>> key to figure out the source and target languages.
> Source and target language are two configuration properties for the  
> whole job, so passing them inside JobConf seems like the best way.  
> Each map/reduce task will get the same JobConf, including your  
> properties.

Cool, I figured JobConf was also distributed, but wasn't 100% certain.
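
To make the idea concrete, here is a minimal plain-Java sketch of job-wide properties being handed to every task. It is not Hadoop code: java.util.Properties stands in for JobConf, and the property names (translate.source.lang, translate.target.lang) and the class name are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Plain-Java stand-in: Properties plays the role of JobConf here.
class JobConfSketch {
    // Hypothetical property names, not defined by Hadoop.
    static final String SRC = "translate.source.lang";
    static final String TGT = "translate.target.lang";

    // Done once, by whoever submits the job.
    static Properties newJob(String sourceLang, String targetLang) {
        Properties conf = new Properties();
        conf.setProperty(SRC, sourceLang);
        conf.setProperty(TGT, targetLang);
        return conf;
    }

    // What a single task would do: read the language pair from its config.
    static String describeTask(int taskId, Properties conf) {
        return "task " + taskId + ": "
                + conf.getProperty(SRC) + "->" + conf.getProperty(TGT);
    }

    // Every task receives its own copy of the same job configuration.
    static List<String> runTasks(Properties conf, int numTasks) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            Properties copy = new Properties();
            copy.putAll(conf);
            out.add(describeTask(i, copy));
        }
        return out;
    }
}
```

The point of the sketch is only that the language pair is set once per job and read identically by each task, so nothing needs to be encoded into file names or keys.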

> Using the standard TextInputFormat you get <lineNo, lineSrcText> in  
> your map(), which you would then output after translation as  
> <lineSrcText, lineTgtText>. If you accidentally have duplicate  
> lines in the input, you will get multiple values in reduce, because  
> the same lineSrcText key would be associated with multiple  
> translations.

Makes sense.
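
The data flow described above can be modeled in plain Java, without any Hadoop classes. In this sketch the line-number key is dropped, each line is emitted as <lineSrcText, lineTgtText>, and grouping by key shows how a duplicated input line ends up with several values in reduce. The class name, the translate function, and the "keep the first value" reduce policy are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

class TranslateFlowSketch {
    // Map phase plus grouping: emit (source line, translation) for each
    // input line and collect the values under the source-line key.
    static Map<String, List<String>> mapAndGroup(
            List<String> lines, UnaryOperator<String> translate) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String line : lines) { // the line number is ignored
            grouped.computeIfAbsent(line, k -> new ArrayList<>())
                   .add(translate.apply(line));
        }
        return grouped;
    }

    // One possible reduce: collapse duplicates by keeping the first value.
    static Map<String, String> reduceFirst(Map<String, List<String>> grouped) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            out.put(e.getKey(), e.getValue().get(0));
        }
        return out;
    }
}
```

With the input lines "cat", "dog", "cat", the key "cat" arrives in reduce with two values, which is the duplicate case mentioned in the quoted reply.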

> Actually, if you need to translate this into several languages, you  
> could loop in map() through all target languages, and output as  
> many translation tuples as needed, as <lineSrcText, <lang,  
> lineTgtText>> - then in your reduce() you would get them nicely  
> collected under a single key (lineSrcText) and all translated  
> values in Iterator.

No need there, each job will be one language pair, but it is an  
interesting idea that may be worth pursuing down the road.
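
For reference, the multi-language variant could look roughly like the following plain-Java model (again no Hadoop classes). Each line is emitted once per target language as <lineSrcText, <lang, lineTgtText>>, and grouping by the source line collects all translations under one key. The class name and the translate function are hypothetical.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

class MultiTargetSketch {
    // translate takes (sourceText, targetLang) and returns the translation.
    static Map<String, List<Map.Entry<String, String>>> mapAndGroup(
            List<String> lines,
            List<String> targetLangs,
            BinaryOperator<String> translate) {
        Map<String, List<Map.Entry<String, String>>> grouped = new LinkedHashMap<>();
        for (String line : lines) {
            for (String lang : targetLangs) {
                // Emit one (lang, translation) tuple per target language,
                // grouped under the source line as the key.
                grouped.computeIfAbsent(line, k -> new ArrayList<>())
                       .add(new SimpleImmutableEntry<>(lang, translate.apply(line, lang)));
            }
        }
        return grouped;
    }
}
```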

>> 2. This time, instead of tokens I have X number of whole documents  
>> that need to be translated from source to destination and the way  
>> the translation systems work, it is best to have the whole  
>> document together when getting a translation.  My plan here is to  
>> implement my own InputFormat again, this time returning the whole  
>> document from the RecordReader.next() and overriding getSplits()  
>> in InputFormatBase to return only one split per file, regardless  
>> of numSplits.  Again, I would need to put the metadata somewhere,  
>> either the JobConf or the key.
>> Is there a better way of doing this or am I on the right track?
> Basically, it's ok - the only problematic aspect is that if you  
> have millions of documents then using this method you will get  
> millions of map tasks to execute, because you create as many splits  
> (hence, map tasks) as there are files ... perhaps a better way  
> would be to first wrap these documents into a single SequenceFile  
> consisting of <fileName, fileContent>, and use  
> SequenceFileInputFormat.

OK, that makes more sense.  I wasn't totally clear on SeqFile, but  
based on what you said and looking at it again that seems like a much  
better way to handle it.
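
A rough plain-Java model of why packing helps: once the documents are stored as <fileName, fileContent> records in one container, the number of splits is chosen independently of the number of documents, and each document still arrives whole. This is only a sketch. A list of entries stands in for the SequenceFile, splitting is done by record count rather than by bytes, and the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class PackedDocsSketch {
    // Stand-in for a SequenceFile: an ordered list of
    // (fileName, fileContent) records.
    static List<Map.Entry<String, String>> pack(Map<String, String> docs) {
        return new ArrayList<>(docs.entrySet());
    }

    // Cut the records into at most numSplits contiguous chunks.
    // Records are never broken apart, so each document stays whole.
    static <T> List<List<T>> split(List<T> records, int numSplits) {
        List<List<T>> splits = new ArrayList<>();
        if (records.isEmpty() || numSplits <= 0) {
            return splits;
        }
        int chunk = (records.size() + numSplits - 1) / numSplits; // ceiling
        for (int start = 0; start < records.size(); start += chunk) {
            int end = Math.min(start + chunk, records.size());
            splits.add(new ArrayList<>(records.subList(start, end)));
        }
        return splits;
    }
}
```

Ten documents split three ways give chunks of 4, 4, and 2 records, so three map tasks instead of ten; the same reasoning applies to millions of documents.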

