hadoop-common-user mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Advice wanted
Date Wed, 25 Oct 2006 20:46:17 GMT
Grant Ingersoll wrote:
> Hi,
>
> I have two tasks (although one is kind of a special case of the other) 
> I am looking for some advice on how best to take advantage of Hadoop 
> with:
>
> 1.  I have a list of tokens that need to be translated from the source 
> language to the destination language.  My approach is to take the 
> tokens, write them out to the FileSystem, one per line, and then 
> distribute (map) them onto the cluster for translation as Text.   I am 
> not sure, however, how best to pass along the metadata needed (source 
> language and destination language).  My thoughts are to add the source 
> and dest lang. to the JobConf (but I could also see encoding it into 
> the name of the file on the file system and then into the key).  Then 
> during the Map phase, I would need to either get the properties out of 
> the JobConf or decode the key to figure out the source and target 
> languages.

Source and target language are two configuration properties for the 
whole job, so passing them inside JobConf seems like the best way. Each 
map/reduce task will get the same JobConf, including your properties.
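A minimal sketch of that idea, in plain Java with no Hadoop classes so that it runs standalone. A Map stands in for the JobConf, and the property names (translate.source.lang, translate.target.lang) as well as the class and method names are made up for illustration. In a real job you would presumably call JobConf.set(name, value) when setting up the job and read the values back with get(name) in the task's configure(JobConf).

```java
import java.util.HashMap;
import java.util.Map;

public class JobPropsSketch {
    // Hypothetical property names, chosen for this example only.
    static final String SRC_KEY = "translate.source.lang";
    static final String TGT_KEY = "translate.target.lang";

    // Stand-in for the job setup code: the settings are stored once per job.
    static Map<String, String> newJobSettings(String src, String tgt) {
        Map<String, String> conf = new HashMap<>();
        conf.put(SRC_KEY, src);
        conf.put(TGT_KEY, tgt);
        return conf;
    }

    // Stand-in for a map task: reads the settings once, before any records.
    static class TranslateTask {
        private String src;
        private String tgt;

        void configure(Map<String, String> conf) {
            src = conf.get(SRC_KEY);
            tgt = conf.get(TGT_KEY);
        }

        // Placeholder "translation" that only tags the token with the language pair.
        String map(String token) {
            return src + "->" + tgt + ":" + token;
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = newJobSettings("en", "pl");
        // Two tasks configured from the same settings see the same languages.
        TranslateTask a = new TranslateTask();
        TranslateTask b = new TranslateTask();
        a.configure(conf);
        b.configure(conf);
        System.out.println(a.map("house"));
        System.out.println(b.map("tree"));
    }
}
```

The point is only that the languages are set once for the job and every task reads the same values, so nothing has to be encoded into file names or keys.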

Using the standard TextInputFormat you get <lineNo, lineSrcText> in your 
map(), which you would then output after translation as <lineSrcText, 
lineTgtText>. If you accidentally have duplicate lines in the input, you 
will get multiple values in reduce, because the same lineSrcText key 
would be associated with multiple translations.

Actually, if you need to translate this into several languages, you 
could loop in map() through all target languages, and output as many 
translation tuples as needed, as <lineSrcText, <lang, lineTgtText>> - 
then in your reduce() you would get them nicely collected under a single 
key (lineSrcText), with all translated values in the Iterator.
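To make the data flow concrete, here is a rough simulation in plain Java (no Hadoop classes, so it runs on its own). The translate() method is a hypothetical placeholder, and the group() method only imitates what the framework does between map and reduce; all names here are assumptions for the example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiLangSketch {
    // Value type for the <lang, lineTgtText> pair.
    static class Translation {
        final String lang;
        final String text;
        Translation(String lang, String text) { this.lang = lang; this.text = text; }
    }

    // Hypothetical placeholder for a real translation call.
    static String translate(String text, String targetLang) {
        return "[" + targetLang + "] " + text;
    }

    // map(): emit one <lineSrcText, <lang, lineTgtText>> pair per target language.
    static List<Map.Entry<String, Translation>> map(String line, List<String> targets) {
        List<Map.Entry<String, Translation>> out = new ArrayList<>();
        for (String lang : targets) {
            out.add(new AbstractMap.SimpleEntry<>(line, new Translation(lang, translate(line, lang))));
        }
        return out;
    }

    // Imitates the grouping done before reduce(): all values for one key end up together.
    static Map<String, List<Translation>> group(List<String> lines, List<String> targets) {
        Map<String, List<Translation>> grouped = new LinkedHashMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Translation> e : map(line, targets)) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("hello", "world", "hello");
        List<String> targets = List.of("fr", "de");
        // reduce() would be called once per key with the collected values.
        for (Map.Entry<String, List<Translation>> e : group(lines, targets).entrySet()) {
            System.out.println(e.getKey() + " has " + e.getValue().size() + " values");
        }
    }
}
```

Note that the duplicate input line shows up as extra values under the same key, which is the effect described above for duplicate lines.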

>
> 2. This time, instead of tokens I have X number of whole documents 
> that need to be translated from source to destination and the way the 
> translation systems work, it is best to have the whole document 
> together when getting a translation.  My plan here is to implement my 
> own InputFormat again, this time returning the whole document from the 
> RecordReader.next() and overriding getSplits() in InputFormatBase to 
> return only one split per file, regardless of numSplits.  Again, I 
> would need to put the metadata somewhere, either the JobConf or the key.
>
> Is there a better way of doing this or am I on the right track?

Basically, it's OK. The only problematic aspect is that if you have
millions of documents, then with this method you will get millions of
map tasks to execute, because you create as many splits (hence, map
tasks) as there are files. Perhaps a better way would be to first wrap
these documents into a single SequenceFile consisting of <fileName,
fileContent> records, and use SequenceFileInputFormat.
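
The packing step could look roughly like the following. This is only an illustration of the <fileName, fileContent> record idea in plain Java: it writes a simple length-prefixed byte layout of my own, not the real SequenceFile format, and the class and method names are made up. In an actual job you would write the records with Hadoop's SequenceFile writer instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class PackDocsSketch {
    // Pack many small documents into one container of <fileName, fileContent> records.
    static byte[] pack(Map<String, String> docs) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            for (Map.Entry<String, String> e : docs.entrySet()) {
                byte[] content = e.getValue().getBytes(StandardCharsets.UTF_8);
                out.writeUTF(e.getKey());      // key: file name
                out.writeInt(content.length);  // length prefix for the value
                out.write(content);            // value: whole document
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buf.toByteArray();
    }

    // Read the records back; each one is a whole document, never a fragment.
    static Map<String, String> unpack(byte[] packed) {
        Map<String, String> docs = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(packed))) {
            while (in.available() > 0) {
                String name = in.readUTF();
                byte[] content = new byte[in.readInt()];
                in.readFully(content);
                docs.put(name, new String(content, StandardCharsets.UTF_8));
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return docs;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a.txt", "first document\nwith two lines");
        docs.put("b.txt", "second document");
        byte[] packed = pack(docs);
        System.out.println(unpack(packed).keySet());
    }
}
```

With the documents stored as records in one large file, the number of map tasks follows the size of that file rather than the number of documents, while each map() call still receives one complete document.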


-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


