hadoop-common-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Advice wanted
Date Wed, 25 Oct 2006 20:08:18 GMT

I have two tasks (one is essentially a special case of the other) and I am looking for some advice on how best to take advantage of Hadoop for them:

1. I have a list of tokens that need to be translated from the source language to the destination language. My approach is to write the tokens out to the FileSystem, one per line, and then distribute (map) them across the cluster for translation as Text. I am not sure, however, how best to pass along the needed metadata (source and destination language). My thought is to add the source and destination languages to the JobConf (though I could also see encoding them into the name of the file on the file system, and from there into the key). Then, during the map phase, I would either read the properties from the JobConf or decode the key to determine the source and target languages.
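To make the two options concrete, here is a minimal sketch in plain Java. This is not the Hadoop API: `Properties` stands in for `JobConf`, and the property names and the `src|dst|token` key layout are made-up conventions for illustration only.

```java
import java.util.Properties;

// Sketch of the two metadata-passing options from the post.
// Properties is a stand-in for Hadoop's JobConf; the property
// names and key format below are illustrative, not standard.
public class LangMetadata {

    // Option A: job-wide configuration, read once per map task
    // (in Hadoop this would happen in the Mapper's configure()).
    public static String[] fromConf(Properties conf) {
        return new String[] {
            conf.getProperty("translate.source.lang"),
            conf.getProperty("translate.dest.lang")
        };
    }

    // Option B: metadata encoded into each record's key as
    // "src|dst|token", decoded per record in the map phase.
    public static String[] decodeKey(String key) {
        String[] parts = key.split("\\|", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                "expected src|dst|token, got: " + key);
        }
        return parts; // { sourceLang, destLang, token }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("translate.source.lang", "en");
        conf.setProperty("translate.dest.lang", "fr");
        String[] langs = fromConf(conf);
        System.out.println(langs[0] + " -> " + langs[1]);
        System.out.println(String.join(",", decodeKey("en|fr|hello")));
    }
}
```

The trade-off: the JobConf route fixes one language pair per job, while key-encoding lets a single job mix pairs at the cost of decoding every record.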

2. This time, instead of tokens, I have some number of whole documents that need to be translated from source to destination, and given how the translation systems work, it is best to have the whole document together when requesting a translation. My plan here is to implement my own InputFormat again, this time returning the whole document from RecordReader.next() and overriding getSplits() in InputFormatBase to return only one split per file, regardless of numSplits. Again, I would need to put the metadata somewhere, either in the JobConf or in the key.
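The "one record per file" idea can be sketched as follows. Again this is plain Java rather than Hadoop: java.nio stands in for the Hadoop FileSystem API, and `WholeFileReader`/`nextRecord` are hypothetical names illustrating what a RecordReader.next() returning the whole document would produce.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: a reader that emits exactly one record per file, with the
// whole document as the value and the file name as the key (which is
// where per-file metadata like the language pair could be encoded).
// Plain java.nio stands in for Hadoop's FileSystem; this is not a
// Hadoop class.
public class WholeFileReader {

    // Return the entire file contents as a single record.
    public static String[] nextRecord(Path file) throws IOException {
        String key = file.getFileName().toString();
        String value = new String(Files.readAllBytes(file),
                                  StandardCharsets.UTF_8);
        return new String[] { key, value };
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("doc_en_fr", ".txt");
        Files.write(tmp, "Whole document body.".getBytes(StandardCharsets.UTF_8));
        String[] record = nextRecord(tmp);
        System.out.println(record[0] + " => " + record[1]);
        Files.delete(tmp);
    }
}
```

In Hadoop terms, the getSplits() override would pair with this by returning a single split spanning each file, so no document is ever cut across map tasks.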

Is there a better way of doing this, or am I on the right track?

