hadoop-common-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Advice wanted
Date Thu, 26 Oct 2006 18:10:20 GMT
Andrzej Bialecki wrote:
> Grant Ingersoll wrote:
>> 2. This time, instead of tokens I have X number of whole documents 
>> that need to be translated from source to destination and the way the 
>> translation systems work, it is best to have the whole document 
>> together when getting a translation.  My plan here is to implement my 
>> own InputFormat again, this time returning the whole document from the 
>> RecordReader.next() and overriding getSplits() in InputFormatBase to 
>> return only one split per file, regardless of numSplits.  Again, I 
>> would need to put the metadata somewhere, either the JobConf or the key.
>>
>> Is there a better way of doing this or am I on the right track?
> Basically, it's ok - the only problematic aspect is that if you have 
> millions of documents then using this method you will get millions of 
> map tasks to execute, because you create as many splits (hence, map 
> tasks) as there are files ... perhaps a better way would be to first 
> wrap these documents into a single SequenceFile consisting of <fileName, 
> fileContent>, and use SequenceFileInputFormat.

Another approach is to list the names of the files in one big flat 
text file, then use that file as the input, with TextInputFormat.  Then 
map() will be passed one file name per call, and can open that file, 
translate it and collect the output.  That avoids having to append the 
content of all the files.

Doug
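
The file-list approach described above can be sketched roughly as follows. This is only an illustration in plain Java, not code from the thread: it uses no Hadoop classes, the class name FileListMapSketch and the methods translate(), mapOne() and run() are hypothetical, and translate() is a trivial stand-in for a real translation system. It also assumes the listed files are readable from the local filesystem; in an actual job map() would more likely open them through Hadoop's FileSystem API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java sketch (no Hadoop classes) of the per-record work map() would do
// when the job input is a text file holding one document path per line.
class FileListMapSketch {

    // Hypothetical stand-in for the real translation system.
    static String translate(String document) {
        return document.toUpperCase(Locale.ROOT);
    }

    // What map() would do for one input line: open the named file, translate
    // the whole document in one piece, and return the result to be collected.
    static String mapOne(String fileName) throws IOException {
        byte[] raw = Files.readAllBytes(Paths.get(fileName));
        return translate(new String(raw, StandardCharsets.UTF_8));
    }

    // Stand-in for the framework handing each line of the list file to map().
    // Returns (fileName, translatedContent) pairs in list order.
    static List<String[]> run(Path listFile) throws IOException {
        List<String[]> collected = new ArrayList<>();
        for (String line : Files.readAllLines(listFile, StandardCharsets.UTF_8)) {
            String fileName = line.trim();
            if (fileName.isEmpty()) {
                continue; // skip blank lines in the list file
            }
            collected.add(new String[] { fileName, mapOne(fileName) });
        }
        return collected;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the flat text file listing the documents.
        for (String[] pair : run(Paths.get(args[0]))) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

Because the input records are short file names rather than document bodies, the list file can be split across map tasks however the framework likes, while each document is still read and translated whole.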
