hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: Question about input file breakdown
Date Mon, 15 Oct 2007 21:38:16 GMT

If you have time, update the wiki FAQ on this so that the next person has an
easy time figuring this out.


On 10/15/07 2:22 PM, "Ming Yang" <minghsien@gmail.com> wrote:

> Thank you! After tracing the code I realized that I should override
> getRecordReader(...) as well to return the whole content of the file,
> i.e. to finish the job. :)
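
For reference, a minimal sketch of the finished job against the old
org.apache.hadoop.mapred API (untested; the class names WholeFileInputFormat
and WholeFileRecordReader are placeholders, and very old releases may need
the generics dropped): isSplitable keeps each file in a single split, and
getRecordReader returns the file's entire contents as one record keyed by
its path.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;                               // one split per file
  }

  public RecordReader<Text, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new WholeFileRecordReader((FileSplit) split, job);
  }

  static class WholeFileRecordReader
      implements RecordReader<Text, BytesWritable> {

    private final FileSplit split;
    private final JobConf job;
    private boolean done = false;

    WholeFileRecordReader(FileSplit split, JobConf job) {
      this.split = split;
      this.job = job;
    }

    // Emits exactly one record: key = file path, value = file bytes.
    public boolean next(Text key, BytesWritable value) throws IOException {
      if (done) return false;
      Path path = split.getPath();
      FileSystem fs = path.getFileSystem(job);
      byte[] contents = new byte[(int) split.getLength()];
      FSDataInputStream in = fs.open(path);
      try {
        in.readFully(0, contents);
      } finally {
        in.close();
      }
      key.set(path.toString());
      value.set(contents, 0, contents.length);
      done = true;
      return true;
    }

    public Text createKey() { return new Text(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return done ? split.getLength() : 0; }
    public float getProgress() { return done ? 1.0f : 0.0f; }
    public void close() { }
  }
}

Note that this reads the whole file into the task's memory, so it only suits
files that fit comfortably in the heap; the "lazy version" Ted mentions below
would stream instead.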
> 
> 2007/10/15, Ted Dunning <tdunning@veoh.com>:
>> 
>> 
>> You didn't do anything wrong.  You just didn't finish the job.
>> 
>> You need to override getRecordReader as well so that it returns the contents
>> of the file (or a lazy version of same) as a single record.
>> 
>> 
>> On 10/15/07 11:00 AM, "Ming Yang" <minghsien@gmail.com> wrote:
>> 
>>> I just did a test by simply extending TextInputFormat
>>> and overriding isSplitable(FileSystem fs, Path file) to always
>>> return false. However, in my mapper, I still see the input
>>> file get split into lines. I did set the input format in
>>> the JobConf, and isSplitable(...) -> false did get called
>>> during job execution. Did I do something wrong, or is
>>> this the behavior I should be expecting?
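
For concreteness, the attempt described above amounts to something like
this (a sketch against the old mapred API; the class name
NoSplitTextInputFormat is a placeholder). Splitting is suppressed, so each
file becomes one map task, but the inherited record reader still feeds the
mapper one line per record:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

public class NoSplitTextInputFormat extends TextInputFormat {
  // Stops Hadoop from cutting the file into multiple splits...
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
  // ...but getRecordReader is inherited from TextInputFormat, so the
  // mapper still receives the file line by line.
}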
>>> 
>>> Thanks,
>>> 
>>> Ming
>>> 
>>> 2007/10/15, Ted Dunning <tdunning@veoh.com>:
>>>> 
>>>> That doesn't quite do what the poster requested.  They wanted to pass the
>>>> entire file to the mapper.
>>>> 
>>>> That requires a custom input format or an indirect input approach (list of
>>>> file names in input).
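
A minimal sketch of the indirect approach (untested, old mapred API;
FileNameMapper is a placeholder name): the job's input is a small text file
listing one HDFS path per line, and the mapper opens each named file itself.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FileNameMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private JobConf job;

  public void configure(JobConf job) {
    this.job = job;                 // keep the conf for FileSystem access
  }

  // Each input record is one line of the list file: a path to process.
  public void map(LongWritable offset, Text fileName,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    Path path = new Path(fileName.toString().trim());
    FileSystem fs = path.getFileSystem(job);
    long len = fs.getFileStatus(path).getLen();
    byte[] contents = new byte[(int) len];
    FSDataInputStream in = fs.open(path);
    try {
      in.readFully(0, contents);
    } finally {
      in.close();
    }
    // Process the whole file here; as a placeholder, emit its length.
    output.collect(fileName, new Text(Long.toString(len)));
  }
}

One caveat: a single list file arrives as a single split, so all the named
files would be processed by one map task; splitting the list across several
input files restores parallelism.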
>>>> 
>>>> 
>>>> On 10/15/07 9:57 AM, "Rick Cox" <rick.cox@gmail.com> wrote:
>>>> 
>>>>> You can also gzip each input file. Hadoop will not split a compressed
>>>>> input file (but will automatically decompress it before feeding it to
>>>>> your mapper).
>>>>> 
>>>>> rick
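
For completeness, the gzip route needs no custom code at all. A sketch (the
input path is a placeholder, and setInputPath is the JobConf call of that
era): TextInputFormat sees the .gz suffix, declines to split the file, and
decompresses it before the mapper runs.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class GzipWholeFiles {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(GzipWholeFiles.class);
    job.setInputPath(new Path("/user/ming/input-gz")); // dir of .gz files
    job.setInputFormat(TextInputFormat.class);
    // ...set mapper, output types, and output path as usual...
    JobClient.runJob(job);
  }
}

As Ted notes above, though, the mapper still receives the file line by line;
gzipping only guarantees that all of one file's lines reach the same map task.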
>>>>> 
>>>>> On 10/15/07, Ted Dunning <tdunning@veoh.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> Use a list of file names as your map input.  Then your mapper can read a
>>>>>> line, use that to open and read a file for processing.
>>>>>> 
>>>>>> This is similar to the problem of web-crawling where the input is a list
>>>>>> of URLs.
>>>>>> 
>>>>>> On 10/15/07 6:57 AM, "Ming Yang" <minghsien@gmail.com> wrote:
>>>>>> 
>>>>>>> I was writing a test mapreduce program and noticed that the
>>>>>>> input file was always broken down into separate lines and fed
>>>>>>> to the mapper. However, in my case I need to process the whole
>>>>>>> file in the mapper since there are some dependencies between
>>>>>>> lines in the input file. Is there any way I can achieve this --
>>>>>>> process the whole input file, either text or binary, in the mapper?
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>> 

