hadoop-common-user mailing list archives

From Enis Soztutar <enis.soz.nu...@gmail.com>
Subject Re: Parsing a directory of 300,000 HTML files?
Date Thu, 25 Oct 2007 06:40:04 GMT
I can think of two ways to do this:
1. Use MultiFileInputFormat for this job, and split the input so that 
each mapper gets many files to read.
2. First pack the files into one large SequenceFile of <url, html> pairs, 
then use SequenceFileInputFormat (a sketch of this second approach follows).
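
Option 1 would mean subclassing MultiFileInputFormat with a custom record
reader, which is more code than fits in a post; below is a minimal sketch
of option 2 instead, assuming the classic org.apache.hadoop.mapred-era API.
The corpus directory, output path, and class name are hypothetical, and a
real run over 300,000 files would want error handling and probably block
compression on the writer.

    import java.io.DataInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class HtmlPacker {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical locations for the raw corpus and the packed output.
        File corpusDir = new File("/local/html-corpus");
        Path out = new Path("/user/hadoop/html.seq");

        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
        try {
          for (File f : corpusDir.listFiles()) {
            // Key the page by filename; a real job would key by URL.
            writer.append(new Text(f.getName()), new Text(slurp(f)));
          }
        } finally {
          writer.close();
        }
      }

      // Read a whole local file into a String.
      private static String slurp(File f) throws IOException {
        byte[] buf = new byte[(int) f.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(f));
        try {
          in.readFully(buf);
        } finally {
          in.close();
        }
        return new String(buf, "UTF-8");
      }
    }

The analysis job then only needs conf.setInputFormat(SequenceFileInputFormat.class)
to get the <url, html> pairs back in its mapper.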

David Balatero wrote:
> I like your style regarding pre-processing into a few large files with 
> Hadoop. I think I may go that route, unless anyone else has any 
> brilliant ideas.
>
> - David
>
> On Oct 24, 2007, at 5:29 PM, Ted Dunning wrote:
>
>>
>> File open time is an issue if you have lots and lots of little files.
>>
>> If you are doing this analysis once or a few times, then it isn't worth
>> reformatting into a few larger files.
>>
>> If you are likely to do this analysis dozens of times, then opening larger
>> files will probably give you a significant benefit in terms of runtime.
>>
>> If the runtime isn't terribly important, then the filename-per-line
>> approach will work fine.
>>
>> Note that the filename-per-line approach is a great way to do the
>> pre-processing into a few large files, which will then be analyzed
>> faster (see the sketch below).
>>
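
A hedged sketch of that pre-processing step, again against the classic
org.apache.hadoop.mapred API: a map-only job whose input is a plain text
file listing one HDFS path per line, and whose output, written through
SequenceFileOutputFormat, becomes the packed <url, html> file. The class
name and the path-for-URL substitution are assumptions, not anything from
the thread.

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical mapper: reads one HDFS path per input line and emits
    // the file's contents keyed by that path.
    public class PackMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private FileSystem fs;

      public void configure(JobConf job) {
        try {
          fs = FileSystem.get(job);
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        Path p = new Path(line.toString().trim());
        byte[] buf = new byte[(int) fs.getFileStatus(p).getLen()];
        FSDataInputStream in = fs.open(p);
        try {
          in.readFully(buf);
        } finally {
          in.close();
        }
        // The path stands in for the URL; a real job would carry the
        // original URL on each input line instead.
        output.collect(new Text(p.toString()),
                       new Text(new String(buf, "UTF-8")));
      }
    }

The driver would pair this with TextInputFormat on the file list, zero
reduces (conf.setNumReduceTasks(0)), and SequenceFileOutputFormat, so each
map writes its <url, html> pairs straight into SequenceFiles.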
>> On 10/24/07 5:09 PM, "David Balatero" <ezwelty@u.washington.edu> wrote:
>>
>>> I have a corpus of 300,000 raw HTML files that I want to read in and
>>> parse using Hadoop. What is the best input file format to use in this
>>> case? I want to have access to each page's raw HTML in the mapper, so
>>> I can parse from there.
>>>
>>> I was thinking of preprocessing all the files, removing the new
>>> lines, and putting them in a big <key, value> file:
>>>
>>> url1, html with stripped new lines
>>> url2, ....
>>> url3, ....
>>> ...
>>> urlN, ....
>>>
>>> I'd rather not do all this preprocessing just to wrangle the text
>>> into Hadoop. Any other suggestions? What if I just stored the path to
>>> each HTML file in a <key, value> file instead:
>>>
>>> url1, path_to_file1
>>> url2, path_to_file2
>>> ...
>>> urlN, path_to_fileN
>>>
>>> Then in the mapper, I could read each file in from the DFS on the
>>> fly. Anyone have any other good ideas? I feel like there's some key
>>> function that I'm just stupidly overlooking...
>>>
>>> Thanks!
>>> David Balatero
>>
>
>
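
To close the loop on Enis's second option: once the pairs are packed, the
analysis job itself only needs SequenceFileInputFormat, and the mapper
receives each page's raw HTML directly, exactly as David wanted. A minimal,
hedged driver sketch follows; the job name, mapper body, and paths are all
made up here.

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;

    public class ParseHtmlJob {

      // Placeholder mapper: sees <url, html> pairs straight out of the
      // SequenceFile; real HTML parsing would happen here.
      public static class ParseMapper extends MapReduceBase
          implements Mapper<Text, Text, Text, Text> {
        public void map(Text url, Text html,
                        OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
          // As a stand-in for real parsing, emit the page size in bytes.
          output.collect(url, new Text(Integer.toString(html.getLength())));
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(ParseHtmlJob.class);
        conf.setJobName("parse-html");
        conf.setInputFormat(SequenceFileInputFormat.class);
        conf.setMapperClass(ParseMapper.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        // Hypothetical locations for the packed corpus and the results.
        FileInputFormat.setInputPaths(conf, new Path("/user/hadoop/html.seq"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/hadoop/parsed"));
        JobClient.runJob(conf);
      }
    }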
