hadoop-common-user mailing list archives

From David Balatero <ezwe...@u.washington.edu>
Subject Re: Parsing a directory of 300,000 HTML files?
Date Thu, 25 Oct 2007 00:43:51 GMT
I like your style regarding pre-processing into a few large files  
with Hadoop. I think I may go that route, unless anyone else has any  
brilliant ideas.

- David
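
A minimal sketch of the pre-processing step, assuming the pages sit on a local disk and that a tab-separated "key, flattened HTML" text file is an acceptable target. The names pack_html_files and flatten_html are hypothetical, and the file path stands in for the URL key because the thread does not say how URLs map to files. It is plain Python, not Hadoop API code.

```python
import re

# Tabs and line breaks would break the one-record-per-line, tab-separated layout.
_LINE_NOISE = re.compile(r"[\t\r\n]+")


def flatten_html(html):
    """Collapse runs of tabs and line breaks into single spaces."""
    return _LINE_NOISE.sub(" ", html).strip()


def pack_html_files(paths, out_path):
    """Write one 'key<TAB>flattened html' line per input file.

    The key here is the file path (an assumption; a real job would
    probably use the page URL). Returns the number of records written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            # errors="replace" keeps the run going on badly encoded pages.
            with open(path, encoding="utf-8", errors="replace") as f:
                html = f.read()
            out.write(path + "\t" + flatten_html(html) + "\n")
            count += 1
    return count
```

Flattening changes whitespace inside elements such as <pre>, so this only suits analyses that do not depend on the original line breaks.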

On Oct 24, 2007, at 5:29 PM, Ted Dunning wrote:

>
>
> File open time is an issue if you have lots and lots of little files.
>
> If you are doing this analysis once or a few times, then it isn't worth
> reformatting into a few larger files.
>
> If you are likely to do this analysis dozens of times, then opening larger
> files will probably give you a significant benefit in terms of runtime.
>
> If the runtime isn't terribly important, then the filename per line approach
> will work fine.
>
> Note that the filename per line approach is a great way to do the
> pre-processing into a few large files which will then be analyzed faster.
>
> On 10/24/07 5:09 PM, "David Balatero" <ezwelty@u.washington.edu> wrote:
>
>> I have a corpus of 300,000 raw HTML files that I want to read in and
>> parse using Hadoop. What is the best input file format to use in this
>> case? I want to have access to each page's raw HTML in the mapper, so
>> I can parse from there.
>>
>> I was thinking of preprocessing all the files, removing the new
>> lines, and putting them in a big <key, value> file:
>>
>> url1, html with stripped new lines
>> url2, ....
>> url3, ....
>> ...
>> urlN, ....
>>
>> I'd rather not do all this preprocessing, just to wrangle the text
>> into Hadoop. Any other suggestions? What if I just stored the path to
>> the HTML file in a <key, value> type file:
>>
>> url1, path_to_file1
>> url2, path_to_file2
>> ...
>> urlN, path_to_fileN
>>
>> Then in the mapper, I could read each file in from the DFS on the
>> fly. Anyone have any other good ideas? I feel like there's some key
>> function that I'm just stupidly overlooking...
>>
>> Thanks!
>> David Balatero
>
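
The filename per line approach discussed above can be sketched as a map step that receives "url<TAB>path" lines and loads each page on the fly. This is an illustration in plain Python, not Hadoop API code: map_paths and read_local are hypothetical names, and read_file is a placeholder for whatever actually fetches a file from the DFS.

```python
def read_local(path):
    """Stand-in reader for local files; a real job would read from the DFS."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()


def map_paths(lines, read_file=read_local):
    """For each 'url<TAB>path' input line, yield (url, raw_html).

    One file is opened per record, which is the per-file open cost
    mentioned in the thread. The raw HTML keeps its original newlines.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank input lines
        url, path = line.split("\t", 1)
        yield url, read_file(path)
```

Because the input is a small text file of paths, no reformatting of the pages is needed up front; the same loop could also write its output into a few large files for later runs.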

