hadoop-common-user mailing list archives

From David Balatero <ezwe...@u.washington.edu>
Subject Parsing a directory of 300,000 HTML files?
Date Thu, 25 Oct 2007 00:09:49 GMT
I have a corpus of 300,000 raw HTML files that I want to read in and
parse using Hadoop. What is the best input file format to use in this
case? I want access to each page's raw HTML in the mapper, so I can
parse it from there.

I was thinking of preprocessing all the files, stripping out the
newlines, and putting them into one big <key, value> file:

url1, html with newlines stripped
url2, ....
url3, ....
urlN, ....
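
Something like the following is what I had in mind for that
preprocessing step, as a rough sketch. It assumes a SequenceFile of
<Text, Text> records (records are length-delimited, so the newline
stripping might not even be necessary), and urlFor() is a made-up
helper for recovering a page's URL from its file name:

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class HtmlToSequenceFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // args[0]: local directory of HTML files, args[1]: output path on DFS.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), Text.class, Text.class);
    try {
      for (File f : new File(args[0]).listFiles()) {
        // Slurp the whole page; newlines can stay, since SequenceFile
        // records are not line-delimited. (Text assumes UTF-8; for
        // arbitrary page encodings BytesWritable would be safer.)
        byte[] bytes = new byte[(int) f.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(f));
        in.readFully(bytes);
        in.close();
        writer.append(new Text(urlFor(f)), new Text(bytes));
      }
    } finally {
      writer.close();
    }
  }

  // Hypothetical helper: map a file name back to its source URL.
  private static String urlFor(File f) {
    return f.getName();
  }
}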

I'd rather not do all this preprocessing just to wrangle the text
into Hadoop, though. Any other suggestions? What if I just stored the
path to each HTML file in a <key, value> file instead:

url1, path_to_file1
url2, path_to_file2
urlN, path_to_fileN

Then in the mapper, I could read each file in from the DFS on the  
fly. Anyone have any other good ideas? I feel like there's some key  
function that I'm just stupidly overlooking...
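
If it helps, here's a rough sketch of what that mapper might look
like, assuming the input is a text file of "url<TAB>path" lines read
with KeyValueTextInputFormat (so the key is the URL and the value is
the path to the page on DFS):

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FetchHtmlMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  private FileSystem fs;

  public void configure(JobConf job) {
    try {
      fs = FileSystem.get(job);
    } catch (IOException e) {
      throw new RuntimeException("could not get FileSystem", e);
    }
  }

  public void map(Text url, Text path,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Pull the raw page off DFS on the fly.
    FSDataInputStream in = fs.open(new Path(path.toString()));
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try {
      IOUtils.copyBytes(in, buf, 4096, false);
    } finally {
      in.close();
    }
    // Real parsing would happen here; the sketch just re-emits the HTML.
    output.collect(url, new Text(buf.toByteArray()));
  }
}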

David Balatero
