hadoop-common-user mailing list archives

From David Balatero <ezwe...@u.washington.edu>
Subject Parsing a directory of 300,000 HTML files?
Date Thu, 25 Oct 2007 00:09:49 GMT
I have a corpus of 300,000 raw HTML files that I want to read in and parse using Hadoop. What is the best input file format to use in this case? I want to have access to each page's raw HTML in the mapper, so I can parse it from there.

I was thinking of preprocessing all the files, removing the newlines, and putting them in one big <key, value> file:

url1, html with stripped new lines
url2, ....
url3, ....
...
urlN, ....
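
For what it's worth, here's roughly what I'm picturing for the mapper in that case. This is only a sketch against the org.apache.hadoop.mapred API, assuming each line of the preprocessed file looks like url<TAB>html-with-newlines-stripped and that the default TextInputFormat feeds it to the map; HtmlParseMapper and parseHtml() are made-up names, not anything that exists yet:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Each input line is "url<TAB>html-with-newlines-stripped"; TextInputFormat
// delivers the byte offset as the key and the whole line as the value.
public class HtmlParseMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String record = line.toString();
    int tab = record.indexOf('\t');
    if (tab < 0) {
      return;                                // skip malformed records
    }
    String url  = record.substring(0, tab);
    String html = record.substring(tab + 1);
    // parseHtml() stands in for whatever real HTML parsing happens here
    output.collect(new Text(url), new Text(parseHtml(html)));
  }

  private String parseHtml(String html) {
    return html;                             // placeholder
  }
}

The only wrinkle is splitting on the first tab, since everything after it has to be treated as the page body.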

I'd rather not do all this preprocessing just to wrangle the text into Hadoop. Any other suggestions? What if I just stored the path to each HTML file in a <key, value> file instead:

url1, path_to_file1
url2, path_to_file2
...
urlN, path_to_fileN

Then in the mapper, I could read each file in from the DFS on the fly. Anyone have any other good ideas? I feel like there's some key function that I'm just stupidly overlooking...
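
To make that concrete, the mapper for the path-based version might look roughly like this. Again just a sketch: FetchAndParseMapper is a made-up name, and I'm assuming tab-separated url/path lines and UTF-8 files:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Each input line is "url<TAB>path_to_file"; the mapper opens the file
// against the DFS and emits <url, raw html>.
public class FetchAndParseMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private FileSystem fs;

  public void configure(JobConf job) {
    try {
      fs = FileSystem.get(job);
    } catch (IOException e) {
      throw new RuntimeException("could not get filesystem", e);
    }
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String[] parts = line.toString().split("\t", 2);
    if (parts.length < 2) {
      return;                                // skip malformed records
    }
    String url = parts[0];
    Path file = new Path(parts[1]);

    // read the whole page into memory (each file is a single HTML page)
    byte[] buf = new byte[(int) fs.getFileStatus(file).getLen()];
    FSDataInputStream in = fs.open(file);
    try {
      in.readFully(0, buf);
    } finally {
      in.close();
    }
    output.collect(new Text(url), new Text(new String(buf, "UTF-8")));
  }
}

The part that worries me about this one is that it does 300,000 separate open/read/close calls against the DFS from inside the map tasks, with no data locality, which is why I suspect there's a more standard pattern I'm missing.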

Thanks!
David Balatero
