File open time is an issue if you have lots and lots of little files.
If you are doing this analysis once or a few times, then it isn't worth
reformatting into a few larger files.
If you are likely to do this analysis dozens of times, then opening larger
files will probably give you a significant runtime benefit.
If the runtime isn't terribly important, then the filename per line approach
will work fine.
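
For the filename per line case, the mapper just opens each file from the DFS
itself. Here is a rough, untested sketch using the org.apache.hadoop.mapred
API; the class name and the url<TAB>path input line format are my own
assumptions, not anything standard:

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper: each input line is "url<TAB>path_to_file";
// it opens the file from the DFS and emits <url, raw html>.
public class HtmlFetchMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private FileSystem fs;

  public void configure(JobConf job) {
    try {
      fs = FileSystem.get(job);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String[] fields = line.toString().split("\t", 2);
    if (fields.length < 2) {
      return;  // skip malformed lines
    }
    String url = fields[0];
    Path path = new Path(fields[1]);

    // Slurp the whole file; fine for typical HTML page sizes.
    FSDataInputStream in = fs.open(path);
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try {
      IOUtils.copyBytes(in, buf, 4096, false);
    } finally {
      in.close();
    }

    // From here you can parse the HTML, or just re-emit it to build
    // the big <url, html> files for later runs.
    output.collect(new Text(url), new Text(buf.toString("UTF-8")));
  }
}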
Note that the filename per line approach is a great way to do the
pre-processing into a few large files, which can then be analyzed faster.
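
To do that pre-processing, you can run the same kind of mapper as a map-only
job with SequenceFileOutputFormat, so the <url, html> pairs land in a handful
of big files. Again, just a sketch; the driver class name is made up and it
assumes the HtmlFetchMapper above:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;

public class PreprocessJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(PreprocessJob.class);
    conf.setJobName("html-preprocess");

    conf.setInputFormat(TextInputFormat.class);            // one "url<TAB>path" per line
    conf.setMapperClass(HtmlFetchMapper.class);            // mapper from the sketch above
    conf.setNumReduceTasks(0);                             // map-only; just rewrite the data

    conf.setOutputFormat(SequenceFileOutputFormat.class);  // a few big <url, html> files
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

Later runs can then read those files with SequenceFileInputFormat and get
<url, html> pairs straight into the mapper with no per-file open cost.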
On 10/24/07 5:09 PM, "David Balatero" <ezwelty@u.washington.edu> wrote:
> I have a corpus of 300,000 raw HTML files that I want to read in and
> parse using Hadoop. What is the best input file format to use in this
> case? I want to have access to each page's raw HTML in the mapper, so
> I can parse from there.
>
> I was thinking of preprocessing all the files, removing the new
> lines, and putting them in a big <key, value> file:
>
> url1, html with stripped new lines
> url2, ....
> url3, ....
> ...
> urlN, ....
>
> I'd rather not do all this preprocessing, just to wrangle the text
> into Hadoop. Any other suggestions? What if I just stored the path to
> the HTML file in a <key, value> file:
>
> url1, path_to_file1
> url2, path_to_file2
> ...
> urlN, path_to_fileN
>
> Then in the mapper, I could read each file in from the DFS on the
> fly. Anyone have any other good ideas? I feel like there's some key
> function that I'm just stupidly overlooking...
>
> Thanks!
> David Balatero