hadoop-general mailing list archives

From Rajiv Maheshwari <rajiv...@yahoo.com>
Subject CombineFileInputFormat vs. Large SequenceFile
Date Fri, 23 Oct 2009 19:52:19 GMT
Hi everyone,

I need to process a large number (millions) of relatively small XML files, mostly 1 KB to
1 MB each, I would imagine. I suspect that sending each file to its own map task would incur
too much overhead in task setup and teardown, so I am considering two alternatives:

1) Generate one large SequenceFile with <K,V> = <filename/URI, file's XML content> covering
all the files (see the first sketch after this list). This SequenceFile would be huge. Is
there any maximum record length, or a maximum size limit on an HDFS file?

2) Use CombineFileInputFormat (see the second sketch after this list).
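
For option 1, here is roughly what I have in mind, a minimal sketch only: the paths are made
up, there is no error handling, and I read each file whole into memory, which should be fine
at 1 KB to 1 MB per file.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackXmlFiles {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path input = new Path("/data/xml");           // hypothetical input directory
    Path output = new Path("/data/xml-pack.seq"); // hypothetical output file

    // Block compression should amortize well over many small records.
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, output,
        Text.class, BytesWritable.class, SequenceFile.CompressionType.BLOCK);
    try {
      for (FileStatus stat : fs.listStatus(input)) {
        if (stat.isDir()) continue;
        // Files are at most ~1 MB, so reading each one whole is cheap.
        byte[] buf = new byte[(int) stat.getLen()];
        FSDataInputStream in = fs.open(stat.getPath());
        try {
          in.readFully(buf);
        } finally {
          in.close();
        }
        // <K,V> = <file URI, raw XML bytes>
        writer.append(new Text(stat.getPath().toString()),
            new BytesWritable(buf));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}

This is a single serial writer, of course; with millions of files I would probably shard the
packing step itself.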
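For option 2, here is a sketch of the kind of CombineFileInputFormat subclass I think is
needed with the old (0.20 mapred) API. WholeFileReader is my own name, not a Hadoop class,
and the 128 MB split cap is an arbitrary choice of mine.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

// Packs many small files into each split, so one map task (one JVM
// setup/teardown) processes many files.
public class CombinedXmlInputFormat
    extends CombineFileInputFormat<Text, BytesWritable> {

  public CombinedXmlInputFormat() {
    // Without a cap, everything on a node/rack may be combined into one
    // split; 128 MB here is my own guess at a reasonable value.
    setMaxSplitSize(128 * 1024 * 1024);
  }

  @Override
  @SuppressWarnings("unchecked")
  public RecordReader<Text, BytesWritable> getRecordReader(
      InputSplit split, JobConf conf, Reporter reporter) throws IOException {
    // CombineFileRecordReader instantiates one WholeFileReader per file
    // chunk of the combined split.
    return new CombineFileRecordReader(conf, (CombineFileSplit) split,
        reporter, WholeFileReader.class);
  }

  // Emits <file path, file bytes> as a single record for one file of the
  // combined split. The constructor signature is the one the old-API
  // CombineFileRecordReader expects.
  public static class WholeFileReader
      implements RecordReader<Text, BytesWritable> {
    private final Path path;
    private final int length; // files are <= ~1 MB, so int is safe here
    private final FileSystem fs;
    private boolean done = false;

    public WholeFileReader(CombineFileSplit split, Configuration conf,
        Reporter reporter, Integer index) throws IOException {
      path = split.getPath(index);
      length = (int) split.getLength(index);
      fs = path.getFileSystem(conf);
    }

    public boolean next(Text key, BytesWritable value) throws IOException {
      if (done) return false;
      key.set(path.toString());
      byte[] buf = new byte[length];
      FSDataInputStream in = fs.open(path);
      try {
        in.readFully(buf);
      } finally {
        in.close();
      }
      value.set(buf, 0, length);
      done = true;
      return true;
    }

    public Text createKey() { return new Text(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return done ? length : 0; }
    public float getProgress() { return done ? 1.0f : 0.0f; }
    public void close() {}
  }
}

In the driver I suppose it would then just be
conf.setInputFormat(CombinedXmlInputFormat.class).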

I would appreciate any comments on performance considerations and other pros and cons.

One more question: if instead I had one large XML file composed of (a series of) each file's
XML content, would it be possible to use StreamXmlRecordReader to split it back into the
component records before they reach the map tasks? A rough sketch of what I mean follows.
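
Something along these lines, assuming each component document is wrapped in a hypothetical
<record> tag and the streaming jar is on the classpath; the tag names and class name are
my guesses at how this would be wired up.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.streaming.StreamInputFormat;

public class XmlChunkJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(XmlChunkJob.class);
    conf.setJobName("xml-chunks");

    // StreamInputFormat consults the stream.recordreader.* properties to
    // pick and configure the record reader; <record> is a hypothetical
    // wrapper tag around each component document.
    conf.setInputFormat(StreamInputFormat.class);
    conf.set("stream.recordreader.class",
        "org.apache.hadoop.streaming.StreamXmlRecordReader");
    conf.set("stream.recordreader.begin", "<record>");
    conf.set("stream.recordreader.end", "</record>");

    // Each map input key is then a Text holding one <record>...</record>
    // chunk, with an empty Text value; identity mapper/reducer by default.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}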

