hadoop-general mailing list archives

From Dhruba Borthakur <dhr...@gmail.com>
Subject Re: CombineFileInput Format vs. Large SequenceFile
Date Sat, 24 Oct 2009 09:06:34 GMT
You could use both. If your query is to be run only once, then either way is
OK. If the query is to be run multiple times, it is better to create a large
SequenceFile as the first step and then query that large file repeatedly.
Otherwise you might end up transferring all the data across the network
multiple times, once for every run of the query.
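
To make the "pack once, query many times" idea concrete, here is a minimal
sketch. It is not the Hadoop SequenceFile format and does not use the Hadoop
API: it only imitates the <filename, content> record layout with plain
java.io, so it runs without a cluster. The class name SmallFilePacker and its
two methods are hypothetical names made up for this illustration.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: packs <name, content> records
// into one stream, imitating the key/value idea behind a SequenceFile.
class SmallFilePacker {

    // Writes each entry as: UTF-encoded name, int content length, raw bytes.
    static void pack(Map<String, byte[]> files, OutputStream out) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        for (Map.Entry<String, byte[]> entry : files.entrySet()) {
            data.writeUTF(entry.getKey());
            data.writeInt(entry.getValue().length);
            data.write(entry.getValue());
        }
        data.flush();
    }

    // Reads records back in the order they were written, until end of stream.
    static Map<String, byte[]> unpack(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        Map<String, byte[]> files = new LinkedHashMap<>();
        while (true) {
            String name;
            try {
                name = data.readUTF();
            } catch (EOFException endOfStream) {
                return files; // no more records
            }
            byte[] content = new byte[data.readInt()];
            data.readFully(content);
            files.put(name, content);
        }
    }
}
```

The packing step is paid once; every later query only reads the single
packed stream. On a real cluster the equivalent step would go through
Hadoop's SequenceFile.Writer rather than this stand-in.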

thanks
dhruba

On Fri, Oct 23, 2009 at 12:52 PM, Rajiv Maheshwari <rajivm01@yahoo.com> wrote:

> Hi everyone,
>
> I have a need to process a large number (millions) of relatively small XML
> files (I would imagine mostly 1K to 1 MB). I guess sending each file to a
> map-reduce task will cause too much overhead in setup and teardown of the
> tasks. So, I am considering 2 alternatives:
>
> 1) Generate 1 large SequenceFile with <K,V> = <Filename/URI, File XML
> content> for all the files. This SequenceFile would be huge. (I wonder, is
> there any max record length or max size limit for an HDFS file?)
>
> 2) Use CombineFileInputFormat
>
> I would appreciate any comments on performance considerations and other
> pros and cons.
>
> One more question: What if I have one large XML file composed of (a series
> of) each file's XML content, could it be possible to use StreamXMLReader
> while combining component files before sending to map-reduce task?
>
> Thanks,
> Rajiv
>
>
>




-- 
Connect to me at http://www.facebook.com/dhruba
