hadoop-common-user mailing list archives

From Bill Graham <billgra...@gmail.com>
Subject Re: Writing small files to one big file in hdfs
Date Tue, 21 Feb 2012 18:41:51 GMT
You might want to check out File Crusher:
http://www.jointhegrid.com/hadoop_filecrush/index.jsp

I've never used it, but it sounds like it could be helpful.

On Tue, Feb 21, 2012 at 10:25 AM, Bejoy Ks <bejoy.hadoop@gmail.com> wrote:

> Hi Mohit
>      AFAIK, XMLLoader in Pig isn't suited to SequenceFiles. Please post
> this question to the Pig user group for a workaround.
>      SequenceFile is the preferred option when we want to store small
> files in HDFS and they need to be processed by MapReduce, as it stores data
> in key-value format. Since SequenceFileInputFormat is available at your
> disposal, you don't need any custom input formats to process them with
> MapReduce. It is a cleaner and better approach than just appending the
> contents of small XML files into one big file.
>
> On Tue, Feb 21, 2012 at 11:00 PM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:
>
> > On Tue, Feb 21, 2012 at 9:25 AM, Bejoy Ks <bejoy.hadoop@gmail.com> wrote:
> >
> > > Mohit
> > >       Rather than just appending the content to a normal text file,
> > > you can create a sequence file with the individual smaller files'
> > > contents as values.
> > >
> > Thanks. I was planning to use Pig's
> > org.apache.pig.piggybank.storage.XMLLoader for processing. Would it work
> > with a sequence file?
> >
> > The text file I was referring to would be in HDFS itself. Is that still
> > different from using a sequence file?
> >
> > > Regards
> > > Bejoy.K.S
> > >
> > > On Tue, Feb 21, 2012 at 10:45 PM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:
> > >
> > > > We have small XML files. Currently I am planning to append these small
> > > > files to one file in HDFS so that I can take advantage of splits, larger
> > > > blocks and sequential IO. What I am unsure of is whether it's OK to
> > > > append one file at a time to this HDFS file.
> > > >
> > > > Could someone suggest whether this is OK? I would like to know how
> > > > others do it.
> > > >
> > >
> >
>
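
To make the difference discussed in the quoted thread concrete, here is a minimal sketch of storing each small file as the value of a key/value record instead of concatenating raw contents. It is only an illustration: the record layout (length-prefixed key and value) and the names append_record, read_records and pack_small_files are made up for this example. It is not the Hadoop SequenceFile format and does not use the Hadoop API.

```python
import io
import struct
from typing import BinaryIO, Dict, Iterator, Tuple

# Hypothetical record layout, assumed for illustration only (NOT SequenceFile):
# 4-byte big-endian key length, key bytes, 4-byte big-endian value length, value bytes.
_LEN = struct.Struct(">I")


def append_record(out: BinaryIO, key: str, value: bytes) -> None:
    """Append one small file as a key/value record (key = file name)."""
    key_bytes = key.encode("utf-8")
    out.write(_LEN.pack(len(key_bytes)))
    out.write(key_bytes)
    out.write(_LEN.pack(len(value)))
    out.write(value)


def read_records(src: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (file name, content) pairs back out of the container."""
    while True:
        header = src.read(_LEN.size)
        if not header:  # clean end of container
            return
        (key_len,) = _LEN.unpack(header)
        key = src.read(key_len).decode("utf-8")
        (value_len,) = _LEN.unpack(src.read(_LEN.size))
        yield key, src.read(value_len)


def pack_small_files(files: Dict[str, bytes]) -> bytes:
    """Pack many small files into one in-memory container."""
    buf = io.BytesIO()
    for name, content in files.items():
        append_record(buf, name, content)
    return buf.getvalue()
```

Because every record carries its own key and lengths, a reader can recover each original file without guessing where one XML document ends and the next begins, which a plain concatenated text file does not give you.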



-- 
*Note that I'm no longer using my Yahoo! email address. Please email me at
billgraham@gmail.com going forward.*
