hadoop-common-user mailing list archives

From Arko Provo Mukherjee <arkoprovomukher...@gmail.com>
Subject Re: Writing small files to one big file in hdfs
Date Tue, 21 Feb 2012 19:34:20 GMT
Hi,

Let's say all the smaller files are in the same directory.

Then you can do:

BufferedWriter output = new BufferedWriter(
        new OutputStreamWriter(fs.create(output_path, true)));    // merged output file

FileStatus[] input_files = fs.listStatus(input_path);              // contents of the input directory

for (int i = 0; i < input_files.length; i++)
{
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(fs.open(input_files[i].getPath())));

    String data = reader.readLine();

    while (data != null)
    {
        output.write(data);
        output.newLine();            // readLine() strips the line break, so put it back
        data = reader.readLine();    // advance to the next line (otherwise the loop never ends)
    }

    reader.close();
}

output.close();


In case you have the files in multiple directories, call the code for each
of them with different input paths.
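
For completeness: the snippet above assumes you already have a FileSystem handle (fs) and Path objects for the input and output. A minimal, untested sketch of that setup (the path strings are just placeholders) would be:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();    // picks up core-site.xml / hdfs-site.xml from the classpath
FileSystem fs = FileSystem.get(conf);        // the configured file system (HDFS on a cluster node)

Path input_path  = new Path("/user/hadoop/small-files");     // placeholder input directory
Path output_path = new Path("/user/hadoop/merged/big.txt");  // placeholder merged output file

Note that if the cluster configuration (fs.default.name, or fs.defaultFS in newer releases) is not on the classpath, FileSystem.get(conf) will hand you the local file system instead of HDFS.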

Hope this helps!

Cheers

Arko

On Tue, Feb 21, 2012 at 1:27 PM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:

> I am trying to look for examples that demonstrates using sequence files
> including writing to it and then running mapred on it, but unable to find
> one. Could you please point me to some examples of sequence files?
>
> On Tue, Feb 21, 2012 at 10:25 AM, Bejoy Ks <bejoy.hadoop@gmail.com> wrote:
>
> > Hi Mohit
> >      AFAIK XMLLoader in pig won't be suited for Sequence Files. Please
> > post the same to Pig user group for some workaround over the same.
> >         SequenceFile is a preferred option when we want to store small
> > files in hdfs that need to be processed by MapReduce, as it stores data in
> > key-value format. Since SequenceFileInputFormat is available at your
> > disposal, you don't need any custom input formats for processing the same
> > using map reduce. It is a cleaner and better approach compared to just
> > appending small xml file contents into a big file.
> >
> > On Tue, Feb 21, 2012 at 11:00 PM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:
> >
> > > On Tue, Feb 21, 2012 at 9:25 AM, Bejoy Ks <bejoy.hadoop@gmail.com> wrote:
> > >
> > > > Mohit
> > > >       Rather than just appending the content into a normal text file or
> > > > so, you can create a sequence file with the individual smaller file
> > > > content as values.
> > > >
> > > Thanks. I was planning to use pig's
> > > org.apache.pig.piggybank.storage.XMLLoader
> > > for processing. Would it work with sequence files?
> > >
> > > This text file that I was referring to would be in hdfs itself. Is it still
> > > different than using a sequence file?
> > >
> > > > Regards
> > > > Bejoy.K.S
> > > >
> > > > On Tue, Feb 21, 2012 at 10:45 PM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:
> > > >
> > > > > We have small xml files. Currently I am planning to append these small
> > > > > files to one file in hdfs so that I can take advantage of splits, larger
> > > > > blocks and sequential IO. What I am unsure of is whether it's ok to append
> > > > > one file at a time to this hdfs file.
> > > > >
> > > > > Could someone suggest if this is ok? Would like to know how others do it.
> > > > >
> > > >
> > >
> >
>
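
Regarding the SequenceFile approach discussed in the quoted thread above: below is a rough, untested sketch of writing each small XML file as one (file name, file contents) record and then pointing a MapReduce job at the result. The class name, paths and job wiring are illustrative assumptions, not code from this thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class SmallXmlToSequenceFile {                              // hypothetical class name

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path("/user/hadoop/small-xml");        // hypothetical input directory
        Path seqFile  = new Path("/user/hadoop/merged/xml.seq");   // hypothetical output SequenceFile

        // One record per small file: key = file name, value = whole XML document.
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, seqFile, Text.class, Text.class);
        try {
            for (FileStatus stat : fs.listStatus(inputDir)) {
                byte[] content = new byte[(int) stat.getLen()];
                FSDataInputStream in = fs.open(stat.getPath());
                try {
                    in.readFully(content);                         // small files, so read in one go
                } finally {
                    in.close();
                }
                writer.append(new Text(stat.getPath().getName()), new Text(content));
            }
        } finally {
            writer.close();
        }

        // A MapReduce job can then read the file with the stock SequenceFileInputFormat;
        // the mapper receives Text keys (file names) and Text values (XML documents).
        Job job = new Job(conf, "process small xml");
        job.setJarByClass(SmallXmlToSequenceFile.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job, seqFile);
        // ... set mapper/reducer/output classes here and call job.waitForCompletion(true).
    }
}

As Bejoy notes, using the stock SequenceFileInputFormat means no custom input format is needed on the MapReduce side; whether Pig's piggybank XMLLoader can consume such a file is a separate question, best taken to the Pig user list.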
