hadoop-common-user mailing list archives

From Arko Provo Mukherjee <arkoprovomukher...@gmail.com>
Subject Re: Writing small files to one big file in hdfs
Date Tue, 21 Feb 2012 20:18:33 GMT
Hi,

I think the following link will help:
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
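To your question about key/value classes: with the default TextInputFormat, map is called once per line, with the byte offset as a LongWritable key and the line as a Text value. A minimal sketch (class and output types here are just illustrative, using the new org.apache.hadoop.mapreduce API):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat, the framework calls map() once per input line:
// the key is the line's byte offset (LongWritable), the value the line (Text).
public class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit the line itself as the output key, its offset as the value.
        context.write(value, key);
    }
}
```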

Cheers
Arko

On Tue, Feb 21, 2012 at 2:04 PM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:

> Sorry, maybe it's something obvious, but I was wondering: when map or reduce
> gets called, what would be the class used for the key and value? If I used
> "org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text();",
> would map be called with the Text class?
>
> public void map(LongWritable key, Text value, Context context) throws
> IOException, InterruptedException {
>
>
> On Tue, Feb 21, 2012 at 11:59 AM, Arko Provo Mukherjee <
> arkoprovomukherjee@gmail.com> wrote:
>
> > Hi Mohit,
> >
> > I am not sure that I understand your question.
> >
> > But you can write into a file using:
> >
> > BufferedWriter output = new BufferedWriter(
> >         new OutputStreamWriter(fs.create(my_path, true)));
> > output.write(data);
> >
> > Then you can pass that file as the input to your MapReduce program:
> >
> > FileInputFormat.addInputPath(jobconf, new Path(my_path));
> >
> > From inside your Map/Reduce methods, I think you should NOT be tinkering
> > with the input/output paths of that MapReduce job.
> >
> > Cheers
> > Arko
> >
> >
> > On Tue, Feb 21, 2012 at 1:38 PM, Mohit Anchlia <mohitanchlia@gmail.com
> > >wrote:
> >
> > > Thanks! How does MapReduce work on a sequence file? Is there an example
> > > I can look at?
> > >
> > > On Tue, Feb 21, 2012 at 11:34 AM, Arko Provo Mukherjee <
> > > arkoprovomukherjee@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > Let's say all the smaller files are in the same directory.
> > > >
> > > > Then you can do:
> > > >
> > > > BufferedWriter output = new BufferedWriter(
> > > >         new OutputStreamWriter(fs.create(output_path, true)));  // Output path
> > > >
> > > > FileStatus[] input_files = fs.listStatus(new Path(input_path));  // Input directory
> > > >
> > > > for (int i = 0; i < input_files.length; i++)
> > > > {
> > > >     BufferedReader reader = new BufferedReader(
> > > >             new InputStreamReader(fs.open(input_files[i].getPath())));
> > > >
> > > >     String data = reader.readLine();
> > > >     while (data != null)
> > > >     {
> > > >         output.write(data);
> > > >         output.newLine();
> > > >         data = reader.readLine();  // advance, or the loop never ends
> > > >     }
> > > >     reader.close();
> > > > }
> > > >
> > > > output.close();
> > > >
> > > >
> > > > In case you have the files in multiple directories, call the code for
> > > each
> > > > of them with different input paths.
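For what it's worth, the same read-and-append pattern against the local filesystem, using only java.io (file names here are illustrative):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class ConcatFiles {
    // Append every line of each input file to a single output file.
    public static void concat(File[] inputs, File output) throws IOException {
        BufferedWriter out = new BufferedWriter(new FileWriter(output));
        for (File in : inputs) {
            BufferedReader reader = new BufferedReader(new FileReader(in));
            String data = reader.readLine();
            while (data != null) {
                out.write(data);
                out.newLine();
                data = reader.readLine();  // advance to the next line
            }
            reader.close();
        }
        out.close();
    }

    public static void main(String[] args) throws IOException {
        File a = File.createTempFile("small", ".xml");
        File b = File.createTempFile("small", ".xml");
        FileWriter w = new FileWriter(a); w.write("<a/>\n"); w.close();
        w = new FileWriter(b); w.write("<b/>\n"); w.close();
        File big = File.createTempFile("big", ".txt");
        concat(new File[] { a, b }, big);
        BufferedReader r = new BufferedReader(new FileReader(big));
        System.out.println(r.readLine() + r.readLine());  // prints "<a/><b/>"
        r.close();
    }
}
```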
> > > >
> > > > Hope this helps!
> > > >
> > > > Cheers
> > > >
> > > > Arko
> > > >
> > > > On Tue, Feb 21, 2012 at 1:27 PM, Mohit Anchlia <
> mohitanchlia@gmail.com
> > > > >wrote:
> > > >
> > > > > I am trying to find examples that demonstrate using sequence files,
> > > > > including writing to one and then running MapReduce on it, but I am
> > > > > unable to find any. Could you please point me to some examples of
> > > > > sequence files?
> > > > >
> > > > > On Tue, Feb 21, 2012 at 10:25 AM, Bejoy Ks <bejoy.hadoop@gmail.com
> >
> > > > wrote:
> > > > >
> > > > > > Hi Mohit
> > > > > >      AFAIK, XMLLoader in Pig won't be suited for sequence files.
> > > > > > Please post the same to the Pig user group for a workaround.
> > > > > >      A SequenceFile is a preferred option when we want to store
> > > > > > small files in HDFS that need to be processed by MapReduce, as it
> > > > > > stores data in key-value format. Since SequenceFileInputFormat is
> > > > > > available at your disposal, you don't need any custom input formats
> > > > > > for processing the same using MapReduce. It is a cleaner and better
> > > > > > approach compared to just appending small XML file contents into a
> > > > > > big file.
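As a sketch of that approach (the path and the key/value choice here are illustrative assumptions): each small file becomes one record, with the file name as the key and the XML content as the value, and a MapReduce job reads it back through SequenceFileInputFormat.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path seqPath = new Path("/user/mohit/packed.seq");  // illustrative path

        // One record per small file: key = file name, value = file contents.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, seqPath, Text.class, Text.class);
        writer.append(new Text("a.xml"), new Text("<a/>"));
        writer.append(new Text("b.xml"), new Text("<b/>"));
        writer.close();

        // A MapReduce job then reads it by setting SequenceFileInputFormat
        // as the input format; the mapper receives (Text fileName,
        // Text xmlContent) pairs, one per original small file.
    }
}
```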
> > > > > >
> > > > > > On Tue, Feb 21, 2012 at 11:00 PM, Mohit Anchlia <
> > > > mohitanchlia@gmail.com
> > > > > > >wrote:
> > > > > >
> > > > > > > On Tue, Feb 21, 2012 at 9:25 AM, Bejoy Ks <
> > bejoy.hadoop@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Mohit
> > > > > > > >       Rather than just appending the content into a normal text
> > > > > > > > file or so, you can create a sequence file with the individual
> > > > > > > > smaller files' content as values.
> > > > > > > >
> > > > > > > Thanks. I was planning to use Pig's
> > > > > > > org.apache.pig.piggybank.storage.XMLLoader
> > > > > > > for processing. Would it work with a sequence file?
> > > > > > >
> > > > > > > The text file that I was referring to would be in HDFS itself. Is
> > > > > > > it still different from using a sequence file?
> > > > > > >
> > > > > > > > Regards
> > > > > > > > Bejoy.K.S
> > > > > > > >
> > > > > > > > On Tue, Feb 21, 2012 at 10:45 PM, Mohit Anchlia <
> > > > > > mohitanchlia@gmail.com
> > > > > > > > >wrote:
> > > > > > > >
> > > > > > > > > We have small XML files. Currently I am planning to append
> > > > > > > > > these small files to one file in HDFS so that I can take
> > > > > > > > > advantage of splits, larger blocks, and sequential IO. What I
> > > > > > > > > am unsure of is whether it's OK to append one file at a time
> > > > > > > > > to this HDFS file.
> > > > > > > > >
> > > > > > > > > Could someone suggest if this is OK? I would like to know how
> > > > > > > > > others do it.
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
