hadoop-common-user mailing list archives

From Mohit Anchlia <mohitanch...@gmail.com>
Subject Re: Writing small files to one big file in hdfs
Date Wed, 22 Feb 2012 00:13:10 GMT
I need some more help. I wrote a sequence file using the code below, but now
when I run a MapReduce job I get "java.lang.ClassCastException:
org.apache.hadoop.io.LongWritable cannot be cast to
org.apache.hadoop.io.Text", even though I didn't use LongWritable when I
originally wrote to the sequence file.

// Code to write to the sequence file. There is no LongWritable here.
org.apache.hadoop.io.Text key = new org.apache.hadoop.io.Text();
BufferedReader buffer = new BufferedReader(new FileReader(filePath));
String line = null;
org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text();

try {
    writer = SequenceFile.createWriter(fs, conf, path, key.getClass(),
            value.getClass(), SequenceFile.CompressionType.RECORD);
    int i = 1;  // record counter, not otherwise used
    long timestamp = System.currentTimeMillis();
    while ((line = buffer.readLine()) != null) {
        // Every record gets the same key: the timestamp taken before the loop.
        key.set(String.valueOf(timestamp));
        value.set(line);
        writer.append(key, value);
        i++;
    }
} finally {
    // Close the writer so buffered records are flushed to the file.
    org.apache.hadoop.io.IOUtils.closeStream(writer);
    org.apache.hadoop.io.IOUtils.closeStream(buffer);
}
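
A likely cause of this exception, sketched here under assumptions rather than confirmed by the thread: when a job does not set an input format, it falls back to TextInputFormat, which hands the mapper LongWritable byte offsets as keys, so a mapper declared with Text keys fails with exactly this cast error. The sketch below assumes the newer org.apache.hadoop.mapreduce API and a Hadoop release that provides Job.getInstance (older releases use new Job(conf, name) instead). The class name SequenceFileJob, the mapper PassThroughMapper, and the command-line paths are hypothetical.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver class; the name and the paths are illustrative only.
public class SequenceFileJob {

    // The mapper's input types must match the classes passed to
    // SequenceFile.createWriter (Text, Text), not LongWritable/Text.
    public static class PassThroughMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static Job configure(Configuration conf, Path input, Path output)
            throws IOException {
        // Assumption: Job.getInstance is available in the Hadoop release in use.
        Job job = Job.getInstance(conf, "read-sequence-file");
        job.setJarByClass(SequenceFileJob.class);

        // Without this call the job uses TextInputFormat,
        // whose keys are LongWritable byte offsets.
        job.setInputFormatClass(SequenceFileInputFormat.class);

        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0); // map-only, to keep the sketch short
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);
        return job;
    }

    public static void main(String[] args) throws Exception {
        Job job = configure(new Configuration(), new Path(args[0]), new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With SequenceFileInputFormat set, the key and value classes seen by map() are the ones recorded in the sequence file header, which is why the mapper signature uses Text for both.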


On Tue, Feb 21, 2012 at 12:18 PM, Arko Provo Mukherjee <
arkoprovomukherjee@gmail.com> wrote:

> Hi,
>
> I think the following link will help:
> http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
>
> Cheers
> Arko
>
> On Tue, Feb 21, 2012 at 2:04 PM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:
>
> > Sorry, maybe it's something obvious, but I was wondering: when map or
> > reduce gets called, what class would be used for the key and value? If I
> > used "org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text();"
> > would map be called with the Text class?
> >
> > public void map(LongWritable key, Text value, Context context) throws
> > IOException, InterruptedException {
> >
> >
> > On Tue, Feb 21, 2012 at 11:59 AM, Arko Provo Mukherjee <
> > arkoprovomukherjee@gmail.com> wrote:
> >
> > > Hi Mohit,
> > >
> > > I am not sure that I understand your question.
> > >
> > > But you can write into a file using:
> > > BufferedWriter output = new BufferedWriter(
> > >     new OutputStreamWriter(fs.create(my_path, true)));
> > > output.write(data);
> > >
> > > Then you can pass that file as the input to your MapReduce program.
> > >
> > > FileInputFormat.addInputPath(jobconf, new Path(my_path));
> > >
> > > From inside your Map/Reduce methods, I think you should NOT be tinkering
> > > with the input / output paths of that Map/Reduce job.
> > > Cheers
> > > Arko
> > >
> > >
> > > On Tue, Feb 21, 2012 at 1:38 PM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:
> > >
> > > > Thanks. How does MapReduce work on a sequence file? Is there an example
> > > > I can look at?
> > > >
> > > > On Tue, Feb 21, 2012 at 11:34 AM, Arko Provo Mukherjee <
> > > > arkoprovomukherjee@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Let's say all the smaller files are in the same directory.
> > > > >
> > > > > Then you can do:
> > > > >
> > > > > // Output path
> > > > > BufferedWriter output = new BufferedWriter(
> > > > >     new OutputStreamWriter(fs.create(output_path, true)));
> > > > >
> > > > > // Input directory
> > > > > FileStatus[] input_files = fs.listStatus(new Path(input_path));
> > > > >
> > > > > for (int i = 0; i < input_files.length; i++)
> > > > > {
> > > > >     BufferedReader reader = new BufferedReader(
> > > > >         new InputStreamReader(fs.open(input_files[i].getPath())));
> > > > >     String data = reader.readLine();
> > > > >     while (data != null)
> > > > >     {
> > > > >         output.write(data);
> > > > >         output.newLine();          // readLine() strips the line terminator
> > > > >         data = reader.readLine();  // advance, otherwise the loop never ends
> > > > >     }
> > > > >     reader.close();
> > > > > }
> > > > > output.close();
> > > > >
> > > > >
> > > > > In case you have the files in multiple directories, call the code for
> > > > > each of them with different input paths.
> > > > >
> > > > > Hope this helps!
> > > > >
> > > > > Cheers
> > > > >
> > > > > Arko
> > > > >
> > > > > On Tue, Feb 21, 2012 at 1:27 PM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:
> > > > >
> > > > > > I am trying to find examples that demonstrate using sequence files,
> > > > > > including writing to one and then running MapReduce on it, but I am
> > > > > > unable to find any. Could you please point me to some examples of
> > > > > > sequence files?
> > > > > >
> > > > > > On Tue, Feb 21, 2012 at 10:25 AM, Bejoy Ks <bejoy.hadoop@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Mohit
> > > > > > >      AFAIK XMLLoader in Pig won't be suited for sequence files. Please
> > > > > > > post the same to the Pig user group for some workaround.
> > > > > > >      SequenceFile is the preferred option when we want to store small
> > > > > > > files in HDFS that need to be processed by MapReduce, as it stores data
> > > > > > > in key-value format. Since SequenceFileInputFormat is available at your
> > > > > > > disposal, you don't need any custom input formats to process it with
> > > > > > > MapReduce. It is a cleaner and better approach than just appending
> > > > > > > small XML file contents into a big file.
> > > > > > >
> > > > > > > On Tue, Feb 21, 2012 at 11:00 PM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:
> > > > > > >
> > > > > > > > On Tue, Feb 21, 2012 at 9:25 AM, Bejoy Ks <bejoy.hadoop@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Mohit
> > > > > > > > >       Rather than just appending the content into a normal text file
> > > > > > > > > or so, you can create a sequence file with the individual smaller
> > > > > > > > > file content as values.
> > > > > > > > >
> > > > > > > > Thanks. I was planning to use Pig's
> > > > > > > > org.apache.pig.piggybank.storage.XMLLoader
> > > > > > > > for processing. Would it work with a sequence file?
> > > > > > > >
> > > > > > > > This text file that I was referring to would be in HDFS itself. Is it
> > > > > > > > still different from using a sequence file?
> > > > > > > >
> > > > > > > > > Regards
> > > > > > > > > Bejoy.K.S
> > > > > > > > >
> > > > > > > > > On Tue, Feb 21, 2012 at 10:45 PM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > We have small XML files. Currently I am planning to append these
> > > > > > > > > > small files to one file in HDFS so that I can take advantage of
> > > > > > > > > > splits, larger blocks and sequential IO. What I am unsure of is
> > > > > > > > > > whether it's OK to append one file at a time to this HDFS file.
> > > > > > > > > >
> > > > > > > > > > Could someone suggest if this is OK? I would like to know how others
> > > > > > > > > > do it.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
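
Bejoy's suggestion in the quoted thread, storing each small file's content as a value in one sequence file, can be sketched roughly as follows. This is an illustration under assumptions, not code from the thread: the class name SmallFilePacker and the method pack are hypothetical, the file name is used as the key, each file is assumed to fit in memory, and FileStatus.isDirectory() is assumed to be available (older releases offer isDir() instead).

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical helper; names are illustrative only.
public class SmallFilePacker {

    // Writes every regular file directly under inputDir into seqFile as one
    // (file name, file bytes) record and returns the number of files packed.
    // seqFile should live outside inputDir so it is not listed as an input.
    public static int pack(FileSystem fs, Configuration conf, Path inputDir, Path seqFile)
            throws IOException {
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, seqFile, Text.class, BytesWritable.class);
        int packed = 0;
        try {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isDirectory()) {
                    continue; // skip subdirectories
                }
                // Assumption: each small file fits comfortably in memory.
                byte[] contents = new byte[(int) status.getLen()];
                FSDataInputStream in = fs.open(status.getPath());
                try {
                    in.readFully(0, contents);
                } finally {
                    in.close();
                }
                writer.append(new Text(status.getPath().getName()),
                        new BytesWritable(contents));
                packed++;
            }
        } finally {
            writer.close();
        }
        return packed;
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        int packed = pack(fs, conf, new Path(args[0]), new Path(args[1]));
        System.out.println("Packed " + packed + " files");
    }
}
```

A job that reads this file would declare its mapper as Mapper<Text, BytesWritable, ...> and set SequenceFileInputFormat as the input format, so each map() call receives one whole small file.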
