hbase-user mailing list archives

From Shuja Rehman <shujamug...@gmail.com>
Subject Re: Bulk Load Sample Code
Date Wed, 10 Nov 2010 19:12:17 GMT
On Wed, Nov 10, 2010 at 9:20 PM, Stack <stack@duboce.net> wrote:

> What you need?  bulk-upload, in the scheme of things, is a well
> documented feature.  It's also one that has had some exercise and is
> known to work well.  For a 0.89 release and trunk, documentation is
> here: http://hbase.apache.org/docs/r0.89.20100924/bulk-loads.html.
> The unit test you refer to below is good for figuring how to run a job
> (Bulk-upload was redone for 0.89/trunk and is much improved over what
> was available in 0.20.x)
>

I need to load data into HBase using HFiles.

OK, let me tell you what I understand from all this. Basically, there are
two ways to bulk load into HBase:

1- Using the command-line tools (importtsv, completebulkload)
2- A MapReduce job using HFileOutputFormat

At the moment, I generate the HFiles using HFileOutputFormat and load them
into HBase with the completebulkload command-line tool. Here is my basic
code skeleton; correct me if I am doing anything wrong.

Configuration conf = new Configuration();
Job job = new Job(conf, "myjob");

FileInputFormat.setInputPaths(job, input);
job.setJarByClass(ParserDriver.class);
job.setMapperClass(MyParserMapper.class);
job.setNumReduceTasks(1);
job.setInputFormatClass(XmlInputFormat.class);
job.setOutputFormatClass(HFileOutputFormat.class);
job.setOutputKeyClass(ImmutableBytesWritable.class);
job.setOutputValueClass(Put.class);
job.setReducerClass(PutSortReducer.class);

Path outPath = new Path(output);
FileOutputFormat.setOutputPath(job, outPath);
job.waitForCompletion(true);

and here is the mapper skeleton:

public class MyParserMapper extends
    Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // parse the input record, then emit one Put per row
    Put put = new Put(rowId);
    put.add(...);
    context.write(new ImmutableBytesWritable(rowId), put);
  }
}

The link says:

"In order to function efficiently, HFileOutputFormat must be configured such
that each output HFile fits within a single region. In order to do this,
jobs use Hadoop's TotalOrderPartitioner class to partition the map output
into disjoint ranges of the key space, corresponding to the key ranges of
the regions in the table."
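The idea behind TotalOrderPartitioner can be illustrated with plain Java (a
conceptual sketch only, not Hadoop's actual class): sorted split points carve
the key space into disjoint ranges, and each key is routed to the reducer
that owns its range, so every reducer writes one totally sorted, disjoint
slice of the output.

```java
import java.util.Arrays;

// Conceptual illustration of range partitioning (NOT Hadoop's
// TotalOrderPartitioner): sorted split points define disjoint key ranges,
// analogous to region boundaries in an HBase table.
public class RangePartitionDemo {

    // Returns the partition index for a key given sorted split points.
    // Keys below splits[0] go to partition 0, keys in [splits[0], splits[1])
    // to partition 1, and so on.
    static int partitionFor(String key, String[] splits) {
        int idx = Arrays.binarySearch(splits, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }

    public static void main(String[] args) {
        String[] splits = {"g", "p"}; // three partitions / "regions"
        System.out.println(partitionFor("apple", splits)); // partition 0
        System.out.println(partitionFor("mango", splits)); // partition 1
        System.out.println(partitionFor("zebra", splits)); // partition 2
    }
}
```

With the default hash partitioner, "apple" and "zebra" could land on the
same reducer, so no single reducer's output would be a contiguous, sorted
key range; range partitioning is what makes each HFile fit one region.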

Now, according to my configuration above, where do I need to set
TotalOrderPartitioner? Should I also add the following line:

job.setPartitionerClass(TotalOrderPartitioner.class);
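For what it's worth, TestHFileOutputFormat in 0.89/trunk does not appear to
call job.setPartitionerClass() by hand; it calls
HFileOutputFormat.configureIncrementalLoad(job, table), which reads the
table's region start keys, writes the partitions file, and installs
TotalOrderPartitioner and PutSortReducer for you. A minimal driver sketch,
untested here, assuming a running cluster and that the target table
"mytable" (a placeholder name) already exists with its regions created:

```java
// Sketch only: replaces the manual setOutputFormatClass / setReducerClass /
// setPartitionerClass wiring in the skeleton above.
Configuration conf = new Configuration();
Job job = new Job(conf, "myjob");
job.setJarByClass(ParserDriver.class);
job.setMapperClass(MyParserMapper.class);
job.setInputFormatClass(XmlInputFormat.class);
FileInputFormat.setInputPaths(job, input);
FileOutputFormat.setOutputPath(job, new Path(output));

// "mytable" is an assumption -- substitute your own table name.
HTable table = new HTable(conf, "mytable");
// Sets the output format, key/value classes, PutSortReducer, and
// TotalOrderPartitioner with split points taken from the table's regions.
HFileOutputFormat.configureIncrementalLoad(job, table);

job.waitForCompletion(true);
```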



On totalorderpartition, this is a partitioner class from hadoop.  The
> MR partitioner -- the class that dictates which reducers get what map
> outputs -- is pluggable. The default partitioner does a hash of the
> output key to figure which reducer.  This won't work if you are
> looking to have your hfile output totally sorted.
>
>


> If you can't figure what its about, I'd suggest you check out the
> hadoop book where it gets a good explication.
>
Which book? O'Reilly's "Hadoop: The Definitive Guide" (June 2009)?

On incremental upload, the doc. suggests you look at the output for
> LoadIncrementalHFiles command.  Have you done that?  You run the
> command and it'll add in whatever is ready for loading.
>

I have just used the command-line tool for the bulk upload, but I have not
yet looked at the LoadIncrementalHFiles class to do it programmatically.
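From the docs, the programmatic equivalent of the completebulkload tool
looks to be roughly the following (a sketch, untested here; assumes the
job's HFile output directory and an existing table, with "mytable" a
placeholder name):

```java
// Sketch: load the HFiles produced by the MR job into the table
// programmatically, instead of running the completebulkload tool.
// Requires a running HBase cluster.
Configuration conf = new Configuration();
HTable table = new HTable(conf, "mytable"); // "mytable" is an assumption
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
// outPath is the HFileOutputFormat output directory from the job above
loader.doBulkLoad(outPath, table);
```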




>
> St.Ack
>
>
> On Wed, Nov 10, 2010 at 6:47 AM, Shuja Rehman <shujamughal@gmail.com>
> wrote:
> > Hey Community,
> >
> > Well...it seems that nobody has experience with the bulk-load option. I
> > have found one class which might help in writing the code for it.
> >
> >
> https://svn.apache.org/repos/asf/hbase/trunk/src/test/java/org/apache/hadoop/hbase/mapreduce/TestHFileOutputFormat.java
> >
> > From this, you can get the idea how to write map reduce job to output in
> > HFiles format. But There is a little confusion about these two things
> >
> > 1-TotalOrderPartitioner
> > 2-configureIncrementalLoad
> >
> > Does anybody have idea about how these things and how to configure it for
> > the job?
> >
> > Thanks
> >
> >
> >
> > On Wed, Nov 10, 2010 at 1:02 AM, Shuja Rehman <shujamughal@gmail.com>
> wrote:
> >
> >> Hi
> >>
> >> I am trying to investigate the bulk load option as described in the
> >> following link.
> >>
> >> http://hbase.apache.org/docs/r0.89.20100621/bulk-loads.html
> >>
> >> Does anybody have sample code or have used it before?
> >> Can it be helpful to insert data into an existing table? In my scenario, I
> >> have one table with one column family, into which data will be inserted
> >> every 15 minutes.
> >>
> >> Kindly share your experiences
> >>
> >> Thanks
> >> --
> >> Regards
> >> Shuja-ur-Rehman Baig
> >> <http://pk.linkedin.com/in/shujamughal>
> >>
> >>
> >
> >
> > --
> > Regards
> > Shuja-ur-Rehman Baig
> > <http://pk.linkedin.com/in/shujamughal>
> >
>



-- 
Regards
Shuja-ur-Rehman Baig
<http://pk.linkedin.com/in/shujamughal>
