hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: Bulk Load Sample Code
Date Wed, 10 Nov 2010 19:57:30 GMT
On Wed, Nov 10, 2010 at 11:53 AM, Shuja Rehman <shujamughal@gmail.com> wrote:
> Oh! I think you have not read the full post. The essay has 3 paragraphs  :)
>
> *Should I also add the following line?*
>
>  job.setPartitionerClass(TotalOrderPartitioner.class);
>

You need to specify a partitioner other than the default, so yes, the
above seems necessary.  (Be aware that with only one reducer, everything
may appear to work even though your partitioner is bad; it is only with
multiple reducers that a bad partitioner will show.)

> Which book? O'Reilly, Hadoop: The Definitive Guide, June 2009?
>

Yes.  Or 2nd edition, October 2010.

St.Ack

>
>
>
> On Thu, Nov 11, 2010 at 12:49 AM, Stack <stack@duboce.net> wrote:
>
>> Which two questions?  (You wrote an essay that looked like one big
>> question -- smile.)
>> St.Ack
>>
>> On Wed, Nov 10, 2010 at 11:44 AM, Shuja Rehman <shujamughal@gmail.com>
>> wrote:
>> > Yeah, I tried it and it did not fail. Can you answer the other 2
>> > questions as well?
>> >
>> >
>> >
>> > On Thu, Nov 11, 2010 at 12:15 AM, Stack <stack@duboce.net> wrote:
>> >
>> >> All below looks reasonable (I did not do a detailed review of your
>> >> code posting).  Have you tried it?  Did it fail?
>> >> St.Ack
>> >>
>> >> On Wed, Nov 10, 2010 at 11:12 AM, Shuja Rehman <shujamughal@gmail.com>
>> >> wrote:
>> >> > On Wed, Nov 10, 2010 at 9:20 PM, Stack <stack@duboce.net> wrote:
>> >> >
>> >> >> What you need?  Bulk-upload, in the scheme of things, is a well
>> >> >> documented feature.  It's also one that has had some exercise and
>> >> >> is known to work well.  For a 0.89 release and trunk,
>> >> >> documentation is here:
>> >> >> http://hbase.apache.org/docs/r0.89.20100924/bulk-loads.html.
>> >> >> The unit test you refer to below is good for figuring how to run
>> >> >> a job.  (Bulk-upload was redone for 0.89/trunk and is much
>> >> >> improved over what was available in 0.20.x.)
>> >> >>
>> >> >
>> >> > *I need to load data into hbase using HFiles.*
>> >> >
>> >> > Ok, let me tell you what I understand from all this. Basically
>> >> > there are two ways to bulk load into hbase:
>> >> >
>> >> > 1- Using command line tools (importtsv, completebulkload)
>> >> > 2- A mapreduce job using HFileOutputFormat
>> >> >
>> >> > At the moment, I have generated the HFiles using HFileOutputFormat
>> >> > and am loading these files into hbase using the completebulkload
>> >> > command line tool. Here is my basic code skeleton. Correct me if I
>> >> > am doing anything wrong.
>> >> >
>> >> > Configuration conf = new Configuration();
>> >> > Job job = new Job(conf, "myjob");
>> >> >
>> >> > FileInputFormat.setInputPaths(job, input);
>> >> > job.setJarByClass(ParserDriver.class);
>> >> > job.setMapperClass(MyParserMapper.class);
>> >> > job.setNumReduceTasks(1);
>> >> > job.setInputFormatClass(XmlInputFormat.class);
>> >> > job.setOutputFormatClass(HFileOutputFormat.class);
>> >> > job.setOutputKeyClass(ImmutableBytesWritable.class);
>> >> > job.setOutputValueClass(Put.class);
>> >> > job.setReducerClass(PutSortReducer.class);
>> >> >
>> >> > Path outPath = new Path(output);
>> >> > FileOutputFormat.setOutputPath(job, outPath);
>> >> > job.waitForCompletion(true);
>> >> >
>> >> > and here is the mapper skeleton:
>> >> >
>> >> > public class MyParserMapper extends
>> >> >     Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
>> >> >
>> >> >   @Override
>> >> >   protected void map(LongWritable key, Text value, Context context)
>> >> >       throws IOException, InterruptedException {
>> >> >     Put put = new Put(rowId);
>> >> >     put.add(...);
>> >> >     context.write(new ImmutableBytesWritable(rowId), put);
>> >> >   }
>> >> > }
>> >> >
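The completebulkload step mentioned above looks roughly like the following on the command line, per the bulk-loads doc; the jar version, output path, and table name here are placeholders:

```shell
# Move the HFiles produced by the job into the live table 'mytable'.
# Jar name, path, and table name are placeholders for your setup.
hadoop jar hbase-VERSION.jar completebulkload /user/shuja/output mytable
```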
>> >> > The link says:
>> >> > *In order to function efficiently, HFileOutputFormat must be
>> >> > configured such that each output HFile fits within a single region.
>> >> > In order to do this, jobs use Hadoop's TotalOrderPartitioner class
>> >> > to partition the map output into disjoint ranges of the key space,
>> >> > corresponding to the key ranges of the regions in the table.*
>> >> >
>> >> > Now, according to my configuration above, where do I need to set
>> >> > *TotalOrderPartitioner*? Should I also add the following line?
>> >> >
>> >> > job.setPartitionerClass(TotalOrderPartitioner.class);
>> >> >
>> >> >
>> >> >
>> >> >> On TotalOrderPartitioner: this is a partitioner class from
>> >> >> hadoop.  The MR partitioner -- the class that dictates which
>> >> >> reducers get what map outputs -- is pluggable.  The default
>> >> >> partitioner does a hash of the output key to figure which
>> >> >> reducer.  This won't work if you are looking to have your hfile
>> >> >> output totally sorted.
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >> >> If you can't figure out what it's about, I'd suggest you check
>> >> >> out the hadoop book, where it gets a good explication.
>> >> >
>> >> > Which book? O'Reilly, Hadoop: The Definitive Guide, June 2009?
>> >> >
>> >> >> On incremental upload, the doc suggests you look at the output of
>> >> >> the LoadIncrementalHFiles command.  Have you done that?  You run
>> >> >> the command and it'll add in whatever is ready for loading.
>> >> >>
>> >> >
>> >> > I have just used the command line tool for bulk upload, but have
>> >> > not yet looked at the LoadIncrementalHFiles class to do it from a
>> >> > program.
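If you want to drive it from code, LoadIncrementalHFiles can also be used programmatically. A minimal sketch against the 0.90-era API; the output path and table name are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Move the HFiles written by the MR job into the live table;
    // the path and table name here are placeholders.
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    HTable table = new HTable(conf, "mytable");
    loader.doBulkLoad(new Path("/user/shuja/output"), table);
  }
}
```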
>> >> >
>> >> >
>> >> >  ------------------------------
>> >> >
>> >> >
>> >> >>
>> >> >> St.Ack
>> >> >>
>> >> >>
>> >> >> On Wed, Nov 10, 2010 at 6:47 AM, Shuja Rehman <shujamughal@gmail.com>
>> >> >> wrote:
>> >> >> > Hey Community,
>> >> >> >
>> >> >> > Well... it seems that nobody has experience with the bulk load
>> >> >> > option. I have found one class which might help to write the
>> >> >> > code for it.
>> >> >> >
>> >> >> >
>> >> >> > https://svn.apache.org/repos/asf/hbase/trunk/src/test/java/org/apache/hadoop/hbase/mapreduce/TestHFileOutputFormat.java
>> >> >> >
>> >> >> > From this, you can get the idea of how to write a map reduce
>> >> >> > job to output in HFile format. But there is a little confusion
>> >> >> > about these two things:
>> >> >> >
>> >> >> > 1- TotalOrderPartitioner
>> >> >> > 2- configureIncrementalLoad
>> >> >> >
>> >> >> > Does anybody have an idea of what these are and how to configure
>> >> >> > them for the job?
>> >> >> >
>> >> >> > Thanks
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On Wed, Nov 10, 2010 at 1:02 AM, Shuja Rehman <shujamughal@gmail.com>
>> >> >> > wrote:
>> >> >> >
>> >> >> >> Hi
>> >> >> >>
>> >> >> >> I am trying to investigate the bulk load option as described
>> >> >> >> in the following link:
>> >> >> >>
>> >> >> >> http://hbase.apache.org/docs/r0.89.20100621/bulk-loads.html
>> >> >> >>
>> >> >> >> Does anybody have sample code or have used it before?
>> >> >> >> Can it be helpful for inserting data into an existing table? In
>> >> >> >> my scenario, I have one table with 1 column family into which
>> >> >> >> data will be inserted every 15 minutes.
>> >> >> >>
>> >> >> >> Kindly share your experiences.
>> >> >> >>
>> >> >> >> Thanks
>> >> >> >> --
>> >> >> >> Regards
>> >> >> >> Shuja-ur-Rehman Baig
>> >> >> >> <http://pk.linkedin.com/in/shujamughal>
>> >> >> >>
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > Regards
>> >> >> > Shuja-ur-Rehman Baig
>> >> >> > <http://pk.linkedin.com/in/shujamughal>
>> >> >> >
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Regards
>> >> > Shuja-ur-Rehman Baig
>> >> > <http://pk.linkedin.com/in/shujamughal>
>> >> >
>> >>
>> >
>> >
>> >
>> > --
>> > Regards
>> > Shuja-ur-Rehman Baig
>> > <http://pk.linkedin.com/in/shujamughal>
>> >
>>
>
>
>
> --
> Regards
> Shuja-ur-Rehman Baig
> <http://pk.linkedin.com/in/shujamughal>
>
