hbase-user mailing list archives

From Vladimir Rodionov <vrodio...@carrieriq.com>
Subject RE: How to generate a large dataset quickly.
Date Mon, 14 Apr 2014 18:15:55 GMT
There is no need to run M/R unless your cluster is large (very large).
A single multithreaded client can easily ingest tens of thousands of rows per second.
Check out the YCSB benchmark tool, for example.
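As a rough sketch of such a client (0.94-era API; the table name, family, and thread/row counts are hypothetical), each thread gets its own HTable, since HTable is not thread-safe, and buffers puts client-side:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ParallelIngest {
    public static void main(String[] args) throws Exception {
        final Configuration conf = HBaseConfiguration.create();
        int threads = 8;                 // tune to your client machine
        final int rowsPerThread = 100000;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int id = t;
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        // One HTable per thread: HTable is not thread-safe.
                        HTable table = new HTable(conf, "benchmark"); // hypothetical table
                        table.setAutoFlush(false);                    // buffer puts client-side
                        table.setWriteBufferSize(8 * 1024 * 1024);
                        for (int i = 0; i < rowsPerThread; i++) {
                            Put p = new Put(Bytes.toBytes(id + "-" + i));
                            p.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes(i));
                            table.put(p);
                        }
                        table.close(); // flushes any remaining buffered puts
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```

Disabling auto-flush is what makes the single-client throughput claim plausible: puts are shipped to region servers in batches rather than one RPC each.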

Make sure you disable both region splitting and major compaction during data ingestion,
and pre-split the table's regions accordingly, to improve overall performance.
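A minimal sketch of that setup (0.94-era admin API; table and family names are hypothetical): pre-split at table creation, set a very large max file size so size-based splits don't fire mid-ingest, and zero out the time-based major compaction interval:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class PreSplitTable {

    // Evenly spaced single-byte split points: n regions need n-1 split keys.
    static byte[][] makeSplits(int numRegions) {
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            splits[i - 1] = new byte[] { (byte) (i * 256 / numRegions) };
        }
        return splits;
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // 0 disables time-based major compactions for the ingest window.
        conf.setLong("hbase.hregion.majorcompaction", 0L);

        HTableDescriptor desc = new HTableDescriptor("benchmark"); // hypothetical name
        desc.addFamily(new HColumnDescriptor("f"));
        // A huge region size keeps size-based splitting from kicking in mid-load.
        desc.setMaxFileSize(100L * 1024 * 1024 * 1024);

        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.createTable(desc, makeSplits(16)); // 16 pre-split regions
        admin.close();
    }
}
```

The split-key scheme above assumes roughly uniform row keys; with non-uniform keys you would choose split points matching your key distribution instead.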

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

From: Ted Yu [yuzhihong@gmail.com]
Sent: Monday, April 14, 2014 9:16 AM
To: user@hbase.apache.org
Subject: Re: How to generate a large dataset quickly.

I looked at revision history for HFileOutputFormat.java
There was one patch, HBASE-8949, which went into 0.94.11, but it shouldn't
affect throughput much.

If you can use Ganglia (or a similar tool) to pinpoint what caused the
low ingest rate, that would give us more clues.

BTW, is upgrading to a newer release, such as 0.98.1 (which contains
HBASE-8755), an option for you?


On Mon, Apr 14, 2014 at 5:41 AM, Guillermo Ortiz <konstt2000@gmail.com> wrote:

> I'm using 0.94.6-cdh4.4.0,
> and I use bulk load:
> FileInputFormat.addInputPath(job, new Path(INPUT_FOLDER));
> FileOutputFormat.setOutputPath(job, hbasePath);
> HTable table = new HTable(jConf, HBASE_TABLE);
> HFileOutputFormat.configureIncrementalLoad(job, table);
> It seems to take a really long time when it starts to execute the Puts
> to HBase in the reduce phase.
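For context, the snippet above can be filled out into a complete bulk-load driver on the 0.94 API; this is a hedged sketch, with hypothetical paths and table name, and the mapper left as a placeholder since the original doesn't show it:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoad {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "bulkload"); // 0.94-era constructor
        job.setJarByClass(BulkLoad.class);
        // job.setMapperClass(...): a mapper emitting
        // ImmutableBytesWritable / KeyValue pairs goes here.

        Path input = new Path("/user/example/input");   // hypothetical paths
        Path hfiles = new Path("/user/example/hfiles");
        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, hfiles);

        HTable table = new HTable(conf, "benchmark");   // hypothetical table
        // Wires the reducer, total-order partitioner, and HFile output format,
        // matching reduce partitions to the table's current region boundaries.
        HFileOutputFormat.configureIncrementalLoad(job, table);

        if (job.waitForCompletion(true)) {
            // Move the generated HFiles into the regions: a fast metadata
            // operation, as opposed to pushing Puts through the write path.
            new LoadIncrementalHFiles(conf).doBulkLoad(hfiles, table);
        }
        table.close();
    }
}
```

Note that configureIncrementalLoad creates one reduce partition per region, which is why pre-splitting the table also parallelizes the reduce phase.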
> 2014-04-14 14:35 GMT+02:00 Ted Yu <yuzhihong@gmail.com>:
> > Which hbase release did you run mapreduce job ?
> >
> > Cheers
> >
> > On Apr 14, 2014, at 4:50 AM, Guillermo Ortiz <konstt2000@gmail.com>
> wrote:
> >
> > > I want to create a large dataset for HBase with different versions and
> > > numbers of rows: about 10M rows with 100 versions each, to do some
> > > benchmarks.
> > >
> > > What's the fastest way to create it? I'm generating the dataset with a
> > > MapReduce job of 100,000 rows and 10 versions. It takes 17 minutes and
> > > produces around 7 GB. I don't know if I could do it more quickly. The
> > > bottleneck is when the mappers write their output and when it is
> > > transferred to the reducers.
> >

