hbase-user mailing list archives

From lars hofhansl <la...@apache.org>
Subject Re: How to generate a large dataset quickly.
Date Mon, 14 Apr 2014 20:51:07 GMT
+1 to what Vladimir said.
For the Puts in question, you can also disable the write-ahead log (WAL) and issue a flush
on the table after your ingest.
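
In the 0.94 client API that looks roughly like this (untested sketch; the table, row,
and family/qualifier names are placeholders):

// Uses org.apache.hadoop.hbase.client.{HTable, Put, HBaseAdmin} from 0.94.
Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "mytable");        // placeholder table name
Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
put.setWriteToWAL(false);                          // skip the WAL for this Put
table.put(put);
table.close();

// Once the ingest is done, force the memstores out to HFiles:
HBaseAdmin admin = new HBaseAdmin(conf);
admin.flush("mytable");
admin.close();

Keep in mind that skipping the WAL means unflushed data is lost if a region server
dies mid-ingest, which is usually fine for a regenerable benchmark dataset.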

-- Lars


----- Original Message -----
From: Vladimir Rodionov <vrodionov@carrieriq.com>
To: "user@hbase.apache.org" <user@hbase.apache.org>
Cc: 
Sent: Monday, April 14, 2014 11:15 AM
Subject: RE: How to generate a large dataset quickly.

There is no need to run M/R unless your cluster is large (very large).
A single multithreaded client can easily ingest tens of thousands of rows per second.
Check the YCSB benchmark tool, for example.
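
For example, each thread can drive its own HTable with the client-side write buffer
turned on (HTable itself is not thread-safe). A rough, untested sketch against the
0.94 API; the table/family names and row count are placeholders:

// One HTable per thread; rows are striped across threads by id.
final Configuration conf = HBaseConfiguration.create();
final int threads = 8;
ExecutorService pool = Executors.newFixedThreadPool(threads);
for (int t = 0; t < threads; t++) {
  final int id = t;
  pool.submit(new Runnable() {
    public void run() {
      try {
        HTable table = new HTable(conf, "mytable");  // placeholder table name
        table.setAutoFlush(false);                   // buffer Puts client-side
        table.setWriteBufferSize(8 * 1024 * 1024);   // 8 MB write buffer
        for (long i = id; i < 10000000L; i += threads) {
          Put p = new Put(Bytes.toBytes(String.format("row%010d", i)));
          p.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
          table.put(p);
        }
        table.flushCommits();                        // push out remaining buffered Puts
        table.close();
      } catch (IOException e) {
        e.printStackTrace();
      }
    }
  });
}
pool.shutdown();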

Make sure you disable both region splitting and major compaction during data ingestion,
and pre-split the table's regions accordingly, to improve overall performance.
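
For example, at table-creation time (untested sketch, 0.94 API; the key range and
region count are made up for illustration):

HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor("mytable");   // placeholder name
desc.addFamily(new HColumnDescriptor("f"));
desc.setMaxFileSize(100L * 1024 * 1024 * 1024);            // huge max file size => no size-based splits during the load
admin.createTable(desc, Bytes.toBytes("row0000000000"),    // pre-split into 20 regions
                  Bytes.toBytes("row0009999999"), 20);     // across the expected key range
admin.close();

Periodic major compactions can be switched off by setting hbase.hregion.majorcompaction
to 0 in hbase-site.xml.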

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

________________________________________

From: Ted Yu [yuzhihong@gmail.com]
Sent: Monday, April 14, 2014 9:16 AM
To: user@hbase.apache.org
Subject: Re: How to generate a large dataset quickly.

I looked at the revision history for HFileOutputFormat.java.
There was one patch, HBASE-8949, which went into 0.94.11, but it shouldn't
affect throughput much.

If you can use Ganglia (or some similar tool) to pinpoint what caused the
low ingest rate, that would give us more clues.

BTW, is upgrading to a newer release, such as 0.98.1 (which contains
HBASE-8755), an option for you?

Cheers


On Mon, Apr 14, 2014 at 5:41 AM, Guillermo Ortiz <konstt2000@gmail.com> wrote:

> I'm using 0.94.6-cdh4.4.0.
>
> I use the bulkload:
> FileInputFormat.addInputPath(job, new Path(INPUT_FOLDER));
> FileOutputFormat.setOutputPath(job, hbasePath);
> HTable table = new HTable(jConf, HBASE_TABLE);
> HFileOutputFormat.configureIncrementalLoad(job, table);
>
> It seems that it takes a really long time when it starts to write the Puts
> to HBase in the reduce phase.
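
(For reference: with configureIncrementalLoad the reducers only write HFiles; the load
is normally completed afterwards with LoadIncrementalHFiles, roughly as below, reusing
the job, jConf, hbasePath, and table variables from the snippet above.)

if (job.waitForCompletion(true)) {
  // Hand the generated HFiles over to the region servers:
  LoadIncrementalHFiles loader = new LoadIncrementalHFiles(jConf);
  loader.doBulkLoad(hbasePath, table);
}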
>
>
>
> 2014-04-14 14:35 GMT+02:00 Ted Yu <yuzhihong@gmail.com>:
>
> > Which HBase release did you run the MapReduce job on?
> >
> > Cheers
> >
> > On Apr 14, 2014, at 4:50 AM, Guillermo Ortiz <konstt2000@gmail.com> wrote:
> >
> > > I want to create a large dataset for HBase with different versions and
> > > numbers of rows. It's about 10M rows and 100 versions, to do some
> > > benchmarks.
> > >
> > > What's the fastest way to create it? I'm generating the dataset with a
> > > MapReduce job of 100,000 rows and 10 versions. It takes 17 minutes and
> > > the size is around 7 GB. I don't know if I could do it more quickly. The
> > > bottleneck is when the mappers write their output and when that output
> > > is transferred to the reducers.
> >
>

