hbase-user mailing list archives

From Rama Ramani <rama.ram...@live.com>
Subject Re: HBase - bulk loading files
Date Sat, 10 Jan 2015 17:08:14 GMT
I am looking for a way to avoid region server hotspotting while doing a bulk load. My input
files to ImportTsv are extracted from a relational store and have monotonically increasing
IDs.
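
One common mitigation, assuming the target table can be created ahead of the load, is to
pre-split it on salt-prefix boundaries so that writes spread across region servers from the
start. A minimal sketch against the 0.98-era client API; the table name, column family and
bucket count below are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("mytable")); // illustrative name
        desc.addFamily(new HColumnDescriptor("cf"));                                // illustrative family

        // 16 buckets (key prefixes "00".."15") -> 15 split points -> 16 regions.
        byte[][] splitKeys = new byte[15][];
        for (int i = 1; i <= 15; i++) {
          splitKeys[i - 1] = Bytes.toBytes(String.format("%02d", i));
        }
        admin.createTable(desc, splitKeys);
        admin.close();
      }
    }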


Alternatively, is there a way for ImportTsv to generate its own row key (one that does not
increase monotonically) and load the column data from the input files? If there are no options
to bulk load using this tool and spread the load, then I will just write code to generate the
row key and use the HBase API for loading. Just wanted to confirm with the experts on this DL.
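
If it does come to writing code, one option is to derive a bucket from the monotonically
increasing id and prepend it to the row key before writing through the client API. A minimal
sketch, assuming the pre-split table above; the column family, qualifier, value and batch size
are illustrative:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedRowKeyLoader {
      private static final int BUCKETS = 16;  // must match the table's pre-split points

      // Prepend a bucket derived from the id so consecutive ids land in different regions;
      // the original id stays in the key, so point lookups by id remain possible.
      static byte[] rowKey(long id) {
        int bucket = (int) (id % BUCKETS);
        return Bytes.toBytes(String.format("%02d-%019d", bucket, id));
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");   // illustrative table name
        List<Put> batch = new ArrayList<Put>();
        for (long id = 1; id <= 100000; id++) {       // ids as extracted from the relational store
          Put put = new Put(rowKey(id));
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes("value-" + id));
          batch.add(put);
          if (batch.size() == 1000) {                 // send in batches of 1000 puts
            table.put(batch);
            batch.clear();
          }
        }
        if (!batch.isEmpty()) {
          table.put(batch);
        }
        table.close();
      }
    }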


Thanks

From: Ted Yu
Sent: Friday, January 9, 2015 2:14 PM
To: user@hbase.apache.org

Salted buckets seem to be a concept from other projects, such as Phoenix.

Can you be a bit more specific about your requirement?

Cheers

On Fri, Jan 9, 2015 at 12:53 PM, Rama Ramani <rama.ramani@live.com> wrote:

> Is there a way to specify Salted buckets with HBase ImportTsv while doing
> bulk load?
>
> Thanks
> Rama
>
> From: rama.ramani@live.com
> To: user@hbase.apache.org
> Subject: RE: HBase - bulk loading files
> Date: Fri, 19 Dec 2014 14:09:09 -0800
>
> HBase 0.98.0.2.1.9.0-2196-hadoop2
> Hadoop 2.4.0.2.1.9.0-2196
> Subversion git@github.com:hortonworks/hadoop-monarch.git -r cb50542bc92fb77dee52
>
> No, the clusters were not taking additional load.
>
> Thanks
> Rama
> > Date: Fri, 19 Dec 2014 13:50:30 -0800
> > Subject: Re: HBase - bulk loading files
> > From: yuzhihong@gmail.com
> > To: user@hbase.apache.org
> >
> > Can you let us know the HBase and Hadoop versions you're using?
> >
> > Were the clusters taking load from other sources when ImportTsv was
> > running?
> >
> > Cheers
> >
> > On Fri, Dec 19, 2014 at 1:43 PM, Rama Ramani <rama.ramani@live.com>
> wrote:
> >
> > > Hello,
> > > I am bulk loading a set of files (about 400MB each) with "|" as the
> > > delimiter using ImportTsv. It takes a long time for the 'map' job to
> > > complete on both a 4 node and a 16 node cluster. I tried the option to
> > > generate the output (providing -Dimporttsv.bulk.output), which took time,
> > > indicating that the generation of the output files needs improvement.
> > > I am seeing about 8000 rows/sec for this dataset; the 400MB ingestion
> > > takes about 5-6 mins. How can I improve this? Is there an alternate
> > > tool I can use?
> > >
> > > Thanks
> > > Rama
>
>
>
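
For completeness on the -Dimporttsv.bulk.output path mentioned above: the HFiles ImportTsv
writes there still need to be handed off to the regions via the completebulkload step. A
minimal sketch of doing that through the Java API; the table name and output directory are
illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

    public class CompleteBulkLoad {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");                  // illustrative table name
        // Hand the HFiles produced by -Dimporttsv.bulk.output over to the table's regions.
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
        loader.doBulkLoad(new Path("/tmp/importtsv-output"), table); // illustrative output dir
        table.close();
      }
    }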