From: Calvin
To: hbase-user@hadoop.apache.org
Date: Mon, 30 Nov 2009 18:33:25 -0500
Subject: Re: hbase bulk writes

Thanks for the responses. If I can avoid writing a MapReduce job, that would
be preferable (getting MapReduce to work with / depend on my existing
infrastructure is turning out to be annoying).

I have no good way of randomizing my dataset, since it's a very large stream
of sequential data (ordered by some key). I have a fair number of column
families (~25), and every column is a long or a double.

A standalone program that writes rows using the HTable / Put API runs at
roughly 2,000-5,000 rows/sec, which seems ridiculously slow. Is it possible
I'm doing something terribly wrong?
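For concreteness, the write loop looks roughly like this (a simplified
sketch - the table name, column family, row count, and batch size below are
placeholders, not my real schema):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SequentialWriter {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "mytable"); // placeholder name

    // Buffer puts client-side so each row isn't its own RPC.
    table.setAutoFlush(false);
    table.setWriteBufferSize(12 * 1024 * 1024);

    List<Put> batch = new ArrayList<Put>(1000);
    for (long i = 0; i < 10000000L; i++) {
      Put put = new Put(Bytes.toBytes(i)); // sequential keys - the worst case, apparently
      // One long/double column shown; the real rows touch ~25 families.
      put.add(Bytes.toBytes("f1"), Bytes.toBytes("v"), Bytes.toBytes((double) i));
      batch.add(put);
      if (batch.size() == 1000) {
        table.put(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      table.put(batch);
    }
    table.flushCommits(); // push whatever is left in the write buffer
  }
}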
-Calvin

On Mon, Nov 30, 2009 at 5:47 PM, Ryan Rawson wrote:
> Sequentially ordered rows are the worst insert case in HBase - you end
> up writing everything to 1 server even if you have 500. If you can
> randomize your input (I have pasted a Randomize.java MapReduce job that
> randomizes the lines of a file), your performance will improve.
>
> I have seen sustained inserts of 100-300k rows/sec on small rows
> before. Obviously large blob rows will be slower, since the limiting
> factor is how fast we can write data to HDFS; it isn't the actual row
> count but the amount of data involved.
>
> Try Randomize.java and see where that gets you. I think it's on the
> list archives.
>
> -ryan
>
> On Mon, Nov 30, 2009 at 2:41 PM, Jean-Daniel Cryans wrote:
> > Could you put your data in HDFS and load it from there with a
> > MapReduce job?
> >
> > J-D
> >
> > On Mon, Nov 30, 2009 at 2:33 PM, Calvin wrote:
> >> I have a large number of sequentially ordered rows I would like to
> >> write to an HBase table. What is the preferred way to do bulk writes
> >> of multi-column tables in HBase? Using the get/put interface seems
> >> fairly slow even if I batch writes with table.put(List<Put>).
> >>
> >> I have followed the directions on:
> >> * http://wiki.apache.org/hadoop/PerformanceTuning
> >> * http://ryantwopointoh.blogspot.com/2009/01/performance-of-hbase-importing.html
> >>
> >> Are there any other resources for improving the throughput of my
> >> bulk writes? On
> >> http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/mapreduce/package-summary.html
> >> I see there's a way to write HFiles directly, but HFileOutputFormat
> >> can only write a single column family at a time
> >> (https://issues.apache.org/jira/browse/HBASE-1861).
> >>
> >> Thanks!
> >>
> >> -Calvin
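P.S. If I do end up going the MapReduce route: is Randomize.java roughly
along these lines? This is just my guess at the idea (class names and the
Hadoop 0.20 mapreduce API are my own assumptions here, not the actual
Randomize.java from the archives):

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RandomizeLines {

  // Tag each input line with a random key; the shuffle/sort then
  // scatters the lines into an effectively random order.
  public static class RandomKeyMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    private final Random rand = new Random();
    private final LongWritable outKey = new LongWritable();

    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      outKey.set(rand.nextLong());
      ctx.write(outKey, line);
    }
  }

  // Drop the random key and emit each line unchanged.
  public static class DropKeyReducer
      extends Reducer<LongWritable, Text, Text, NullWritable> {
    protected void reduce(LongWritable key, Iterable<Text> lines, Context ctx)
        throws IOException, InterruptedException {
      for (Text line : lines) {
        ctx.write(line, NullWritable.get());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "randomize-lines");
    job.setJarByClass(RandomizeLines.class);
    job.setMapperClass(RandomKeyMapper.class);
    job.setReducerClass(DropKeyReducer.class);
    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}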