From: Calvin
To: hbase-user@hadoop.apache.org
Date: Mon, 30 Nov 2009 18:33:25 -0500
Subject: Re: hbase bulk writes

Thanks for the responses. If I can avoid writing a MapReduce job, that would
be preferable (getting MapReduce to work with / depend on my existing
infrastructure is turning out to be annoying).

I have no good way of randomizing my dataset, since it's a very large stream
of sequential data (ordered by some key). I have a fair number of column
families (~25), and every column is a long or a double.

A standalone program that writes rows using the HTable / Put API runs at
roughly 2,000-5,000 rows/sec, which seems ridiculously slow. Is it possible
I'm doing something terribly wrong?
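For concreteness, the write loop looks roughly like this (a simplified
sketch - the table name, column family, row count, and batch size below are
placeholders, not my real schema):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SequentialWriter {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "mytable"); // placeholder name

    // Buffer puts client-side so each row isn't its own RPC.
    table.setAutoFlush(false);
    table.setWriteBufferSize(12 * 1024 * 1024);

    List<Put> batch = new ArrayList<Put>(1000);
    for (long i = 0; i < 10000000L; i++) {
      Put put = new Put(Bytes.toBytes(i)); // sequential keys - the worst case, apparently
      // One long/double column shown; the real rows touch ~25 families.
      put.add(Bytes.toBytes("f1"), Bytes.toBytes("v"), Bytes.toBytes((double) i));
      batch.add(put);
      if (batch.size() == 1000) {
        table.put(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      table.put(batch);
    }
    table.flushCommits(); // push whatever is left in the write buffer
  }
}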
-Calvin

On Mon, Nov 30, 2009 at 5:47 PM, Ryan Rawson wrote:
> Sequentially ordered rows are the worst insert case in HBase - you end
> up writing everything to 1 server even if you have 500. If you can
> randomize your input (I have pasted a Randomize.java MapReduce job that
> randomizes the lines of a file), your performance will improve.
>
> I have seen sustained inserts of 100-300k rows/sec on small rows
> before. Obviously large blob rows will be slower, since the limiting
> factor is how fast we can write data to HDFS; it isn't the actual row
> count but the amount of data involved.
>
> Try Randomize.java and see where that gets you. I think it's on the
> list archives.
>
> -ryan
>
> On Mon, Nov 30, 2009 at 2:41 PM, Jean-Daniel Cryans wrote:
> > Could you put your data in HDFS and load it from there with a
> > MapReduce job?
> >
> > J-D
> >
> > On Mon, Nov 30, 2009 at 2:33 PM, Calvin wrote:
> >> I have a large number of sequentially ordered rows I would like to
> >> write to an HBase table. What is the preferred way to do bulk writes
> >> of multi-column tables in HBase? Using the get/put interface seems
> >> fairly slow even if I batch writes with table.put(List<Put>).
> >>
> >> I have followed the directions on:
> >> * http://wiki.apache.org/hadoop/PerformanceTuning
> >> * http://ryantwopointoh.blogspot.com/2009/01/performance-of-hbase-importing.html
> >>
> >> Are there any other resources for improving the throughput of my
> >> bulk writes? On
> >> http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/mapreduce/package-summary.html
> >> I see there's a way to write HFiles directly, but HFileOutputFormat
> >> can only write a single column family at a time
> >> (https://issues.apache.org/jira/browse/HBASE-1861).
> >>
> >> Thanks!
> >>
> >> -Calvin
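P.S. If I do end up going the MapReduce route: is Randomize.java roughly
along these lines? This is just my guess at the idea (class names and the
Hadoop 0.20 mapreduce API are my own assumptions here, not the actual
Randomize.java from the archives):

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RandomizeLines {

  // Tag each input line with a random key; the shuffle/sort then
  // scatters the lines into an effectively random order.
  public static class RandomKeyMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    private final Random rand = new Random();
    private final LongWritable outKey = new LongWritable();

    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      outKey.set(rand.nextLong());
      ctx.write(outKey, line);
    }
  }

  // Drop the random key and emit each line unchanged.
  public static class DropKeyReducer
      extends Reducer<LongWritable, Text, Text, NullWritable> {
    protected void reduce(LongWritable key, Iterable<Text> lines, Context ctx)
        throws IOException, InterruptedException {
      for (Text line : lines) {
        ctx.write(line, NullWritable.get());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "randomize-lines");
    job.setJarByClass(RandomizeLines.class);
    job.setMapperClass(RandomKeyMapper.class);
    job.setReducerClass(DropKeyReducer.class);
    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}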