From: "tim robertson" <timrobertson100@gmail.com>
To: hbase-user@hadoop.apache.org
Date: Thu, 15 Jan 2009 22:44:10 +0100
Subject: Re: Question to speaker (tab file loading) at yesterday's user group

Thank you very much, Ryan - your work will save me a lot of time. I saw
your blog post after I wrote to the list.

> I then rewrote this stuff into a map-reduce and I can now insert 440m
> records in about 70-80 minutes.

That is excellent. Could you elaborate on this, please? You still insert
into HBase, but using MapReduce? Is it an identity map and reducers with
the HBase OutputFormat, or does it all happen in the map, perhaps?

> The hardware:
> - 4 cpu, 128 gb ram
> - 1 tb disk

Is this all one machine, or spread across four?

Thanks for sharing your findings.
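For a sense of scale, the figures quoted above (440M records in 70-80
minutes) imply a sustained insert rate close to 100,000 rows per second.
A quick back-of-envelope check, sketched in Python purely for the
arithmetic:

```python
# Implied HBase insert throughput from the figures quoted above:
# 440M rows inserted in 70-80 minutes.
rows = 440_000_000

fast_run = 70 * 60   # shortest quoted run time, in seconds
slow_run = 80 * 60   # longest quoted run time, in seconds

rate_low = rows / slow_run    # slowest implied rate
rate_high = rows / fast_run   # fastest implied rate

print(f"{rate_low:,.0f} - {rate_high:,.0f} rows/sec")
# prints: 91,667 - 104,762 rows/sec
```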
Tim

> Here are some relevant configs:
>
> hbase-env.sh:
>
> export HBASE_HEAPSIZE=5000
>
> hadoop-site.xml:
>
> <property>
>   <name>dfs.datanode.socket.write.timeout</name>
>   <value>0</value>
> </property>
>
> <property>
>   <name>dfs.datanode.max.xcievers</name>
>   <value>2047</value>
> </property>
>
> <property>
>   <name>dfs.datanode.handler.count</name>
>   <value>10</value>
> </property>
>
> On Wed, Jan 14, 2009 at 11:30 PM, tim robertson wrote:
>
>> Hi all,
>>
>> I was skyping in yesterday from Europe.
>> Being half asleep and on a bad wireless connection, it was not easy to
>> hear at times, and I have some quick questions for the person who was
>> describing his tab-file (CSV?) loading at the beginning.
>>
>> Could you please summarise quickly again the stats you mentioned?
>> Number of rows, file size before loading (was it 7 strings per row?),
>> size after load, time to load, etc.
>>
>> Also, could you please quickly summarise your cluster hardware (spec,
>> RAM + number of nodes)?
>>
>> What did you find sped it up?
>>
>> How many columns per family were you using, and did this affect much?
>> (Presumably fewer means fewer region splits, right?)
>>
>> The reason I ask is that I have around 50 GB in tab files (representing
>> 162M rows from MySQL with around 50 fields - mostly strings of <20
>> chars and ints) and will be loading HBase with this. Once this initial
>> import is done, I will then harvest XML and tab files into HBase
>> directly (storing the raw XML record or tab-file row as well).
>> I am in early testing (awaiting hardware and fed up with EC2), so I am
>> still running code on my laptop with small tests. I have 6 Dell boxes
>> (2 procs, 5 GB memory, SCSI?) being freed up in 3-4 weeks and wonder
>> what performance I will get.
>>
>> Thanks,
>>
>> Tim
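As a rough sanity check on the dataset described in the quoted message
(50 GB of tab files, 162M rows, ~50 fields), the per-row size and a
hypothetical load time work out as below. The 100,000 rows/sec rate is
an assumption loosely based on the 440M-rows-in-roughly-75-minutes
figure earlier in the thread; actual throughput on six smaller Dell
boxes would likely be lower. Python is used only for the arithmetic:

```python
# Back-of-envelope for the dataset in the quoted message:
# ~50 GB of tab files holding ~162M rows.
size_bytes = 50 * 10**9
rows = 162 * 10**6

bytes_per_row = size_bytes / rows
print(f"{bytes_per_row:.0f} bytes/row")   # prints: 309 bytes/row

# Assumed load rate - purely illustrative, not a measured figure.
assumed_rate = 100_000  # rows/sec
minutes = rows / assumed_rate / 60
print(f"{minutes:.0f} minutes")           # prints: 27 minutes
```

Even at a fraction of that rate, the one-off import is small next to the
ongoing XML/tab harvesting described above.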