From: "tim robertson" <timrobertson100@gmail.com>
To: hbase-user@hadoop.apache.org
Date: Thu, 15 Jan 2009 22:44:10 +0100
Subject: Re: Question to speaker (tab file loading) at yesterday's user group

Thank you very much, Ryan - your work will save me a lot of time. I saw
your blog post after I wrote to the list.

> I then rewrote this stuff into a map-reduce and I can now insert 440m
> records in about 70-80 minutes.

That is excellent. Could you elaborate on this, please? You still insert
into HBase, but using MapReduce? Is it an identity map and reducers with
the HBase OutputFormat, or does it all happen in the map, perhaps?

> The hardware:
> - 4 cpu, 128 gb ram
> - 1 tb disk

Is this all one machine, or spread across four?

Thanks for sharing your findings.
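For a sense of scale, the figures quoted above (440M records in 70-80
minutes) imply a sustained insert rate close to 100,000 rows per second.
A quick back-of-envelope check, sketched in Python purely for the
arithmetic:

```python
# Implied HBase insert throughput from the figures quoted above:
# 440M rows inserted in 70-80 minutes.
rows = 440_000_000

fast_run = 70 * 60   # shortest quoted run time, in seconds
slow_run = 80 * 60   # longest quoted run time, in seconds

rate_low = rows / slow_run    # slowest implied rate
rate_high = rows / fast_run   # fastest implied rate

print(f"{rate_low:,.0f} - {rate_high:,.0f} rows/sec")
# prints: 91,667 - 104,762 rows/sec
```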
Tim

> Here are some relevant configs:
>
> hbase-env.sh:
>
> export HBASE_HEAPSIZE=5000
>
> hadoop-site.xml:
>
> <property>
>   <name>dfs.datanode.socket.write.timeout</name>
>   <value>0</value>
> </property>
>
> <property>
>   <name>dfs.datanode.max.xcievers</name>
>   <value>2047</value>
> </property>
>
> <property>
>   <name>dfs.datanode.handler.count</name>
>   <value>10</value>
> </property>
>
> On Wed, Jan 14, 2009 at 11:30 PM, tim robertson wrote:
>
>> Hi all,
>>
>> I was skyping in yesterday from Europe.
>> Being half asleep and on a bad wireless connection, it was not easy to
>> hear at times, and I have some quick questions for the person who was
>> describing his tab-file (CSV?) loading at the beginning.
>>
>> Could you please summarise quickly again the stats you mentioned?
>> Number of rows, file size before loading (was it 7 strings per row?),
>> size after load, time to load, etc.
>>
>> Also, could you please quickly summarise your cluster hardware (spec,
>> RAM + number of nodes)?
>>
>> What did you find sped it up?
>>
>> How many columns per family were you using, and did this affect much?
>> (Presumably fewer means fewer region splits, right?)
>>
>> The reason I ask is that I have around 50 GB in tab files (representing
>> 162M rows from MySQL with around 50 fields - mostly strings of <20
>> chars and ints) and will be loading HBase with this. Once this initial
>> import is done, I will then harvest XML and tab files into HBase
>> directly (storing the raw XML record or tab-file row as well).
>> I am in early testing (awaiting hardware and fed up with EC2), so I am
>> still running code on my laptop with small tests. I have 6 Dell boxes
>> (2 procs, 5 GB memory, SCSI?) being freed up in 3-4 weeks and wonder
>> what performance I will get.
>>
>> Thanks,
>>
>> Tim
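As a rough sanity check on the dataset described in the quoted message
(50 GB of tab files, 162M rows, ~50 fields), the per-row size and a
hypothetical load time work out as below. The 100,000 rows/sec rate is
an assumption loosely based on the 440M-rows-in-roughly-75-minutes
figure earlier in the thread; actual throughput on six smaller Dell
boxes would likely be lower. Python is used only for the arithmetic:

```python
# Back-of-envelope for the dataset in the quoted message:
# ~50 GB of tab files holding ~162M rows.
size_bytes = 50 * 10**9
rows = 162 * 10**6

bytes_per_row = size_bytes / rows
print(f"{bytes_per_row:.0f} bytes/row")   # prints: 309 bytes/row

# Assumed load rate - purely illustrative, not a measured figure.
assumed_rate = 100_000  # rows/sec
minutes = rows / assumed_rate / 60
print(f"{minutes:.0f} minutes")           # prints: 27 minutes
```

Even at a fraction of that rate, the one-off import is small next to the
ongoing XML/tab harvesting described above.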