hbase-user mailing list archives

From "Ryan Rawson" <ryano...@gmail.com>
Subject Re: Question to speaker (tab file loading) at yesterdays user group
Date Thu, 15 Jan 2009 21:12:27 GMT
I think you were referring to my presentation.

I was importing a CSV file, of 6 integers.  Obviously in the CSV file, the
integers were their ASCII representation.  So my code had to atoi() the
strings, then pack them into Thrift records, serialize those, and finally
insert the binary thrift rep into hbase with a key.
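
The parse-and-pack step above can be sketched roughly as follows. This is only an illustration: it uses Python's struct module as a stand-in for the Thrift-generated record class (which is not shown in this message), and the names RECORD, parse_line and to_cell, as well as the choice of the first integer as the row key, are assumptions.

```python
import struct

# Stand-in for a Thrift struct: six signed 32-bit integers, big-endian.
# The real code used Thrift-generated classes; struct is only an illustration.
RECORD = struct.Struct(">6i")

def parse_line(line):
    """Split one CSV line and convert its six ASCII fields to ints."""
    fields = line.strip().split(",")
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got %d" % len(fields))
    return tuple(int(f) for f in fields)

def to_cell(line):
    """Return (row_key, value_bytes) for one CSV line.

    Using the first integer as the row key is an assumption; the original
    message does not say how keys were chosen.
    """
    values = parse_line(line)
    return str(values[0]).encode("ascii"), RECORD.pack(*values)

key, value = to_cell("17,2,3,4,5,6\n")
```

The resulting value is a fixed 24-byte binary blob, which is what would be handed to the HBase insert call along with the key.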

I had 3 versions:
- thrift gateway - this was the slowest, doing 20m records in 6 hours.  The
init code looks like:
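    # Assumes the Thrift-generated HBase bindings are importable as "hbase".
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from hbase import Hbase
    transport = TSocket.TSocket(hbaseMaster, hbasePort)
    transport = TTransport.TBufferedTransport(transport)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = Hbase.Client(protocol)
    transport.open()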

So this uses a buffered transport, but makes no HBase-specific API calls to
set auto-flush or other parameters. This ran in CPython.

- HBase API version #1:
Written in Jython, this is substantially faster, doing 20m records in 70
minutes, or between 4 and 5 per ms.  This performance scales up to at least
6 processes.

- HBase API version #2:
Slightly smarter, I now call:
table.setAutoFlush(False)
table.setWriteBufferSize(1024*1024*12)

And my speed jumps up to between 30-50 inserts per ms, scaling to at least 6
concurrent processes.
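
To illustrate why turning off auto-flush helps, here is a minimal sketch of a client-side write buffer. It is not the HBase client code: BufferedWriter and send_batch are hypothetical names, and the sketch only assumes that puts are queued locally and shipped as one batch once the buffer reaches its size limit.

```python
class BufferedWriter:
    """Hypothetical client-side write buffer; not the HBase API."""

    def __init__(self, send_batch, buffer_bytes=12 * 1024 * 1024):
        self.send_batch = send_batch      # callable that ships a list of puts
        self.buffer_bytes = buffer_bytes  # flush threshold, like the 12 MB above
        self.pending = []
        self.pending_bytes = 0

    def put(self, key, value):
        """Queue one write; flush only once the buffer is full."""
        self.pending.append((key, value))
        self.pending_bytes += len(key) + len(value)
        if self.pending_bytes >= self.buffer_bytes:
            self.flush()

    def flush(self):
        """Send everything queued so far in a single round trip."""
        if self.pending:
            self.send_batch(self.pending)
            self.pending = []
            self.pending_bytes = 0

batches = []
writer = BufferedWriter(batches.append, buffer_bytes=100)
for i in range(10):
    writer.put(b"row%02d" % i, b"x" * 24)  # 5 + 24 = 29 bytes per put
writer.flush()  # ship the remainder, as a final flush would
```

With the tiny 100-byte threshold used here, ten puts go out in three batches instead of ten round trips; a 12 MB buffer amortizes the round trip over far more records.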

I then rewrote this stuff into a map-reduce and I can now insert 440m
records in about 70-80 minutes.
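
For comparison, the figures quoted above can be put on the same scale with a little arithmetic. The helper name records_per_ms is just for illustration; the inputs are the record counts and times from this message.

```python
def records_per_ms(records, minutes):
    """Average insert rate in records per millisecond."""
    return records / (minutes * 60 * 1000)

rates = {
    "thrift gateway": records_per_ms(20_000_000, 6 * 60),
    "API v1": records_per_ms(20_000_000, 70),
    "map-reduce (80 min)": records_per_ms(440_000_000, 80),
    "map-reduce (70 min)": records_per_ms(440_000_000, 70),
}
for name, rate in rates.items():
    print("%-20s %6.1f records/ms" % (name, rate))
```

That works out to roughly 0.9 records per ms through the thrift gateway, about 4.8 for the first API version, and somewhere between 92 and 105 for the map-reduce job.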

As I move forward, I will be emulating bigtable and using either thrift
serialized records or protobufs to store data in cells.  This allows you to
extend the data within individual cells in a forward- and backward-compatible
way.  Until compression is super solid, I would be wary of storing text (xml,
html, etc) in hbase due to size concerns.
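
The compatibility property comes from tagging every field, so readers can skip tags they do not know and fall back to defaults for tags that are missing. Below is a deliberately simplified sketch of that idea. The wire format (2-byte tag plus 8-byte integer) and the names encode and decode are made up for illustration and are not the Thrift or protobuf encodings.

```python
import struct

# Hypothetical, simplified tag-value encoding: each field is a 2-byte tag
# followed by an 8-byte signed integer. Real Thrift/protobuf formats differ.
_FIELD = struct.Struct(">Hq")

def encode(fields):
    """Encode a {tag: int} mapping into bytes, in tag order."""
    return b"".join(_FIELD.pack(tag, value)
                    for tag, value in sorted(fields.items()))

def decode(blob, known_tags, defaults=None):
    """Decode bytes, keeping known tags and skipping unknown ones."""
    result = dict(defaults or {})
    for offset in range(0, len(blob), _FIELD.size):
        tag, value = _FIELD.unpack_from(blob, offset)
        if tag in known_tags:
            result[tag] = value
    return result

old_cell = encode({1: 10, 2: 20})          # written before field 3 existed
new_cell = encode({1: 10, 2: 20, 3: 30})   # written after field 3 was added

# A new reader fills in a default for the field an old cell lacks.
new_reader_view = decode(old_cell, {1, 2, 3}, defaults={3: 0})
# An old reader ignores the field it has never heard of.
old_reader_view = decode(new_cell, {1, 2})
```

Both directions keep working without rewriting existing cells, which is the point of storing serialized records rather than a fixed binary layout.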


The hardware:
- 4 cpu, 128 gb ram
- 1 tb disk

Here are some relevant configs:
hbase-env.sh:
export HBASE_HEAPSIZE=5000

hadoop-site.xml:
<property>
<name>dfs.datanode.socket.write.timeout</name>
<value>0</value>
</property>

<property>
<name>dfs.datanode.max.xcievers</name>
<value>2047</value>
</property>

<property>
<name>dfs.datanode.handler.count</name>
<value>10</value>
</property>

On Wed, Jan 14, 2009 at 11:30 PM, tim robertson
<timrobertson100@gmail.com> wrote:

> Hi all,
>
> I was skyping in yesterday from Europe.
> Being half asleep and on a bad wireless, it was not too easy to hear
> sometimes, and I have some quick questions to the person who was
> describing his tab file (CSV?) loading at the beginning.
>
> Could you please quickly summarise again the stats you mentioned?
> Number of rows, file size before loading (was it 7 strings per row?),
> size after load, time to load, etc.
>
> Also, could you please quickly summarise your cluster hardware (spec,
> ram + number nodes)?
>
> What did you find sped it up?
>
> How many columns per family were you using, and did this affect much
> (presumably fewer means fewer region splits, right?)
>
> The reason I ask is I have around 50G in tab file (representing 162M
> rows from mysql with around 50 fields - strings of <20 chars and int
> mostly) and will be loading HBase with this.  Once this initial import
> is done, I will then harvest XML and Tab files into HBase directly
> (storing the raw XML record or tab file row as well).
> I am in early testing (awaiting hardware and fed up using EC2) so
> still running code on laptop and small tests.  I have 6 dell boxes (2
> proc, 5G memory, SCSI?) being freed up in 3-4 weeks and wonder what
> performance I will get.
>
> Thanks,
>
> Tim
>
