hbase-user mailing list archives

From stack <st...@duboce.net>
Subject Re: Question to speaker (tab file loading) at yesterdays user group
Date Thu, 15 Jan 2009 22:29:34 GMT
Powerset (and others) are profligate and store the content uncompressed 
(I can feel Ryan 'wincing').

The general message on compressed data is that it's lightly tested.  There
may be issues yet to surface.  Be wary.  If you trip over any, surface
them so we can get them fixed, especially as Ryan and others are starting
to report higher throughput when data is compressed (makes sense).

Thanks 'lads',
St.Ack


Ryan Rawson wrote:
> At the user meeting last night, stack noted that since lots of us "lads" are
> seeing performance improvements on random reads when we use compression,
> perhaps a fresh look at making compression solid would be a good thing.
>
> Personally I am just obsessed with on-disk efficiency.  But also, I am
> chasing after random-read latencies so I can serve a website out
> of hbase... if that isn't your need, then perhaps what you want to do would
> be just fine as is?
>
> -ryan
>
> On Thu, Jan 15, 2009 at 2:11 PM, tim robertson <timrobertson100@gmail.com> wrote:
>
>>> Until compression is super solid, I would be wary of storing text (xml,
>>> html, etc) in hbase due to size concerns.
>>
>> Hmmm... Where do the indexing guys store their raw harvested records /
>> HTML / whatever then?
>>
>> I guess mine would be coming in at 200G as text or so, per 100M
>> records (maybe looking to 1Billion records over next 24 months).  Can
>> someone suggest a better place to store the records if not HBase?  I
>> want to be able to serve them as cached records, and also use them as
>> sources for new indexes, without harvesting again.  This is a classic
>> use case for HBase, I thought... I mean, it is even on the HBase
>> architecture page as the example table structure:
>> http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture.  A bit surprised
>> to hear it is not a recommended use.
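[For scale, Tim's figures above work out to about 2 KB of raw text per record. A quick back-of-the-envelope check in plain Python, using only the numbers quoted in the thread and reading "G" as 10^9 bytes:]

```python
# Back-of-the-envelope sizing from the figures in the thread
# (assumption: "G" means 10**9 bytes).
records = 100_000_000            # ~100M records
total_bytes = 200 * 10**9        # ~200G of raw text

bytes_per_record = total_bytes / records
print(bytes_per_record)          # 2000.0 -> about 2 KB per record

# Projected to 1 billion records over the next 24 months:
print(10**9 * bytes_per_record / 10**12)  # 2.0 -> ~2 TB uncompressed
```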
>>
>> Cheers for pointers and sorry for the question bombardment - just
>> trying to catch up.
>>
>> Tim
>>
>> On Thu, Jan 15, 2009 at 10:12 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>>
>>> I think you were referring to my presentation.
>>>
>>> I was importing a CSV file of 6 integers.  Obviously in the CSV file, the
>>> integers were their ASCII representation, so my code had to atoi() the
>>> strings, then pack them into Thrift records, serialize those, and finally
>>> insert the binary Thrift rep into hbase with a key.
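[The atoi-and-pack step Ryan describes can be sketched roughly as below. This is a hypothetical simplification: it packs the six integers with Python's struct module as a stand-in for real Thrift serialization, and `pack_row` is an illustrative name, not an HBase or Thrift API.]

```python
import struct

def pack_row(csv_line: str) -> bytes:
    """Parse a CSV line of 6 integers (their ASCII representation) and
    pack them into a fixed-width binary value -- a stand-in for building
    and serializing a Thrift record."""
    fields = [int(s) for s in csv_line.strip().split(",")]  # the atoi() step
    assert len(fields) == 6
    return struct.pack(">6i", *fields)  # 6 big-endian 32-bit ints = 24 bytes

value = pack_row("1,2,3,40,500,6000")
print(len(value))  # 24 bytes per row, vs. the variable-width ASCII CSV
```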
>>>
>>> I had 3 versions:
>>> - Thrift gateway: this was the slowest, doing 20m records in 6 hours.
>>>   The init code looks like:
>>>
>>>    from thrift.transport import TSocket, TTransport
>>>    from thrift.protocol import TBinaryProtocol
>>>    from hbase import Hbase
>>>
>>>    transport = TSocket.TSocket(hbaseMaster, hbasePort)
>>>    transport = TTransport.TBufferedTransport(transport)
>>>    protocol = TBinaryProtocol.TBinaryProtocol(transport)
>>>    client = Hbase.Client(protocol)
>>>    transport.open()
>>>
>>> So using buffered transport, but no specific hbase API calls to set auto
>>> flush or other params. This is in CPython.
>>>
>>> - HBase API version #1:
>>> Written in Jython, this is substantially faster, doing 20m records in 70
>>> minutes, or 4 per ms.  This performance scales up to at least 6
>>> processes.
>>> - HBase API version #2:
>>> Slightly smarter, I now call:
>>> table.setAutoFlush(False)
>>> table.setWriteBufferSize(1024*1024*12)
>>>
>>> And my speed jumps up to between 30 and 50 inserts per ms, scaling to at
>>> least 6 concurrent processes.
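[The jump from setAutoFlush(False) plus a 12 MB write buffer comes from batching many small puts into one round trip instead of one RPC per put. A minimal, HBase-free sketch of that client-side buffering idea (`BufferedWriter` is a toy class invented for illustration, not the real HTable API):]

```python
class BufferedWriter:
    """Toy illustration of client-side write buffering: collect puts
    until the buffer passes a size threshold, then hand them off as one
    batch (one round trip instead of one per put)."""

    def __init__(self, flush_fn, buffer_size=12 * 1024 * 1024):
        self.flush_fn = flush_fn       # sends one batch, e.g. one RPC
        self.buffer_size = buffer_size
        self.buf, self.nbytes = [], 0

    def put(self, key: bytes, value: bytes):
        self.buf.append((key, value))
        self.nbytes += len(key) + len(value)
        if self.nbytes >= self.buffer_size:
            self.flush()

    def flush(self):
        if self.buf:
            self.flush_fn(self.buf)
            self.buf, self.nbytes = [], 0

# Demo with a tiny 64-byte buffer: 10 puts of 26 bytes each
# collapse into 4 batches instead of 10 round trips.
batches = []
w = BufferedWriter(batches.append, buffer_size=64)
for i in range(10):
    w.put(b"row%03d" % i, b"x" * 20)   # 6-byte key + 20-byte value
w.flush()                              # don't forget the final flush
print(len(batches))  # 4
```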
>>>
>>> I then rewrote this stuff into a map-reduce and I can now insert 440m
>>> records in about 70-80 minutes.
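[Sanity-checking the throughput figures in the thread with plain arithmetic (75 minutes is taken as the midpoint of the quoted 70-80 minute range):]

```python
# Inserts per millisecond for each of the three versions described above.
def per_ms(records, minutes):
    return records / (minutes * 60 * 1000)

print(per_ms(20_000_000, 6 * 60))   # Thrift gateway: ~0.93/ms
print(per_ms(20_000_000, 70))       # HBase API v1:   ~4.8/ms ("4 per ms")
print(per_ms(440_000_000, 75))      # map-reduce:     ~98/ms
```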
>>>
>>> As I move forward, I will be emulating Bigtable and using either Thrift
>>> serialized records or protobufs to store data in cells.  This allows you
>>> to extend the data within individual cells in a forward/backward
>>> compatible way.  Until compression is super solid, I would be wary of
>>> storing text (xml, html, etc) in hbase due to size concerns.
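[The compatibility point is the key design win here: both Thrift and protobuf let an old reader skip fields it does not know about, so new fields can be added to cell values without rewriting existing data. A toy stand-in using a plain dict (not the real Thrift/protobuf wire format) illustrates the contract:]

```python
# Toy model of the forward/backward-compatibility contract that
# Thrift/protobuf serialization gives you for data stored in cells.

def read_v1(cell: dict) -> tuple:
    # "v1" reader: pick out the fields it knows, ignore everything else.
    # This ignore-unknown-fields rule is what keeps v2 data readable.
    return cell["url"], cell["fetched_at"]

v1_cell = {"url": "http://example.org", "fetched_at": 1232000000}
v2_cell = dict(v1_cell, content_type="text/html")  # field added later

print(read_v1(v1_cell) == read_v1(v2_cell))  # True: old reader still works
```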
>>>
>>>
>>> The hardware:
>>> - 4 cpu, 128 gb ram
>>> - 1 tb disk
>>>
>>> Here are some relevant configs:
>>> hbase-env.sh:
>>> export HBASE_HEAPSIZE=5000
>>>
>>> hadoop-site.xml:
>>> <property>
>>>   <name>dfs.datanode.socket.write.timeout</name>
>>>   <value>0</value>
>>> </property>
>>>
>>> <property>
>>>   <name>dfs.datanode.max.xcievers</name>
>>>   <value>2047</value>
>>> </property>
>>>
>>> <property>
>>>   <name>dfs.datanode.handler.count</name>
>>>   <value>10</value>
>>> </property>
>>>
>>>
>>> On Wed, Jan 14, 2009 at 11:30 PM, tim robertson
>>> <timrobertson100@gmail.com> wrote:
>>>
>>>       
>>>> Hi all,
>>>>
>>>> I was skyping in yesterday from Europe.
>>>> Being half asleep and on a bad wireless connection, it was not easy to
>>>> hear at times, and I have some quick questions for the person who was
>>>> describing his tab-file (CSV?) loading at the beginning.
>>>>
>>>> Could you please quickly summarise the stats you mentioned again?
>>>> Number of rows, file size before loading, was it 7 Strings per row,
>>>> size after load, time to load, etc.?
>>>>
>>>> Also, could you please quickly summarise your cluster hardware (spec,
>>>> ram + number nodes)?
>>>>
>>>> What did you find sped it up?
>>>>
>>>> How many columns per family were you using, and did this affect much?
>>>> (Presumably fewer columns mean fewer region splits, right?)
>>>>
>>>> The reason I ask is that I have around 50G in a tab file (representing
>>>> 162M rows from mysql with around 50 fields - mostly strings of <20 chars
>>>> and ints) and will be loading HBase with this.  Once this initial import
>>>> is done, I will then harvest XML and tab files into HBase directly
>>>> (storing the raw XML record or tab-file row as well).
>>>> I am in early testing (awaiting hardware and fed up with EC2), so I am
>>>> still running code on my laptop with small tests.  I have 6 Dell boxes (2
>>>> proc, 5G memory, SCSI?) being freed up in 3-4 weeks and wonder what
>>>> performance I will get.
>>>>
>>>> Thanks,
>>>>
>>>> Tim
>>>>

