hadoop-common-user mailing list archives

From stack <st...@duboce.net>
Subject Re: File Per Column in Hadoop
Date Tue, 11 Mar 2008 16:04:55 GMT
Richard K. Turner wrote:
> To get the data in separate files, I would need to put each column in its own column
> family using a row id as the key.  Each column family will end up as a separate file.  HBase
> will sort each column family independently, so if I had 100 columns I would be doing 100 times
> more sorting than I need to do.  I believe all of this sorting would make insert rates really
>
> HBase supports an arbitrary number of columns per row in a column family.  To do this
> each row value has <col name>=<col value> pairs.  For my case this is unnecessary
> overhead as I would only have one column name per column family.

For sure there is a cost to keeping columns in a manner that facilitates
row-based accesses -- fatter keys and sort/compactions -- and that
allows on-the-fly cell-level updates.  If your access pattern is purely
columnar and your data is static, there is little sense in paying
that overhead.
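
To make the "fatter keys" cost concrete, here is a rough Python sketch.  It is a toy model only, not HBase's actual on-disk format: the sample columns, the row ids, and the `row/col=value` text encoding are all made up for illustration.  It compares a layout where every cell carries its row id and column name against one value stream per column, where the row is implied by position.

```python
import io


def keyed_cells(rows, columns):
    """Row-friendly layout: every cell repeats its row id and column name."""
    out = io.BytesIO()
    for row_id, values in rows:
        for name, value in zip(columns, values):
            out.write(f"{row_id}/{name}={value}\n".encode())
    return out.getvalue()


def column_files(rows, columns):
    """Purely columnar layout: one value stream per column; position is the row."""
    buffers = {name: io.BytesIO() for name in columns}
    for _row_id, values in rows:
        for name, value in zip(columns, values):
            buffers[name].write(f"{value}\n".encode())
    return {name: buf.getvalue() for name, buf in buffers.items()}


# Hypothetical sample data: 1000 rows, three columns.
columns = ["user", "bytes", "status"]
rows = [(f"row{i:06d}", (f"u{i % 7}", i * 10, 200)) for i in range(1000)]

keyed = keyed_cells(rows, columns)
per_column = column_files(rows, columns)

keyed_size = len(keyed)
columnar_size = sum(len(data) for data in per_column.values())
print("keyed cells:", keyed_size, "bytes")
print("column files:", columnar_size, "bytes")
```

The keyed layout is what lets you look up or update a single cell by row; the per-column streams drop that ability in exchange for smaller files, which only makes sense for static data that is read a column at a time.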

> It seems that when a map reduce job is run against HBase that it reads input through
> the HBase server.  I suspect reading gzip files off local disk is much faster, but I am not

Yes.  We're doing our best to minimize the tax you pay accessing via
HBase, but we have some way to go yet.  You can get some sense of it
from the table at the end of this page,
http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation, where we
compare accesses that go direct against (non-compressed) mapfiles
with the same accesses via HBase.


> -----Original Message-----
> From: Ted Dunning [mailto:tdunning@veoh.com]
> Sent: Mon 3/10/2008 2:57 PM
> To: core-user@hadoop.apache.org
> Subject: Re: File Per Column in Hadoop
> Have you looked at hbase?  It looks like you are trying to reimplement a
> bunch of it.
> On 3/10/08 11:01 AM, "Richard K. Turner" <rkt@petersontechnology.com> wrote:
>> ... [storing data in columns is nice] ... I would also do the same for dir
>> csv_file2.  Does anyone know how to do this
>> in Hadoop?
