hbase-user mailing list archives

From innowireless TaeYun Kim <taeyun....@innowireless.co.kr>
Subject RE: Question on the number of column families
Date Thu, 07 Aug 2014 04:34:14 GMT
Thank you Ted.

But the RowFilter class has no method that can be used to set which column family is essential.
(Actually, no built-in filter class provides such a method.)

So, if I (ever) want to apply the 'dummy' column family technique(?), it seems that I must
do as follows:

- Write my own filter as a subclass of RowFilter.
- In that filter class, override the isFamilyEssential() method to return true only when the name
of the 'dummy' column family is passed as an argument.

HBase then calls the isFamilyEssential() method of my filter object for every column family,
including the 'dummy' column family. As a result, it loads only the 'dummy' column family at first
and filters on the rowkey using the KeyValue objects from the 'dummy' column family HFile(s).

Am I right?

BTW, it would be nice to have a method like 'setEssentialColumnFamilies(byte[][] names)' to
set the essential families manually, since no built-in filter intelligently determines which
column family is essential, except for SingleColumnValueFilter.

Thanks.
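
The load order discussed above (essential families first, the remaining
families only for rows that pass the filter) can be illustrated with a small
toy model in plain Java. This is only a sketch and does not use the HBase API:
the RowKeyFilter interface, the ScanResult class, the "dummy"/"big" family
names, and the byte counting are all assumptions made up for illustration. In
real HBase the scan would also need setLoadColumnFamiliesOnDemand() enabled,
and the actual I/O saving depends on HFile layout rather than on simple value
sizes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class EssentialFamilySketch {

    /** Toy stand-in for the two filter hooks discussed in the thread (not the HBase API). */
    interface RowKeyFilter {
        boolean isFamilyEssential(String family);
        boolean acceptsRowKey(String rowKey);
    }

    /** Outcome of a scan: matching row keys and how many value bytes were "read". */
    static final class ScanResult {
        final List<String> rowKeys = new ArrayList<>();
        long bytesRead = 0;
    }

    /** Two-phase scan over a table modeled as rowKey -> (family -> value). */
    static ScanResult scan(Map<String, Map<String, byte[]>> table, RowKeyFilter filter) {
        ScanResult result = new ScanResult();
        for (Map.Entry<String, Map<String, byte[]>> row : table.entrySet()) {
            // Phase 1: read only the families the filter declares essential.
            for (Map.Entry<String, byte[]> cell : row.getValue().entrySet()) {
                if (filter.isFamilyEssential(cell.getKey())) {
                    result.bytesRead += cell.getValue().length;
                }
            }
            if (!filter.acceptsRowKey(row.getKey())) {
                continue; // row rejected: non-essential families are never read
            }
            // Phase 2: the row passed, so read the remaining families as well.
            for (Map.Entry<String, byte[]> cell : row.getValue().entrySet()) {
                if (!filter.isFamilyEssential(cell.getKey())) {
                    result.bytesRead += cell.getValue().length;
                }
            }
            result.rowKeys.add(row.getKey());
        }
        return result;
    }

    /** Rowkey-prefix filter; a null essentialFamily means "every family is essential". */
    static RowKeyFilter prefixFilter(final String prefix, final String essentialFamily) {
        return new RowKeyFilter() {
            @Override
            public boolean isFamilyEssential(String family) {
                return essentialFamily == null || essentialFamily.equals(family);
            }

            @Override
            public boolean acceptsRowKey(String rowKey) {
                return rowKey.startsWith(prefix);
            }
        };
    }

    /** Four rows, each with a 1-byte 'dummy' family and a 1000-byte 'big' family. */
    static Map<String, Map<String, byte[]>> sampleTable() {
        Map<String, Map<String, byte[]>> table = new TreeMap<>();
        for (String key : new String[] {"a-1", "a-2", "b-1", "b-2"}) {
            Map<String, byte[]> families = new LinkedHashMap<>();
            families.put("dummy", new byte[1]);
            families.put("big", new byte[1000]);
            table.put(key, families);
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, Map<String, byte[]>> table = sampleTable();
        ScanResult everything = scan(table, prefixFilter("a-", null));
        ScanResult onDemand = scan(table, prefixFilter("a-", "dummy"));
        System.out.println(everything.rowKeys + " bytes=" + everything.bytesRead); // [a-1, a-2] bytes=4004
        System.out.println(onDemand.rowKeys + " bytes=" + onDemand.bytesRead);     // [a-1, a-2] bytes=2004
    }
}
```

Both scans return the same rows; in this model the only difference is that
the 'big' family of the two rejected rows is never read when 'dummy' is the
sole essential family.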

-----Original Message-----
From: Ted Yu [mailto:yuzhihong@gmail.com] 
Sent: Thursday, August 07, 2014 12:38 PM
To: user@hbase.apache.org
Subject: Re: Question on the number of column families

bq. While scanning, an entire row will be read even for a rowkey filtering

If you specify an essential column family in your filter, the above would not be true - only
the essential column family would be loaded into memory first. Once the filter passes, the
other families would be loaded.

Cheers


On Wed, Aug 6, 2014 at 4:00 AM, innowireless TaeYun Kim < taeyun.kim@innowireless.co.kr>
wrote:

> Hi Ted,
>
> I have now finished reading the filtering section and the source code of 
> TestJoinedScanners (0.94).
>
> Facts learned:
>
> - While scanning, an entire row will be read even for a rowkey filtering.
> (Since a rowkey is not a physically separate entity and stored in 
> KeyValue object, it's natural. Am I right?)
> - The key API for the essential column family support is 
> setLoadColumnFamiliesOnDemand().
>
> So, now I have questions:
>
> On rowkey filtering, which column family's KeyValue object is read?
> If HBase just reads a KeyValue from a randomly selected (or just the
> first) column family, how is setLoadColumnFamiliesOnDemand() affected? 
> Can HBase select a smaller column family intelligently?
>
> If setLoadColumnFamiliesOnDemand() can be applied to a rowkey 
> filtering, a 'dummy' column family can be used to minimize the scan cost.
>
> Thank you.
>
>
> -----Original Message-----
> From: innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr]
> Sent: Wednesday, August 06, 2014 1:48 PM
> To: user@hbase.apache.org
> Subject: RE: Question on the number of column families
>
> Thank you.
>
> The 'dummy' column will always hold the value '1' (or even an empty 
> string), which only signifies that this row exists. (The real value 
> is in the other 'big' column family.) The value is irrelevant since, 
> with the current schema, the filtering will be done by rowkey components 
> alone. No column value is needed. (I will begin reading the filtering 
> section shortly - it is only 6 pages ahead. Sorry for my premature 
> thoughts.)
>
>
> -----Original Message-----
> From: Ted Yu [mailto:yuzhihong@gmail.com]
> Sent: Wednesday, August 06, 2014 1:38 PM
> To: user@hbase.apache.org
> Subject: Re: Question on the number of column families
>
> bq. add a 'dummy' column family and apply HBASE-5416 technique
>
> Adding a dummy column family is not the way to utilize essential column 
> family support - what would this dummy column family hold?
>
> bq. since I have not read the filtering section of the book I'm 
> reading yet
>
> Once you finish reading, you can look at the unit test
> (TestJoinedScanners) from HBASE-5416. You would understand this 
> feature better.
>
> Cheers
>
>
> On Tue, Aug 5, 2014 at 9:21 PM, innowireless TaeYun Kim < 
> taeyun.kim@innowireless.co.kr> wrote:
>
> > Thank you all.
> >
> > Facts learned:
> >
> > - Having 130 column families is too much. Don't do that.
> > - While scanning, an entire row will be read for filtering, unless the
> > HBASE-5416 technique is applied, which loads only the relevant column 
> > family. (But it seems that one still can't load just the one column 
> > needed while scanning)
> > - A big row size may not be good.
> >
> > Currently it seems appropriate to follow the one-column solution 
> > that Alok Singh suggested, in part since currently there is no 
> > reasonable grouping of the fields.
> >
> > Here is my current thinking:
> >
> > - One column family, one column. Field name will be included in rowkey.
> > - Eliminate filtering altogether (in most cases) by properly ordering 
> > rowkey components.
> > - If a filtering is absolutely needed, add a 'dummy' column family 
> > and apply HBASE-5416 technique to minimize disk read, since the 
> > field value can be large(~5MB). (This dummy column thing may not be 
> > right, I'm not sure, since I have not read the filtering section of 
> > the book I'm reading yet)
> >
> > Hope that I am not missing or misunderstanding something...
> > (I'm a total newbie. I started reading an HBase book only last
> > week...)
> >
> >
> >
> >
> >
> >
>
>

