hbase-user mailing list archives

From William Kang <weliam.cl...@gmail.com>
Subject Re: More on Column Family versus Column
Date Mon, 18 Oct 2010 19:43:31 GMT
Hi Jacques,
No. A block can certainly contain multiple rows.
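
As a rough illustration, here is a minimal Python sketch of how cells might be packed into blocks. It is a simplified model, not HBase's actual writer code: the names Cell, pack_into_blocks, and rows_per_block are made up for this example, and it assumes a cell is never split across blocks and that a block is closed once it reaches the configured size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Hypothetical stand-in for one KeyValue: row key, column, size in bytes."""
    row: str
    qualifier: str
    size: int


def pack_into_blocks(cells, block_size):
    """Greedily pack sorted cells into blocks (simplified model).

    A block is closed as soon as its accumulated size reaches block_size,
    so a block can end up larger than block_size but never splits a cell.
    """
    blocks, current, used = [], [], 0
    for cell in cells:  # cells are assumed to arrive sorted by row, then column
        current.append(cell)
        used += cell.size
        if used >= block_size:
            blocks.append(current)
            current, used = [], 0
    if current:
        blocks.append(current)
    return blocks


def rows_per_block(blocks):
    """Number of distinct rows found in each block."""
    return [len({cell.row for cell in block}) for block in blocks]


BLOCK_SIZE = 64 * 1024  # 64 KB

# Three rows of four 1 KB cells each: everything fits in one block.
small = [Cell(f"row{r}", f"c{q}", 1024) for r in range(3) for q in range(4)]
print(rows_per_block(pack_into_blocks(small, BLOCK_SIZE)))  # [3]

# Three rows with one 150 KB cell each: every block holds a single cell.
big = [Cell(f"row{r}", "data", 150 * 1024) for r in range(3)]
print(rows_per_block(pack_into_blocks(big, BLOCK_SIZE)))  # [1, 1, 1]
```

With small cells, one block ends up holding several rows. With cells larger than the block size, as in your 100-200k data columns, each block holds one cell, so there is nothing to scan past inside it.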


William

On Mon, Oct 18, 2010 at 12:17 PM, Jacques <whshub@gmail.com> wrote:
> I'm trying to work up a reference card to remember this stuff.  Can someone
> confirm or deny the following statements?
>
> Each HBase block can hold at most one row and one column family.  A row may
> span multiple HBase blocks, but an HBase block may only contain one row.
>
> Thanks,
> Jacques
>
> On Fri, Oct 15, 2010 at 8:54 PM, William Kang <weliam.cloud@gmail.com> wrote:
>
>> Hi Jacques,
>> If I understand correctly, it depends on several factors. The first is the
>> configured block size; the second is the typical cell size. Cells are
>> stored in a block as KeyValue pairs, so if the block size is larger than
>> the cell size, a block may hold multiple cells. To locate a KeyValue pair
>> in such a block, you have to scan through the block.
>> That being said, if you have a column family with lots of very small
>> cell values and a large block size, scanning inside the block to locate
>> the wanted cell is going to be slow. But if you have a column family
>> with a few big cells and the block size is only big enough to hold one
>> cell, there is no need to scan within the block.
>> Hope it helps a little.
>>
>>
>> William
>>
>> On Fri, Oct 15, 2010 at 8:27 PM, Jacques <whshub@gmail.com> wrote:
>> > I was hoping for some feedback on a schema design choice we made.
>> >
>> > We are currently using column families to separate out some data in a table
>> > (based on what we've read here and elsewhere).  I'll try to outline the
>> > basics below.
>> >
>> > *Pseudo schema*
>> > metadata column family: multiple metadata columns totaling ~3-5k
>> > data column family 1: single column, 100-200k
>> > data column family 2: same as data column family 1
>> > ...
>> > data column family 1500: same as data column family 1
>> >
>> > General access pattern:
>> > write: main cf + one random data cf.
>> > read: main cf + one random data cf.
>> >
>> > The further we go toward 1500, the sparser the data is.  E.g. every
>> > row has data for cf1, most have it for cf2, and only 1 in a million might
>> > have it for cf1500.
>> > We chose to use column families because we never/rarely change or retrieve
>> > two "data" column families at the same time.  We store this information in
>> > a single row so that we have atomic changes to the dataset.
>> >
>> > Everything is working fine.  However, the discussion earlier this week
>> > about column families made me realize that my understanding of columns
>> > wasn't entirely correct.  I was under the impression that an entire column
>> > family was read when retrieving any column in that family.  It sounds like
>> > this is becoming less true as development moves toward 0.90 and beyond.  I
>> > also noticed that the web status GUI doesn't do tables with many column
>> > families any justice.  This makes me wonder whether people are using tables
>> > with thousands of column families or whether that is very rare.  How do
>> > people accomplish "millions of columns"?  10 families with 100,000 columns
>> > each, or 10,000 families with hundreds of columns each?
>> >
>> > Thanks for any feedback,
>> >
>> > Jacques
>> >
>>
>
