hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: Advices for HTable schema
Date Tue, 03 Jul 2012 14:03:15 GMT
Comparisons are fine. 

Try not to think of this in terms of rows and columns, but in terms of records.
Think of each record as being atomic.
Create a list of all of the components that make up that record.
Then combine like components into structures.

Like the Street Address.  Add in a couple of fields to suggest when the person lived there.
If there is no end date, it must be a current address. 
You could put them in an array; however, arrays imply a finite size. An ordered set or list
would be more appropriate.
Each of these structures then becomes a column. 
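
A minimal sketch of the record-as-structures idea, written in plain Python rather than against the HBase client API. The field names (`street`, `city`, `start`, `end`), the column names, and the choice of JSON as the encoding are assumptions made for illustration; the physical-attributes and aliases columns come from the earlier message quoted below. Each structure is serialized to bytes and would be stored as one column value, and an address with no end date is treated as current.

```python
import json
from datetime import date


def address(street, city, start, end=None):
    """One address structure; `end=None` means the person still lives there."""
    return {
        "street": street,
        "city": city,
        "start": start.isoformat(),
        "end": end.isoformat() if end else None,
    }


def build_columns(physical, addresses, aliases):
    """Map hypothetical column names to the serialized bytes for one record."""
    # Keep the addresses as an ordered list (oldest first) rather than a
    # fixed-size array. ISO date strings sort chronologically.
    ordered = sorted(addresses, key=lambda a: a["start"])
    return {
        b"physical": json.dumps(physical, sort_keys=True).encode("utf-8"),
        b"addresses": json.dumps(ordered).encode("utf-8"),
        b"aliases": json.dumps(aliases).encode("utf-8"),
    }


def current_addresses(columns):
    """Decode the addresses column and keep entries without an end date."""
    decoded = json.loads(columns[b"addresses"].decode("utf-8"))
    return [a for a in decoded if a["end"] is None]
```

Updating one address here means rewriting only the `addresses` column, not the whole record, which is the point of segmenting the record into logical blocks.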

On Jul 3, 2012, at 7:31 AM, Jean-Marc Spaggiari wrote:

> Hi Michael,
> I'm trying to dive deeply into HBase and forget all my RDBMS knowledge,
> but sometimes it's difficult not to compare, and I don't yet have
> the right way of thinking. The more Amandeep replied
> yesterday, the clearer it became, but it seems I still have a LOT to
> learn.
> I will never update one single value from the data I have. I will
> update all the columns for one row, or none. When I need to read
> them, I usually need to read all of them, or almost all. Not just one.
> I moved to a multiple-column architecture because I built the
> application with MySQL first, but the more I read, the more I see that
> it's not the right way.
> I can have 2 tables.
> One with a key made from the person ID, and only one single CF and one
> C, with everything in a single cell stored as JSON output
> serialized using Avro, as you are suggesting.
> And a second table with rows like PERSONID_PERSONADDRESS with a dummy
> CF and C just to keep one cell.
> In the end, that will meet all my needs, but it will require a bit more
> thought. And it's so far from the initial design! But I think that's
> definitely a good solution.
> Thanks!
> JM
> 2012/7/3, Michael Segel <michael_segel@hotmail.com>:
>> Hi,
>> You're overthinking this.
>> Take a step back and remember that you can store anything you want as a byte
>> stream in a column.
>> Literally.
>> So you have a record that could be a text blob. Store it in one column. Use
>> JSON to define its structure and fields.
>> The only thing that makes it difficult is that you will need to pull out
>> everything just to insert or update something.
>> So then maybe segment your data into logical blocks. Like a column that
>> stores the physical attributes of the person.
>> Another column that stores the list of addresses for the person.
>> Another column that stores the list of aliases used by the person.
>> Don't think in relational terms. HBase isn't relational and ER is not the
>> best way to model in a NoSQL database.
>> Think IMS/COBOL (mainframe) or Dick Pick's Revelation's OS.
>> The only relationships in HBase are weak relationships between tables.
>> Column families currently have some nasty side effects, so you may want to
>> consider carefully how you apply them.
>> Think in terms of records.
>> Look at storing data using Avro.
>> On Jul 2, 2012, at 8:56 PM, Jean-Marc Spaggiari wrote:
>>> 2012/7/2, Amandeep Khurana <amansk@gmail.com>:
>>>>> Here are the 2 options now. Both with a new table.
>>>>> 1) I store the key "personID" and a:a1 to a:an for the addresses.
>>>>> 2) I store the key "personID" + "address
>>>>> In both I will have the same amount of data. In #1 total size will be
>>>>> smaller since the key will be stored only once.
>>>> The size will be the same. The underlying HFile will store 1 row per
>>>> cell and the number of cells in both cases is the same.
>>>> However, the first approach with multiple columns for addresses needs
>>>> you to keep track of the number and makes updates, deletes, additions
>>>> complicated as I highlighted earlier. The second option with putting
>>>> both things in the key makes life much easier.
>>>> If the data is primarily being accessed independently, I'd go with
>>>> option 2.
>>> Oh! I see! My misunderstanding comes from my lack of HBase
>>> knowledge/reflexes. I forgot it was storing the data that way. So I
>>> think I will most probably give this 2nd option a try! Thanks for
>>> sharing your ideas throughout the day.
>>> JM
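
Amandeep's point that both options take about the same space can be checked with a rough model of how cells are laid out. The sketch below is an approximation, not the real HFile format: it assumes each cell carries its own copy of the row key, column family, qualifier, and an 8-byte timestamp next to the value, and it ignores length prefixes, the type byte, compression, and block encoding. The family name `a`, the dummy qualifier `c`, and the `_` separator are hypothetical choices.

```python
TIMESTAMP_BYTES = 8  # simplified: one long per cell


def cell_size(row, family, qualifier, value):
    """Approximate bytes for one cell; the row key is repeated in every cell."""
    return len(row) + len(family) + len(qualifier) + TIMESTAMP_BYTES + len(value)


def option1_cells(person_id, addresses):
    """Option 1: one row per person, columns a:a1 .. a:an hold the addresses."""
    return [
        (person_id, b"a", b"a%d" % i, addr)
        for i, addr in enumerate(addresses, start=1)
    ]


def option2_cells(person_id, addresses):
    """Option 2: one row per (person, address); the address lives in the key."""
    return [(person_id + b"_" + addr, b"a", b"c", b"") for addr in addresses]


def total_size(cells):
    """Sum the approximate size of every cell."""
    return sum(cell_size(*cell) for cell in cells)
```

Under this model the two layouts produce the same number of cells, and the person ID is stored once per cell in both, so option 1 does not save the key bytes. Totals differ only by the lengths of the qualifiers and the separator.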
