hbase-user mailing list archives

From Vladimir Rodionov <vrodio...@carrieriq.com>
Subject RE: HBase Schema for IPTC News ML G2
Date Tue, 04 Mar 2014 19:12:14 GMT
HBase supports natural versioning for free: it is the cell's timestamp.

A cell address in an HBase table is the following:

(table, row key, column family, column qualifier, timestamp)

First, on the table, column family, and column qualifier concepts:

A table is similar to an RDBMS table, but it does not have a rigid schema. When you create a table in
HBase you need to specify at least one column family.
A column family groups columns (which are defined by column qualifiers) and stores them together
physically, in that family's own store files. Columns that are frequently used together should be
placed in the same column family for performance reasons.

A column qualifier is similar to an RDBMS column, but HBase does not require ALL qualifiers to
be defined in advance, so rows in an HBase table may have different sets of qualifiers.
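
To make the addressing concrete, here is a minimal in-memory sketch in Python. It is not the HBase client API: the `table` structure, the `put` and `get_latest` helpers, and the `d` family name are all made up for illustration. It only models a cell as (row key, column family, column qualifier, timestamp) -> value, and shows that rows may carry different qualifiers and that the timestamp can serve as the version.

```python
from collections import defaultdict

# Toy model only: row key -> "family:qualifier" -> {timestamp: value}.
# Real HBase keeps cells sorted and on disk; all names here are hypothetical.
table = defaultdict(lambda: defaultdict(dict))

def put(row, family, qualifier, timestamp, value):
    """Store one cell version; the timestamp acts as the version."""
    table[row][f"{family}:{qualifier}"][timestamp] = value

def get_latest(row, family, qualifier, n=1):
    """Return up to n (timestamp, value) pairs, newest first."""
    versions = table[row][f"{family}:{qualifier}"]
    return sorted(versions.items(), reverse=True)[:n]

# Two versions of the same cell, plus a row with a different qualifier set.
put("msg-1", "d", "title", 1, "someTitle")
put("msg-1", "d", "title", 2, "changedTitle")
put("msg-2", "d", "location", 1, "someLocation")

print(get_latest("msg-1", "d", "title", n=2))
# [(2, 'changedTitle'), (1, 'someTitle')]
```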

For your use case, there are two possible approaches:

1. rowkey = messageID, and the version is stored in the timestamp (you can put any value there instead of a time,
or keep the default timestamp)
2. rowkey = combination of messageID and version

Both approaches will give you the ability to query the N latest versions of a message, where N can be any
value >= 1.
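
As a rough sketch of approach 2, the row key could be built as shown here. This is plain Python, not HBase client code, and the encoding is an assumption of mine rather than something HBase prescribes: the 16 raw bytes of the messageID followed by a fixed-width, inverted version number, so that the byte-wise row ordering HBase uses puts the newest version first. The helper names `make_rowkey` and `split_rowkey` and the 32-bit limit are hypothetical.

```python
import struct
import uuid

MAX_VERSION = 2**32 - 1  # assumption: versions fit in an unsigned 32-bit int

def make_rowkey(message_id: str, version: int) -> bytes:
    """Composite key: 16 UUID bytes + 4-byte big-endian inverted version."""
    inverted = MAX_VERSION - version  # newest version gets the smallest suffix
    return uuid.UUID(message_id).bytes + struct.pack(">I", inverted)

def split_rowkey(rowkey: bytes):
    """Recover (message_id, version) from a key built by make_rowkey."""
    (inverted,) = struct.unpack(">I", rowkey[16:])
    return str(uuid.UUID(bytes=rowkey[:16])), MAX_VERSION - inverted

msg = "0bb4b5bd-c06e-400a-8b08-e2b6960dda25"
keys = sorted(make_rowkey(msg, v) for v in (1, 2, 10))

# Rows are ordered by the bytes of the row key, so a scan over the
# messageID prefix would meet the newest version first with this encoding.
print([split_rowkey(k)[1] for k in keys])  # [10, 2, 1]
```

The fixed width matters: if the version were appended as text, "10" would sort before "2". With approach 1, keep in mind that the number of versions HBase retains per cell is bounded by the column family's maximum-versions setting, so that setting has to be at least N.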

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

From: Jigar Shah [jigar.shah@infodesk.com]
Sent: Monday, March 03, 2014 9:24 PM
To: user@hbase.apache.org
Subject: Re: HBase Schema for IPTC News ML G2

Hello Ted,

I can think of an implementation based on the solution you provided.

Current use-case is like this:

Consider an application that receives news messages (XML), each of which has
(messageID & version) and other fields. For the same news message I can get
different versions, usually incremental.

*messageId:0bb4b5bd-c06e-400a-8b08-e2b6960dda25* with *version:1*.
*messageId:0bb4b5bd-c06e-400a-8b08-e2b6960dda25* with *version:2*.

I am currently using postgres, with a composite primary key on
(messageID & version) and other columns stored in a normalized way in

If I design the storage in HBase, what should my rowKey and column
family be, and how should I maintain multiple versions of the same messageID?

I plan to keep *messageId* as the *rowKey* and *version* as the *column
family*, so that at any point in time I can pick one column family (by
version) and get all columns for that version
of a particular message.

But if column families must be pre-defined in HBase, then I think this solution is
not feasible.

Jigar Shah.

On 03/03/2014 06:32 PM, Ted Yu wrote:
> There seems to be some misunderstanding.
> The column families need to be defined at the time of table creation.
> My understanding was that there would be one column family called version.
> Each row in this table would have version number (1, 2, or 3, etc) in version column
> family, along with details in the other column family.
> At query time, you specify a filter to get latest version from version column family
> and load the other column family accordingly.
> Cheers
> On Mar 3, 2014, at 3:22 AM, Jigar Shah<jigar.shah@infodesk.com>  wrote:
>> Hi Ted,
>> Thanks for reply.
>> I am more concerned about structure: what should the rowKey and column families be (will having
>> each version of news as a column family be a good idea?).
>> Will there be any problem if I orient my data in this way?
>> |rowKey|            |column-families|
>> <guid>              <1>                  <2>
>> newsMessageId       someTitle            someTitle
>>                     someDescription      changedSomeDescription
>> location
>> newsMessageID as RowKey, versions of same news (News XML) as column family, fields
>> in XML as columns in respective version column family.
>> If I have a lot of versions for the same message, I will have a lot of column families.
>> Does HBase have some limitations if I have an undefined/large number of column families?
>> Do you think I should orient the data in a different way?
>> The system mostly queries the latest version of news. But we still need to keep track of
>> all versions for a particular news message.
>> Good to know that column families can be lazily loaded, based on column filter.
>> Thanks
>> Jigar Shah.
>> On 03/03/2014 03:48 PM, Ted Yu wrote:
>>> When version is in its own column family, you can utilize essential column family
>>> See https://issues.apache.org/jira/browse/HBASE-5416
>>> Cheers
>>> On Mar 2, 2014, at 11:31 PM, Jigar Shah<jigar.shah@infodesk.com>  wrote:
>>>> I am working in the news processing industry; the current system processes more
>>>> than a million articles per week and provides this data in real time to
>>>> users. Additionally, it provides search capabilities via Lucene.
>>>> We convert all news to the standard IPTC NewsML
>>>> G2 <http://www.iptc.org/site/News_Exchange_Formats/NewsML-G2/> format
>>>> before providing it to users (in real time or via search).
>>>> We have a requirement for a component which provides analytical queries on
>>>> news data. I plan to load all this data into HBase and then have Map-Reduce
>>>> jobs compute the analytical queries. Moreover, the current system is developed
>>>> on postgresql to store only 3 months of data; anything more than this is big
>>>> data, as it doesn't fit on one server.
>>>> But I am a bit confused about developing a schema for it.
>>>> Every news article has
>>>> *"messageID" as guid*, unique id for news message.
>>>> *"version" as int,* incremented if newer version of same news message is
>>>> there are other fields like location, channels, title, content, source etc..
>>>> Current database primary key is a composite of (messageID & version).
>>>> I thought that I should use "messageID" as "rowKey" in HBase and
>>>> "version" as "columnFamily", and all columns will be fields of news (like
>>>> location, channels, title, body, sentTimstamp, ...)
>>>> Is keeping "version" as "columnFamily" a good idea?
>>>> In reality, a single message may have thousands of versions.
>>>> Or is there any other solution when we have a composite primary key in the database?

