hbase-user mailing list archives

From Jigar Shah <jigar.s...@infodesk.com>
Subject Re: HBase Schema for IPTC News ML G2
Date Tue, 04 Mar 2014 05:24:17 GMT
Hello Ted,

I can think of an implementation based on the solution you provided.

Current use-case is like this:

Consider an application that receives news messages (XML), each carrying a
messageID, a version, and other fields. For the same news message I can
receive different versions, usually incremental.

e.g:
*messageId:0bb4b5bd-c06e-400a-8b08-e2b6960dda25* with *version:1*.
*messageId:0bb4b5bd-c06e-400a-8b08-e2b6960dda25* with *version:2*.

I am currently using Postgres, with a composite primary key of
(messageID & version) and the other columns stored in a normalized way in
the database.

If I design the storage in HBase, what should my rowKey and column
family be, and how should I maintain multiple versions of the same messageID?

I plan to keep *messageId* as the *rowKey* and *version* as the *column
family*, so that at any point I can pick one column family (by
version) and get all columns for that version of a particular message.

But if column families must be pre-defined in HBase, then I think this
solution is not feasible.
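
For comparison, one commonly used alternative (not something proposed so far in this thread) is to fold the version into the row key itself, mirroring the Postgres composite key. Below is a minimal sketch in plain Python with no HBase client involved. The names `make_row_key` and `parse_row_key` and the 32-bit version limit are hypothetical choices for illustration only. Since HBase orders rows by byte-wise comparison of row keys, storing the version inverted should make the newest version sort first among the rows of one messageId.

```python
import struct
import uuid
from typing import Tuple

# Assumption for this sketch: versions fit in a signed 32-bit integer.
MAX_VERSION = 2**31 - 1


def make_row_key(message_id: str, version: int) -> bytes:
    """Build a 20-byte key: 16-byte GUID followed by a 4-byte inverted version."""
    if not 0 <= version <= MAX_VERSION:
        raise ValueError("version out of range for this sketch")
    guid_bytes = uuid.UUID(message_id).bytes
    # Big-endian keeps numeric order equal to byte-wise order;
    # inverting the version makes higher versions sort first.
    return guid_bytes + struct.pack(">I", MAX_VERSION - version)


def parse_row_key(row_key: bytes) -> Tuple[str, int]:
    """Recover (messageId, version) from a key built by make_row_key."""
    message_id = str(uuid.UUID(bytes=row_key[:16]))
    (inverted,) = struct.unpack(">I", row_key[16:])
    return message_id, MAX_VERSION - inverted


msg = "0bb4b5bd-c06e-400a-8b08-e2b6960dda25"
# Python compares bytes lexicographically, like HBase does for row keys.
keys = sorted(make_row_key(msg, v) for v in (1, 2, 3))
latest = parse_row_key(keys[0])
```

With a key like this, a single fixed column family could hold the message fields, and a scan restricted to the 16-byte messageId prefix would return the versions newest first.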

Thanks,
Jigar Shah.


On 03/03/2014 06:32 PM, Ted Yu wrote:
> There seems to be some misunderstanding.
>
> The column families need to be defined at the time of table creation.
> My understanding was that there would be one column family called version.
> Each row in this table would have version number (1, 2, or 3, etc) in version column family, along with details in the other column family.
> At query time, you specify a filter to get latest version from version column family and load the other column family accordingly.
>
> Cheers
>
> On Mar 3, 2014, at 3:22 AM, Jigar Shah<jigar.shah@infodesk.com>  wrote:
>
>> Hi Ted,
>>
>> Thanks for reply.
>>
>> I am more concerned about structure, what should be rowKey and column families (having each version of news as a column family will be a good idea ?).
>>
>> Will there be any problem if i orient my data in this way.
>>
>> | rowKey |         | column families |
>> <guid>            <1>              <2>                     <version>
>> newsMessageId     someTitle        someTitle
>>                   someDescription  changedSomeDescription
>> location
>>
>>
>> newsMessageID as RowKey, versions of same news (News XML) as column family, fields in XML as columns in respective version column family.
>>
>> If i have lot of versions for same message, I will have lot of column families.
>>
>> Does HBase have some limitations if i have undefined/large number of column families.
>>
>> Do you think i should orient data in different way ?
>>
>> System mostly queries latest version of news. But still we need to keep track of all versions for particular news message.
>>
>> Good to know that column families can be lazily loaded, based on column filter.
>>
>> Thanks
>> Jigar Shah.
>>
>>
>> On 03/03/2014 03:48 PM, Ted Yu wrote:
>>> When version is in its own column family, you can utilize essential column family support.
>>>
>>> See https://issues.apache.org/jira/browse/HBASE-5416
>>>
>>> Cheers
>>>
>>> On Mar 2, 2014, at 11:31 PM, Jigar Shah<jigar.shah@infodesk.com>  wrote:
>>>
>>>> I am working in the news processing industry; our current system processes more
>>>> than a million articles per week and provides this data in real time to
>>>> users. Additionally, it provides search capabilities via Lucene.
>>>>
>>>> We convert all news to a standard IPTC NewsML G2
>>>> <http://www.iptc.org/site/News_Exchange_Formats/NewsML-G2/> format,
>>>> before providing it to users (in real-time or via search)
>>>>
>>>> We have a requirement for a component which provides analytical queries on
>>>> news data. I plan to load all this data in HBase and then have Map-Reduce
>>>> jobs compute the analytical queries. Moreover, the current system is developed
>>>> on PostgreSQL to store only 3 months of data; anything more than this is big
>>>> data, as it doesn't fit on one server.
>>>>
>>>> But i am bit confused in developing schema for it.
>>>>
>>>> Every news article has
>>>>
>>>> *"messageID" as guid*, unique id for news message.
>>>> *"version" as int,* incremented if newer version of same news message is published.
>>>> there are other fields like location, channels, title, content, source etc..
>>>>
>>>> Current database primary key is a composite of (messageID & version).
>>>>
>>>> I thought that, i should use "messageID" as "rowKey" in HBase. and
>>>> "version" as "columnFamily" and all columns will be fields of news (like
>>>> location, channels ,title, body, sentTimstamp, ...)
>>>>
>>>> Keeping "version" as "columnFamily" is a good idea ?
>>>>
>>>> In reality "single message may have thousands of version".
>>>>
>>>> Or if any other solution when we have composite primary key in database.

