hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: HBase Schema for IPTC News ML G2
Date Mon, 03 Mar 2014 13:02:19 GMT
There seems to be some misunderstanding. 

Column families need to be defined when the table is created, so a separate column family per version is not workable.
My suggestion was to have one column family called version.
Each row in the table would hold its version number (1, 2, 3, etc.) in the version column family,
and the article details in the other column family.
At query time, you specify a filter on the version column family to find the latest version, and the
other column family is loaded only for the rows that pass the filter.
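
To make the layout above concrete, here is a minimal in-memory sketch in plain Java. It is a model of the idea only, not the HBase client API: the class name, the family names "version" and "details", the qualifier "v", and the row key format (messageID, '#', zero-padded version) are all assumptions made for illustration. It reads only the small version family while searching for the latest version, then touches the details family once, which is the behaviour HBASE-5416 makes possible on a real table.

```java
import java.util.Collections;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Optional;
import java.util.TreeMap;

/** In-memory model of the two-family layout. NOT the HBase client API. */
class NewsTableSketch {
    // Family names are fixed up front, mirroring the rule that column
    // families are declared at table creation. Both names are assumptions.
    static final String VERSION_FAMILY = "version";
    static final String DETAILS_FAMILY = "details";

    // rowKey -> family -> qualifier -> value, kept sorted by row key.
    private final NavigableMap<String, Map<String, Map<String, String>>> rows =
            new TreeMap<>();

    // Counts reads of the (large) details family, to make lazy loading visible.
    int detailsLoads = 0;

    /** Hypothetical key format: messageId + '#' + zero-padded version. */
    static String rowKey(String messageId, int version) {
        return String.format("%s#%010d", messageId, version);
    }

    /** Stores one version of a message as its own row. */
    void put(String messageId, int version, Map<String, String> fields) {
        Map<String, Map<String, String>> families = new TreeMap<>();
        families.put(VERSION_FAMILY,
                Collections.singletonMap("v", Integer.toString(version)));
        families.put(DETAILS_FAMILY, new TreeMap<>(fields));
        rows.put(rowKey(messageId, version), families);
    }

    /** Finds the latest version using only the version family, then loads details. */
    Optional<Map<String, String>> latest(String messageId) {
        String prefix = messageId + "#";
        String bestKey = null;
        int bestVersion = -1;
        // Rows of one message are contiguous because keys share the prefix.
        for (Map.Entry<String, Map<String, Map<String, String>>> e
                : rows.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            int v = Integer.parseInt(e.getValue().get(VERSION_FAMILY).get("v"));
            if (v > bestVersion) {
                bestVersion = v;
                bestKey = e.getKey();
            }
        }
        if (bestKey == null) {
            return Optional.empty();
        }
        detailsLoads++; // details are read for the winning row only
        return Optional.of(rows.get(bestKey).get(DETAILS_FAMILY));
    }
}
```

With versions 1 and 2 of the same message stored, latest(...) returns the fields of version 2 and reads the details family a single time. On a real table the same effect would come from a filter on the version family combined with on-demand column family loading, and how "latest" is determined there is left open in this thread.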

Cheers

On Mar 3, 2014, at 3:22 AM, Jigar Shah <jigar.shah@infodesk.com> wrote:

> Hi Ted,
> 
> Thanks for the reply.
> 
> I am more concerned about the structure: what should the rowKey and column families be? (Is having each version of a news item as a column family a good idea?)
> 
> Will there be any problem if I orient my data in this way?
> 
> |rowKey|          |column families|
> <guid>            <1>               <2>                      <version>
> newsMessageId     someTitle         someTitle
>                   someDescription   changedSomeDescription
> location
> 
> 
> newsMessageID as rowKey, versions of the same news (News XML) as column families, and fields in the XML as columns in the respective version column family.
> 
> If I have a lot of versions for the same message, I will have a lot of column families.
> 
> Does HBase have limitations if I have an undefined or large number of column families?
> 
> Do you think I should orient the data in a different way?
> 
> The system mostly queries the latest version of a news item, but we still need to keep track of all versions of a particular news message.
> 
> Good to know that column families can be lazily loaded, based on column filter.
> 
> Thanks
> Jigar Shah.
> 
> 
> On 03/03/2014 03:48 PM, Ted Yu wrote:
>> When version is in its own column family, you can utilize essential column family support.
>> 
>> See https://issues.apache.org/jira/browse/HBASE-5416
>> 
>> Cheers
>> 
>> On Mar 2, 2014, at 11:31 PM, Jigar Shah <jigar.shah@infodesk.com> wrote:
>> 
>>> I am working in the news processing industry. The current system processes more
>>> than a million articles per week and provides this data in real time to
>>> users. Additionally, it provides search capabilities via Lucene.
>>> 
>>> We convert all news to the standard IPTC NewsML G2
>>> <http://www.iptc.org/site/News_Exchange_Formats/NewsML-G2/> format
>>> before providing it to users (in real time or via search).
>>> 
>>> We have a requirement for a component that provides analytical queries on
>>> news data. I plan to load all this data into HBase and then have MapReduce
>>> jobs compute the analytical queries. Moreover, the current system is built
>>> on PostgreSQL and stores only 3 months of data; anything more than this is big
>>> data, as it doesn't fit on one server.
>>> 
>>> But I am a bit confused about developing a schema for it.
>>> 
>>> Every news article has
>>> 
>>> *"messageID" as guid*, unique id for news message.
>>> *"version" as int,* incremented if newer version of same news message is published.
>>> There are other fields such as location, channels, title, content, source, etc.
>>> 
>>> Current database primary key is a composite of (messageID & version).
>>> 
>>> I thought that I should use "messageID" as "rowKey" in HBase and
>>> "version" as "columnFamily", with the fields of the news as columns (like location,
>>> channels, title, body, sentTimstamp, ...).
>>> 
>>> Is keeping "version" as "columnFamily" a good idea?
>>> 
>>> In reality, a single message may have thousands of versions.
>>> 
>>> Or is there another solution for the case where the database has a composite primary key?
> 
> 
