hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: HBase Schema for IPTC News ML G2
Date Mon, 03 Mar 2014 10:18:51 GMT
When version is in its own column family, you can utilize essential column family support.


See https://issues.apache.org/jira/browse/HBASE-5416

Cheers

On Mar 2, 2014, at 11:31 PM, Jigar Shah <jigar.shah@infodesk.com> wrote:

> I work in the news processing industry. Our current system processes more
> than a million articles per week and provides this data to users in real
> time; it also provides search capabilities via Lucene.
> 
> We convert all news to the standard IPTC NewsML-G2 format
> <http://www.iptc.org/site/News_Exchange_Formats/NewsML-G2/>
> before providing it to users (in real time or via search).
> 
> We have a requirement for a component that answers analytical queries on
> news data. I plan to load all of this data into HBase and then run MapReduce
> jobs to compute the analytical queries. Moreover, the current system is built
> on PostgreSQL and stores only 3 months of data; anything more than that is big
> data, as it doesn't fit on one server.
> 
> But I am a bit confused about how to design the schema for it.
> 
> Every news article has
> 
> *"messageID" as guid*, a unique ID for the news message.
> *"version" as int*, incremented when a newer version of the same news message is published.
> There are other fields such as location, channels, title, content, source, etc.
> 
> The current database primary key is a composite of (messageID, version).
> 
> I thought that I should use "messageID" as the "rowKey" in HBase and
> "version" as the "columnFamily", with the news fields (location, channels,
> title, body, sentTimestamp, ...) as the columns.
> 
> Is keeping "version" as the "columnFamily" a good idea?
> 
> In reality, a single message may have thousands of versions.
> 
> Or is there another solution for when the database has a composite primary key?
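
For the composite primary key question above, one commonly used alternative is to fold both parts of the key into the row key instead of mapping "version" to a column family. Column families have to be declared in the table schema, and the HBase reference guide advises keeping their number small, so thousands of versions per message would be a poor fit for families. The following is only a minimal sketch in plain Java (no HBase dependency) and not something proposed in this thread: the class name NewsRowKey, the 20-byte layout, and the choice to store the version inverted so that the newest version sorts first are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper: builds a composite row key from (messageID, version).
// Assumed layout: 16 bytes of GUID followed by 4 bytes of inverted version.
class NewsRowKey {
    static final int KEY_LENGTH = 16 + 4;

    // Encodes messageID + (Integer.MAX_VALUE - version), big-endian.
    // Inverting the version makes higher versions sort first within a message.
    static byte[] encode(UUID messageId, int version) {
        if (version < 0) {
            throw new IllegalArgumentException("version must be non-negative");
        }
        return ByteBuffer.allocate(KEY_LENGTH)
                .putLong(messageId.getMostSignificantBits())
                .putLong(messageId.getLeastSignificantBits())
                .putInt(Integer.MAX_VALUE - version)
                .array();
    }

    // Recovers the messageID from the first 16 bytes of the key.
    static UUID messageId(byte[] rowKey) {
        ByteBuffer buf = ByteBuffer.wrap(rowKey);
        return new UUID(buf.getLong(), buf.getLong());
    }

    // Recovers the version from the last 4 bytes of the key.
    static int version(byte[] rowKey) {
        return Integer.MAX_VALUE - ByteBuffer.wrap(rowKey).getInt(16);
    }

    // Unsigned lexicographic byte comparison, the order HBase uses for row keys.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        byte[] older = encode(id, 1);
        byte[] newer = encode(id, 2);
        // A negative result means the newer version sorts before the older one.
        System.out.println(compareUnsigned(newer, older) < 0);
    }
}
```

With a key of this shape, all versions of one message share the same 16-byte prefix and are stored next to each other, so a scan starting at that prefix would reach the latest version first. The news fields (location, channels, title, and so on) could then live as columns in one or two fixed column families.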
