hbase-user mailing list archives

From "Boaretto, Ricardo" <rboare...@gmail.com>
Subject Re: HBase Schema for IPTC News ML G2
Date Mon, 03 Mar 2014 08:25:57 GMT

How frequently do you need to query older versions of a message?

Ricardo Boaretto.
On Mar 3, 2014 4:31 AM, "Jigar Shah" <jigar.shah@infodesk.com> wrote:

> I work in the news processing industry. Our current system processes more
> than a million articles per week and provides this data to users in real
> time; additionally, it provides search capabilities via Lucene.
> We convert all news to the standard IPTC NewsML G2 format
> <http://www.iptc.org/site/News_Exchange_Formats/NewsML-G2/>
> before providing it to users (in real time or via search).
> We have a requirement for a component that provides analytical queries on
> news data. I plan to load all of this data into HBase and then run MapReduce
> jobs to compute the analytical queries. Moreover, the current system is built
> on PostgreSQL and stores only 3 months of data; anything more than this is big
> data, as it doesn't fit on one server.
> But I am a bit confused about how to develop a schema for it.
> Every news article has:
> *"messageID" as guid*, a unique id for the news message.
> *"version" as int*, incremented if a newer version of the same news message is
> published.
> There are other fields like location, channels, title, content, source,
> etc.
> The current database primary key is a composite of (messageID & version).
> I thought that I should use "messageID" as the "rowKey" in HBase and
> "version" as the "columnFamily", with all columns being fields of the news
> (like location, channels, title, body, sentTimstamp, ...).
> Is keeping "version" as the "columnFamily" a good idea?
> In reality, a "single message may have thousands of versions".
> Or is there any other solution when we have a composite primary key in the
> database?
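
For illustration only, here is a minimal sketch of one alternative to the layout asked about above: folding both parts of the composite primary key into the row key instead of using "version" as a column family. It relies on two general HBase properties: rows are sorted lexicographically by row key bytes, and column families are declared as part of the table definition, so thousands of per-version families would be impractical. The function names, the 32-bit version limit, and the choice to invert the version so the newest sorts first are assumptions made for this example; they are not taken from the thread, and no HBase client call is shown.

```python
import struct
import uuid
from typing import Tuple

# Assumption: versions fit in a signed 32-bit integer.
MAX_VERSION = 2**31 - 1


def make_row_key(message_id: str, version: int) -> bytes:
    """Build a 20-byte row key: 16-byte GUID + 4-byte inverted version."""
    if not 0 <= version <= MAX_VERSION:
        raise ValueError("version out of range")
    guid_bytes = uuid.UUID(message_id).bytes
    # Storing (MAX_VERSION - version) big-endian makes the newest version
    # sort first among rows that share the same messageID prefix.
    return guid_bytes + struct.pack(">I", MAX_VERSION - version)


def parse_row_key(row_key: bytes) -> Tuple[str, int]:
    """Recover (messageID, version) from a key built by make_row_key."""
    guid = uuid.UUID(bytes=row_key[:16])
    (inverted,) = struct.unpack(">I", row_key[16:20])
    return str(guid), MAX_VERSION - inverted
```

With keys of this shape, all versions of one message are adjacent in sort order, so a scan starting at the 16-byte messageID prefix would meet the newest version first. Whether this suits the workload depends on how often older versions are read, which is the question raised in the reply above.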
