hbase-user mailing list archives

From Jigar Shah <jigar.s...@infodesk.com>
Subject Re: HBase Schema for IPTC News ML G2
Date Mon, 03 Mar 2014 09:19:48 GMT
Hello Boaretto, Ricardo,

Thanks for the reply.

Queries on older versions of a message are less frequent. The application
provides a flag "ignoreOldRevisions" (default value is 'true').

The latest versions matter more in general, but the system still needs
to keep track of all versions received for a particular message.

Jigar Shah.

On 03/03/2014 01:55 PM, Boaretto, Ricardo wrote:
> Hi,
> How frequently do you need to query older versions of a message?
> Regards,
> Ricardo Boaretto.
> On Mar 3, 2014 4:31 AM, "Jigar Shah" <jigar.shah@infodesk.com> wrote:
>> I work in the news processing industry. Our current system processes more
>> than a million articles per week and provides this data to users in real
>> time; it also provides search capabilities via Lucene.
>> We convert all news to the standard IPTC NewsML G2 format
>> <http://www.iptc.org/site/News_Exchange_Formats/NewsML-G2/>
>> before providing it to users (in real time or via search).
>> We need a component that answers analytical queries on the news data.
>> I plan to load all of this data into HBase and then run MapReduce jobs to
>> compute the analytical queries. Moreover, the current system is built on
>> PostgreSQL and stores only 3 months of data; anything more than that is
>> big data, as it doesn't fit on one server.
>> But I am a bit confused about how to design the schema for it.
>> Every news article has:
>> *"messageID" as guid*, a unique ID for the news message.
>> *"version" as int*, incremented when a newer version of the same news
>> message is published.
>> There are other fields such as location, channels, title, content, source,
>> etc.
>> The current database primary key is a composite of (messageID, version).
>> I thought I should use "messageID" as the "rowKey" in HBase and
>> "version" as the "columnFamily", with all columns being fields of the
>> news (like location, channels, title, body, sentTimstamp, ...).
>> Is keeping "version" as the "columnFamily" a good idea?
>> In reality, a single message may have thousands of versions.
>> Or is there another solution for when we have a composite primary key in
>> the database?
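
One possible alternative to "version" as a column family, sketched here only as an illustration and not as anything proposed in the thread: fold the version into the row key. HBase sorts row keys as raw bytes, and column families are fixed in the table schema and generally kept few, so thousands of versions map more naturally onto rows than onto families. The sketch below is plain Python (no HBase client) and assumes that "messageID" is a standard GUID string and that "version" fits in a signed 32-bit int; the helper names row_key and parse_row_key are hypothetical. The version is stored inverted so that the newest revision of a message sorts first, which would suit a default such as ignoreOldRevisions = true.

```python
import struct
import uuid
from typing import Tuple

# Assumption: "version" is a non-negative signed 32-bit int.
MAX_VERSION = 2**31 - 1


def row_key(message_id: str, version: int) -> bytes:
    """Build a 20-byte composite key: 16-byte GUID + inverted 4-byte version."""
    if not 0 <= version <= MAX_VERSION:
        raise ValueError("version out of range")
    guid = uuid.UUID(message_id).bytes  # fixed width keeps keys comparable
    # Big-endian so byte order matches numeric order; inverted so that
    # higher versions produce smaller byte strings and sort first.
    inverted = struct.pack(">I", MAX_VERSION - version)
    return guid + inverted


def parse_row_key(key: bytes) -> Tuple[str, int]:
    """Recover (messageID, version) from a key built by row_key."""
    guid = str(uuid.UUID(bytes=key[:16]))
    (inverted,) = struct.unpack(">I", key[16:20])
    return guid, MAX_VERSION - inverted


if __name__ == "__main__":
    mid = "123e4567-e89b-12d3-a456-426614174000"  # made-up example GUID
    # Sorting the raw bytes mimics the lexicographic order of row keys.
    for key in sorted(row_key(mid, v) for v in (1, 10, 2)):
        print(parse_row_key(key))
```

With keys laid out this way, all revisions of one message are adjacent, and the first row found under a given messageID prefix is its latest version. The fields (location, channels, title, ...) would then be ordinary columns in a single column family.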
