hbase-user mailing list archives

From James Taylor <jtay...@salesforce.com>
Subject Re: HBase Schema for IPTC News ML G2
Date Mon, 03 Mar 2014 21:32:33 GMT
Hi Jigar,
Take a look at Apache Phoenix: http://phoenix.incubator.apache.org/
It allows you to use SQL to query over your HBase data and supports
composite primary keys, so you could create a schema like this:

create table news_message(guid varchar not null, version bigint not null,
    constraint pk primary key (guid, version desc));

The rows will then sort by guid, then by version descending. You can then
issue SQL queries directly against your HBase data without writing MapReduce
jobs. Note that we don't yet support all the SQL constructs that PostgreSQL does.
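
To illustrate why a (guid, version desc) key keeps the newest version of each
message first, here is a minimal Python sketch of a byte-ordered composite row
key. The function name row_key, the zero-byte separator, and the sample guids
are assumptions made for illustration; this is a simplified model of the idea,
not necessarily Phoenix's exact on-disk encoding.

```python
import struct

def row_key(guid: str, version: int) -> bytes:
    """Hypothetical composite key: guid bytes, a zero separator, then the
    version encoded so that larger versions sort first."""
    # Shift the signed 64-bit value into unsigned range so that plain
    # byte comparison matches numeric comparison (ascending).
    ascending = struct.pack(">Q", (version + (1 << 63)) & 0xFFFFFFFFFFFFFFFF)
    # Invert every byte to reverse the order (descending).
    descending = bytes(b ^ 0xFF for b in ascending)
    # Assumes guids never contain a zero byte, so the separator is unambiguous.
    return guid.encode("utf-8") + b"\x00" + descending

# Sample (guid, version) pairs; the values are made up for this sketch.
rows = [("msg-b", 1), ("msg-a", 2), ("msg-a", 10), ("msg-b", 3), ("msg-a", 1)]

# HBase stores rows sorted by row key bytes; sorting here mimics that.
ordered = sorted(rows, key=lambda r: row_key(*r))
print(ordered)
```

With keys laid out this way, a scan that starts at a given guid reaches that
message's latest version first, so "thousands of versions" per message become
adjacent rows rather than thousands of column families.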

HTH,
James


On Sun, Mar 2, 2014 at 11:23 PM, Jigar Shah <jigar.shah@infodesk.com> wrote:

> I work in the news processing industry. Our current system processes more
> than a million articles per week and provides this data to users in real
> time. It also provides search capabilities via Lucene.
>
> We convert all news to the standard IPTC NewsML-G2 format
> <http://www.iptc.org/site/News_Exchange_Formats/NewsML-G2/>
> before providing it to users (in real time or via search).
>
> We need a component that runs analytical queries on the news data. I plan
> to load all of this data into HBase and then use MapReduce jobs to compute
> the analytical queries. Moreover, the current system is built on PostgreSQL
> and stores only 3 months of data; anything more than this is big data, as
> it doesn't fit on one server.
>
> But I am a bit confused about how to design a schema for it.
>
> Every news article has
>
> *"messageID" as guid*, the unique ID of the news message.
> *"version" as int,* incremented if a newer version of the same news message
> is published.
> There are other fields like location, channels, title, content, source,
> etc.
>
> The current database primary key is a composite of (messageID, version).
>
> I thought that I should use "messageID" as the "rowKey" in HBase and
> "version" as the "columnFamily", with all columns being fields of the news
> (like location, channels, title, body, sentTimstamp, ...).
>
> Is keeping "version" as the "columnFamily" a good idea?
>
> In reality, a single message may have thousands of versions.
>
>
