hbase-user mailing list archives

From Nick Dimiduk <ndimi...@gmail.com>
Subject Re: Design a datastore maintaining historical view of users.
Date Tue, 13 Jan 2015 20:54:29 GMT
With an entity-centric data model (i.e. customer_id as row key), you're
looking at a full table scan for every query. A 30-minute SLA puts you well
within the realm of a MapReduce/Cascading/Pig/Hive/Tez/Spark job. HBase can
work fine for this, but since you're not really in the low-latency world,
perhaps you'd consider a more analytical storage system (e.g., HDFS +
ORC/Parquet). Of course, if your data is extremely sparse, you'll land back
here at HBase.

You can achieve lower latencies with HBase by pushing query components into
the row key. However, if the queries are truly ad hoc, you'll probably want
secondary indices. Apache Phoenix is a great choice if you decide to pursue
this route. ES may also be a reasonable choice here, but it depends on many
other factors, including 'scale' and your philosophy about indices as a
data storage medium.
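
To make the row-key idea concrete, here is a minimal sketch in plain Python
(no HBase client involved). The key layout activity|item_id|customer_id, the
"|" separator, and the helper names are assumptions for illustration, not
something proposed in this thread. A sorted list stands in for HBase's
lexicographically ordered rows, and a prefix lookup stands in for a
start/stop-row range scan.

```python
from bisect import bisect_left

SEP = b"|"  # hypothetical separator; assumes it never appears inside a field


def make_row_key(activity: str, item_id: str, customer_id: str) -> bytes:
    """Build an activity-centric key so rows for one (activity, item) sort together."""
    return SEP.join(part.encode("utf-8") for part in (activity, item_id, customer_id))


def prefix_scan(sorted_keys, prefix: bytes):
    """Return the keys that start with prefix, mimicking a bounded range scan."""
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so nothing later can match
        matches.append(key)
    return matches


# A tiny stand-in for a table: keys kept in sorted order, as HBase stores them.
rows = sorted([
    make_row_key("item_view", "A", "cust1"),
    make_row_key("item_view", "A", "cust7"),
    make_row_key("item_view", "B", "cust1"),
    make_row_key("email_click", "e42", "cust7"),
])

# "Viewed item A" and "clicked an email" each become a narrow range read;
# the intersection is computed on the client side.
viewers_of_a = {k.split(SEP)[2] for k in prefix_scan(rows, b"item_view|A|")}
clickers = {k.split(SEP)[2] for k in prefix_scan(rows, b"email_click|")}
both = viewers_of_a & clickers
```

With this layout each predicate touches only the rows under its own prefix
instead of the whole table. The trade-off is that a single customer's
history is now scattered across many rows.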

If time is a frequent component of your query patterns, I recommend you
model it directly in your schema. You'll have more flexibility and better
performance than if you rely on HBase's timestamp for this attribute.

-n

On Mon, Jan 12, 2015 at 4:42 PM, Chen Wang <chen.apache.solr@gmail.com>
wrote:

> Hey Guys,
> I am seeking advice on designing a system that maintains a historical view of
> a user's activities over the past year. Each user can have different
> activities: email_open, email_click, item_view, add_to_cart, purchase etc.
> The query I would like to do is, for example,
>
> Find all customers who browsed item A in the past 6 months and also clicked
> an email.
> I would like the query to complete in a reasonable time frame (for
> example, within 30 minutes to retrieve 10 million such users).
>
> Since we already have an HBase cluster in place, HBase is my first
> choice. I can use customer_id as the row key and 'Activity' as the column
> family, then associate certain attributes with the column
> family, something like:
>
> customer_id, browse:{item_id:12334, timestamp:epoch}
>
> However, it seems that HBase would not be a good choice for supporting the
> queries above. Even if it's possible with a scan, it will be very inefficient
> due to the size of the data set.
>
> Is my understanding correct, and should I resort to another data store (ES, in
> my opinion)? Or has anyone done something similar with HBase?
>
> Thanks in advance.
> Chen
>
