hbase-user mailing list archives

From Tim Robertson <timrobertson...@gmail.com>
Subject Re: HFileInputFormat for MapReduce
Date Thu, 09 Feb 2012 08:55:05 GMT
Thanks Amandeep,

So this reply is a little off topic from the original.  I really
appreciate your interest and any ideas you may have - we are a small
team doing what we can with limited resources and a small cluster.

Our data is point based species observation data, crawled from
distributed databases over the www into a central store.  Currently
that is MySQL, and we Sqoop out and do a lot in Hive already as MySQL
is crippling us.  Today we understand 23 fields associated with a
point, but we want to grow to 100s of fields and offer ad-hoc
reporting services on those (e.g. filtering, group by, massaging into
new formats etc).  In addition to the ad hoc reporting, we seek to
offer a central identifier resolution service to reconcile multiple
IDs for the same digital object (random read).  We'll offer services
to allow people to improve the quality of those digital objects
(random write).  We are exploring HBase to provide:

- A store for the spiders to checkAndPut into, without being blocked
by any scans (MyISAM table locking is an issue)
- The ability to incrementally improve our interpretation of records
with more fields (e.g. add columns easily, where MySQL is suffering
badly on schema changes)
- Multiple "primary keys" for a record (secondary indexes)
- High random read rates (e.g. serving Google map tiles)
- Low random write rates (community curation of content)
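As an aside on the checkAndPut point above: what the spiders rely on is an atomic compare-and-set at the row level, so concurrent crawlers can't clobber each other's writes. The semantics can be illustrated with nothing but plain Java (the row keys and values here are made up for the example, and a ConcurrentHashMap stands in for the table):

```java
import java.util.concurrent.ConcurrentHashMap;

// Toy illustration of checkAndPut semantics: write only if the current
// value matches the expected one, atomically. Row keys and values are
// hypothetical; a ConcurrentHashMap stands in for the HBase table.
public class CheckAndPutDemo {
    static final ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();

    // Mirrors the compare-and-set contract: expected == null means
    // "only put if the cell is currently absent".
    static boolean checkAndPut(String row, String expected, String value) {
        if (expected == null) {
            return store.putIfAbsent(row, value) == null;
        }
        return store.replace(row, expected, value);
    }

    public static void main(String[] args) {
        // First spider records the observation: succeeds.
        System.out.println(checkAndPut("obs:1", null, "raw-v1"));
        // Second spider loses the race: the row already exists.
        System.out.println(checkAndPut("obs:1", null, "raw-v1b"));
        // A curation update succeeds only against the value it read.
        System.out.println(checkAndPut("obs:1", "raw-v1", "curated-v2"));
    }
}
```

No scan ever blocks these writes, which is exactly the property MyISAM table locking denies us.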

However, we also do a lot of full scans and metrics. For example
please zoom in on the Google map on http://eol.org/pages/922241/maps.
We have 1 million species, and calculate density maps to all zoom
levels, so end up with billions of Google tiles - basically we do huge
GROUP BY queries and export to billion row MySQL tables (those tiles
are rendered real time from MySQL using custom code).  Ideally we
would like to mount Hive tables as a set of "views" on HBase for the
scans and ad hoc analysis, but we are finding that to be 10 times
slower than the same queries on a Hive table backed by a text file
export from MySQL.  We are investigating that (see the
PerformanceEvaluation thread on this list), as it seems suspiciously
slow.
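For anyone curious about the tiling step itself: the addressing is just standard Web Mercator (Google/OSM) math, and the per-point computation that a density GROUP BY would key on is tiny. A minimal sketch (our production job differs, and the test point is arbitrary):

```java
// Standard Web Mercator (Google/OSM) tile addressing: map a lat/lng
// point to the (x, y) tile containing it at a given zoom level.
public class TileMath {
    static int lonToTileX(double lon, int zoom) {
        return (int) Math.floor((lon + 180.0) / 360.0 * (1 << zoom));
    }

    static int latToTileY(double lat, int zoom) {
        double rad = Math.toRadians(lat);
        // Mercator projection of latitude, normalised to [0, 1].
        double y = (1.0 - Math.log(Math.tan(rad) + 1.0 / Math.cos(rad)) / Math.PI) / 2.0;
        return (int) Math.floor(y * (1 << zoom));
    }

    public static void main(String[] args) {
        // Copenhagen (12.57 E, 55.68 N) at zoom 6 lands in tile (34, 20).
        System.out.println(lonToTileX(12.57, 6) + "," + latToTileY(55.68, 6));
    }
}
```

A GROUP BY over (taxon, zoom, tileX, tileY) with a count per cell is essentially all the density map calculation is; the expensive part is doing it for a billion points across every zoom level.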

From the limitations you mention, we can live with 1) and 2), but 3)
could be why my quick tests are already giving incorrect record
counts.  That sounds like a show stopper straight away, right?

One option for us would be HBase for the primary store for random
access, and periodic (e.g. 12 hourly) exports to HDFS for all the full
scanning.  Would you consider that sane?
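To make the periodic-export idea concrete (the table name, HDFS path, and schedule below are hypothetical): HBase ships an Export MapReduce job that scans through the region servers, so the dump reflects the memstore/version/delete reconciliation that reading HFiles directly would miss. A crontab sketch might look like:

```shell
# Crontab sketch: dump the (hypothetical) "occurrence" table to HDFS
# every 12 hours via HBase's bundled Export MR job. Because Export
# scans through the region servers, the dump sees reconciled data,
# unlike MR directly over HFiles.
0 */12 * * *  hbase org.apache.hadoop.hbase.mapreduce.Export \
    occurrence /user/crawler/exports/occurrence-$(date +\%Y\%m\%d\%H)
```

The full scans and Hive analysis would then run over the snapshot in HDFS, accepting up to 12 hours of staleness.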


On Thu, Feb 9, 2012 at 9:24 AM, Amandeep Khurana <amansk@gmail.com> wrote:
> Tim
> Going directly to HFiles has the following pitfalls:
> 1. You'll miss out on data that's in the memstore and has not been
> flushed to an HFile yet.
> 2. If you have deletes, you'll probably still see the deleted data in
> the HFiles where it resides, since a compaction hasn't taken place to
> throw it out yet.
> 3. Different values owing to different versions residing in different HFiles.
> In short, you miss out on the reconciliation that gets done at the RS level.
> If all you want to do is run MR jobs over the data in HBase, why not
> consider flat files and just run them over those?  Maybe run Hive queries,
> since you mentioned that. Why use HBase at all?
> (I'm not trying to shoo you away from HBase. Just curious what you are
> trying to accomplish)
> Amandeep
> On Feb 9, 2012, at 12:19 AM, Tim Robertson <timrobertson100@gmail.com> wrote:
>> Hi all,
>> Can anyone elaborate on the pitfalls or implications of running
>> MapReduce using an HFileInputFormat extending FileInputFormat?
>> I'm sure scanning goes through the RS for good reasons (guessing
>> handling splits, locking, RS monitoring etc) but can it ever be "safe"
>> to run MR over HFiles directly?  E.g. for scenarios like a region
>> split, would the MR just get stale data or would _bad_things_happen_?
>> For our use cases we could tolerate stale data, the occasional MR
>> failure on a node dropping out, and if we could detect a region split
>> we can suspend MR jobs on the HFile until the split is finished.  We
>> don't anticipate huge daily growth, but a lot of scanning and random
>> access.
>> I knocked up a quick example porting the Scala version of HFIF [1] to
>> Java [2] and full data scans appear to be an order of magnitude
>> quicker (30 -> 3 mins), but I suspect this is *fraught* with dangers.
>> If not, I'd like to try and take this further, possibly with Hive.
>> Thanks,
>> Tim
>> [1] https://gist.github.com/1120311
>> [2] http://pastebin.com/e5qeKgAd
