hbase-user mailing list archives

From: ChristopherC <ch...@tubemogul.com>
Subject: RE: Evaluation of HBase/Hadoop and some fundamental questions
Date: Thu, 22 Jan 2009 18:06:07 GMT

Sorry - yes, we are currently testing normal API calls, not doing M/R for
queries. We're in the midst of building a multi-threaded client whose
threads all make calls at the same time for different "sets" of data.
Normally this data would be located together, but since there's a lot of
it, we've tried to spread it around. Each client thread then groups and
sorts its results. However, this still (most likely) won't scale well.
The issue is that we're often hit with one request for a page and then
another, and each page pulls the entire data set from the datanode to the
client every time, when in reality only a small subset is displayed to
the user.
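
For the curious, the shape of what we're testing looks roughly like the
sketch below. This is not our actual code: the "facts" table, the salted
key layout NN|date|dimension, the salt bucket count, and the use of the
current HBase client API are all stand-ins for illustration.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ParallelFetch {
    // One scan per salt bucket; each thread pulls its slice of the set,
    // and the caller still has to group and sort, as described above.
    static List<Result> fetchAll(Connection conn, int saltBuckets, String day)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(saltBuckets);
        List<Future<List<Result>>> futures = new ArrayList<>();
        for (int i = 0; i < saltBuckets; i++) {
            // hypothetical key layout: "NN|yyyy-MM-dd|dimension";
            // '~' sorts after the key characters, closing the range
            final byte[] start = Bytes.toBytes(String.format("%02d|%s", i, day));
            final byte[] stop  = Bytes.toBytes(String.format("%02d|%s~", i, day));
            futures.add(pool.submit(() -> {
                List<Result> rows = new ArrayList<>();
                try (Table table = conn.getTable(TableName.valueOf("facts"));
                     ResultScanner rs = table.getScanner(
                             new Scan().withStartRow(start).withStopRow(stop))) {
                    for (Result r : rs) rows.add(r);
                }
                return rows;
            }));
        }
        List<Result> all = new ArrayList<>();
        for (Future<List<Result>> f : futures) all.addAll(f.get());
        pool.shutdown();
        return all;
    }
}

Note each thread still ships every matching row back to the client,
which is exactly the problem described above.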

We're currently examining options around this: some intelligent caching,
and possibly even extending HDFS to allow more API calls to work on the
data before it is returned. Basically, inject a class to run on the
datanode and have HDFS run it (like a filter) before returning the file
contents.
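
To make that concrete, the hook we have in mind would look something like
this. Purely hypothetical - no such interface exists in HDFS, and all the
names are invented; it's just the shape of the idea:

// Hypothetical only: a server-side hook HDFS does not actually provide.
// The node holding the data would run this before shipping bytes back,
// so only the relevant subset ever crosses the network.
public interface ServerSideFilter {
    /** Return just the portion of the block the client needs. */
    byte[] apply(byte[] blockContents);
}

HBase's scan filters already do something like this at the region level;
we want the same idea pushed down to the file level.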


jlist-3 wrote:
> 
> Christopher,
> 
> Currently there is no way to do "fast" MR jobs.  My understanding is that
> Google's MR implementation does allow for something like this, i.e.
> realtime distributed queries on bigtable.
> 
> In our application we push a "query engine" to the application level on
> top
> of HBase with its own in-memory cache.  We hot-update into this cache and
> don't do any pre-materializing of results as there are far too many
> combinations.
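> 
> A toy version of the cache layer, just to show the shape (the names and
> generics are invented; our real engine is more involved):
> 
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.function.BiFunction;
> import java.util.function.Function;
> 
> /** Application-level cache in front of HBase: compute once per query
>  *  key, then hot-update in place instead of pre-materializing every
>  *  combination of results. */
> public class QueryCache<K, V> {
>     private final Map<K, V> cache = new ConcurrentHashMap<>();
> 
>     public V get(K queryKey, Function<K, V> computeFromHBase) {
>         // First caller pays the HBase round trip; later callers hit memory.
>         return cache.computeIfAbsent(queryKey, computeFromHBase);
>     }
> 
>     public void hotUpdate(K queryKey, BiFunction<K, V, V> applyDelta) {
>         // Fold new data into an existing cached result rather than
>         // recomputing the whole thing from HBase.
>         cache.computeIfPresent(queryKey, applyDelta);
>     }
> }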
> 
> Have you tried writing your queries not in MR, just as a Java client app
> using the normal API?  You might be able to get acceptable performance
> that way, or at least a faster way to pre-cache query results.  How about
> using MR jobs to perform one level of grouping or sorting, storing the
> results back in HBase so you can do more efficient realtime queries?
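> 
> Sketched with HBase's MapReduce helper classes (the column families,
> the raw key layout, and the job driver are assumed or omitted - adjust
> to your schema):
> 
> import java.io.IOException;
> import org.apache.hadoop.hbase.client.Put;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> import org.apache.hadoop.hbase.mapreduce.TableMapper;
> import org.apache.hadoop.hbase.mapreduce.TableReducer;
> import org.apache.hadoop.hbase.util.Bytes;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> 
> /** Map: emit (dimension|day, value) for each raw fact row; assumes a
>  *  salt|day|dimension row key and a long stored in f:value. */
> class RollupMapper extends TableMapper<Text, LongWritable> {
>     @Override
>     protected void map(ImmutableBytesWritable row, Result r, Context ctx)
>             throws IOException, InterruptedException {
>         long v = Bytes.toLong(r.getValue(Bytes.toBytes("f"),
>                                          Bytes.toBytes("value")));
>         String[] parts = Bytes.toString(row.copyBytes()).split("\\|");
>         ctx.write(new Text(parts[2] + "|" + parts[1]), new LongWritable(v));
>     }
> }
> 
> /** Reduce: sum per (dimension, day) and write one rollup row back to
>  *  HBase, so the realtime query becomes a short scan plus a sort. */
> class RollupReducer extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {
>     @Override
>     protected void reduce(Text key, Iterable<LongWritable> vals, Context ctx)
>             throws IOException, InterruptedException {
>         long sum = 0;
>         for (LongWritable v : vals) sum += v.get();
>         Put put = new Put(Bytes.toBytes(key.toString()));
>         put.addColumn(Bytes.toBytes("agg"), Bytes.toBytes("sum"),
>                       Bytes.toBytes(sum));
>         ctx.write(new ImmutableBytesWritable(put.getRow()), put);
>     }
> }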
> 
> JG
> 
> 
>> -----Original Message-----
>> From: ChristopherC [mailto:chris@tubemogul.com]
>> Sent: Thursday, January 22, 2009 3:41 AM
>> To: hbase-user@hadoop.apache.org
>> Subject: Evaluation of HBase/Hadoop and some fundamental questions
>> 
>> 
>> Hello,
>> 
>> We've got a data warehouse application exposed to a web front end.
>> We're a DB-backed site evaluating HBase. We have a typical OLTP-type
>> application and run a lot of ETL to load a data warehouse that serves
>> our front end. We've outgrown the hardware and are having massive
>> scaling issues. After evaluating HBase we've discovered some things and
>> wanted to get some input. I'm sure others have hit this, and perhaps
>> there are other technologies or ideas for solving these problems.
>> 
>> First, the mapred function of Hadoop will solve all our ETL issues and
>> allow that to scale. It appears we can insert our data in a more
>> unstructured manner to save space, and loading looks okay. So we're now
>> focusing primarily on fast website access. HBase seems able to serve up
>> requests quite quickly now, however they are very basic access
>> requests, and I think for anything more complex we're stuck.
>> 
>> For example - a simple DW-type query would be to generate a top-N result
>> of some fact data over a variable date range, ordered descending. In DB
>> terms, this means querying across the date range, summing up the fact
>> data grouped by the desired keys, then sorting the results. On our
>> database, these queries are not scaling: both disk access and sort time
>> are too long, and the front end is suffering. Our hope was to distribute
>> this, but that's not really what HBase does. Ideally, if we created
>> map-reduce jobs to generate the results of these queries and pre-store
>> them, we could then retrieve them very quickly. However, this is similar
>> to a materialized view, and we'd have the same issue on a database: for
>> every date range and every combination of dimensions, storing every
>> result grows exponentially and becomes impossible.
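>> 
>> In plain Java terms, the top-N query above is just this (Fact and the
>> field names are made up; written in current Java for brevity):
>> 
>> import java.util.List;
>> import java.util.Map;
>> import java.util.stream.Collectors;
>> 
>> class TopNQuery {
>>     record Fact(String dimension, String day, long value) {}
>> 
>>     /** Filter to the date range, sum grouped by dimension, sort
>>      *  descending, keep the top n. */
>>     static List<Map.Entry<String, Long>> topN(List<Fact> facts,
>>                                               String from, String to, int n) {
>>         Map<String, Long> sums = facts.stream()
>>             .filter(f -> f.day().compareTo(from) >= 0
>>                       && f.day().compareTo(to) <= 0)
>>             .collect(Collectors.groupingBy(Fact::dimension,
>>                      Collectors.summingLong(Fact::value)));
>>         return sums.entrySet().stream()
>>             .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
>>             .limit(n)
>>             .collect(Collectors.toList());
>>     }
>> }
>> 
>> Trivial on a small list; the problem is that "facts" is the whole date
>> range pulled across the network.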
>> 
>> So, we were looking at ideas to let HBase distribute the query. A
>> map-reduce job is ideal, but far too slow for a real-time web request;
>> it's almost like we need a lighter version of it. For example, we just
>> want to return a small amount of data, and if the job runs into issues,
>> who cares - the user can resubmit. The objective is low latency, so we
>> need to make sure we spread the work around. In HBase, anything under
>> one key is stored together, so we've started structuring our keys to
>> hash them out across a wide range, spreading the data out. We've been
>> testing having many threads perform a lookup to find the range to
>> query, then all hit it at once. The issue remains that the only work
>> the nodes do is retrieving the keys. They do no sorting, no pruning;
>> it's all on the client to take the data and merge and sort it. This is
>> inefficient, as the client may need to pull millions of rows simply to
>> get the top 10. Ideally this would be done on the nodes.
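>> 
>> What we'd want each node to run is essentially a bounded top-N. A
>> sketch, with rows reduced to [key, metric] long pairs for illustration;
>> run on the client it caps memory at N entries, but the rows still cross
>> the network, which is the real cost:
>> 
>> import java.util.ArrayList;
>> import java.util.Comparator;
>> import java.util.Iterator;
>> import java.util.List;
>> import java.util.PriorityQueue;
>> 
>> class BoundedTopN {
>>     /** Stream millions of rows past a min-heap of size n, keeping
>>      *  only the n largest by the metric in slot 1. O(rows * log n). */
>>     static List<long[]> top(Iterator<long[]> rows, int n) {
>>         PriorityQueue<long[]> heap = new PriorityQueue<>(
>>                 n, Comparator.comparingLong((long[] r) -> r[1]));
>>         while (rows.hasNext()) {
>>             long[] row = rows.next();
>>             if (heap.size() < n) heap.offer(row);
>>             else if (row[1] > heap.peek()[1]) { heap.poll(); heap.offer(row); }
>>         }
>>         List<long[]> out = new ArrayList<>(heap);
>>         out.sort(Comparator.comparingLong((long[] r) -> r[1]).reversed());
>>         return out;
>>     }
>> }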
>> 
>> Obviously we're shoe-horning some of what HBase does to fit our needs,
>> but I'm just curious how others have solved this issue, and whether
>> there's any work anywhere, or possible future versions, that may
>> address these kinds of distributed queries.
>> 
>> Thanks

