hbase-user mailing list archives

From Jonathan Gray <jl...@streamy.com>
Subject Re: Public HBase data store?
Date Tue, 18 Aug 2009 23:59:51 GMT

We do things like that, both out of search indexes and to perform simple
"joins": one table might have an ordered list of ids stored together in a
family; we grab a "page" of those ids, and then perform the join by
grabbing a set of columns from a different table, one row per id.
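
A minimal sketch of that pattern, using plain Python dicts as stand-ins for the two tables. The table contents, row keys, and the helper names `fetch_page` and `join_page` are all hypothetical and only illustrate the access pattern; this is not the HBase client API.

```python
# In-memory stand-ins for two tables (hypothetical data and names).
# index_table: one row holds an ordered list of ids.
# item_table: one row per id, holding the columns we want to join in.
index_table = {
    "user42:feed": ["item003", "item001", "item007", "item002", "item005"],
}
item_table = {
    "item001": {"title": "one", "body": "first body"},
    "item002": {"title": "two", "body": "second body"},
    "item003": {"title": "three", "body": "third body"},
    "item005": {"title": "five", "body": "fifth body"},
    "item007": {"title": "seven", "body": "seventh body"},
}


def fetch_page(index_row, offset, limit):
    """Return one page of ids from the ordered id list in an index row."""
    return index_table[index_row][offset:offset + limit]


def join_page(ids, columns):
    """Fetch the requested columns from the item table, one row per id.

    The order of `ids` is preserved; ids with no matching row are skipped.
    """
    joined = []
    for row_id in ids:
        row = item_table.get(row_id)
        if row is None:
            continue  # id is listed in the index but the row is missing
        selected = {c: row[c] for c in columns if c in row}
        joined.append({"id": row_id, **selected})
    return joined


page = fetch_page("user42:feed", offset=1, limit=3)
rows = join_page(page, columns=["title"])
```

Only the requested columns are pulled for each id, so the page costs one small lookup per row rather than a copy of every field stored alongside the index.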

Yes, joins can be a dirty word, but in the cases where we do simple joins
like this, the data would be duplicated so many times that denormalization
is not feasible.  And in your case, actually storing the data fields in
Lucene is extremely expensive, so it can certainly make sense.

One thing... If you are going to have a number of "get by key" calls for 
a single query/page, running them in parallel can significantly improve 
total time.  This is especially the case if the keys you need to query 
for are well dispersed across the table (so you can hit multiple 

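The parallel "get by key" idea can be sketched as follows. This is a minimal illustration under stated assumptions: `get_row` is a hypothetical stand-in for a single get-by-key call (here it just sleeps briefly to simulate a round trip), and `get_rows_parallel` and `max_workers` are names invented for this example, not part of any HBase client.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def get_row(key):
    """Hypothetical stand-in for one get-by-key call against the store."""
    time.sleep(0.01)  # simulate network and server latency
    return {"key": key, "value": key.upper()}


def get_rows_parallel(keys, fetch=get_row, max_workers=8):
    """Issue one lookup per key concurrently; results follow the order of `keys`."""
    if not keys:
        return []
    workers = min(max_workers, len(keys))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, regardless of
        # which lookup finishes first.
        return list(pool.map(fetch, keys))


results = get_rows_parallel(["row-c", "row-a", "row-b"])
```

With sequential calls the page takes roughly the sum of the individual lookup latencies; with concurrent calls it approaches the latency of the slowest single lookup, provided the keys are spread widely enough that the lookups do not all queue on the same server.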

tim robertson wrote:
> Hi Ryan,
> What kind of random row lookup throughput do you get (e.g. rows per
> second) on the 10b store on the 20 machine cluster (assuming client
> isn't saturating)?
> I'm pondering indexing hbase rows in various ways with Lucene with
> only the row key stored.  Then page over search results and stream out
> response (transforming to preferred response format on the fly - RDF,
> CSV, XML etc) by doing sequential "get by key" calls.  Maybe stupid
> idea, but not sure what else can index so well.
> I'm just curious...
> Thanks,
> Tim
> On Tue, Aug 18, 2009 at 10:07 PM, Ryan Rawson<ryanobjc@gmail.com> wrote:
>> I run real machines, they aren't too expensive and are substantially
>> more performant than the virtualized servers EC2 offers. I have 10b
>> rows loaded on 20 machines, but you could probably do that on 10 or
>> so. Don't forget that 10b rows would require a $40000 machine to use
>> on mysql, so why not spend $40000 on a cluster?
>> On Tue, Aug 18, 2009 at 12:20 PM, Jonathan Gray<jlist@streamy.com> wrote:
>>> I have a little util I created called HBench.  You can customize the
>>> different parameters to generate data of varying sizes/patterns/etc.
>>> https://issues.apache.org/jira/browse/HBASE-1501
>>> JG
>>> Andrew Purtell wrote:
>>>> Most that I am aware of set up transient test environments up on EC2.
>>>> You can use one instance to create an EBS volume containing all software
>>>> and config you need, then snapshot it, then clone volumes based on the
>>>> snapshot to attach to any number of instances you need. Use X-Large
>>>> instances, at least 4. Give HBase regionservers 2GB heap. Then try your
>>>> 10 billion row test case.
>>>>   - Andy
>>>> ________________________________
>>>> From: Greg Cottman <greg.cottman@quest.com>
>>>> To: "hbase-user@hadoop.apache.org" <hbase-user@hadoop.apache.org>
>>>> Sent: Tuesday, August 18, 2009 4:13:23 PM
>>>> Subject: Public HBase data store?
>>>> Hi all,
>>>> I need to do some scalability testing of an HBase query tool.  We have
>>>> just started using HBase and sadly do not have an existing database against
>>>> which to test.  One thing we are interested in exploring is the difference
>>>> between using an index table strategy versus map/reduce queries without
>>>> indexes.
>>>> I realise this is a long shot and that queries are very data-dependent,
>>>> but...  Are there any publicly accessible HBase stores or reference sites
>>>> against which you can run test queries?
>>>> Or does everyone just create a 10 billion row test environment on their
>>>> local development box?  :-)
>>>> Cheers,
>>>> Greg.
