Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
Received-SPF: pass (nike.apache.org: domain of rmorgan466@gmail.com designates
 209.85.160.169 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <E11A4431-1E49-4E28-8FA1-3B3C596E0E26@salesforce.com>
References: 
 <CAOF-Kfj7eqbRGj=PjPz8iQ6vqn5pvdXZK0PX50KLhBEbDYR2vg@mail.gmail.com>
	<E11A4431-1E49-4E28-8FA1-3B3C596E0E26@salesforce.com>
Date: Thu, 25 Aug 2011 10:53:07 -0400
Message-ID: 
 <CAOF-Kfi_NTao2WAXQse1XsDRmkZ0MxXDvXjrb4d_eBLJ3-DHUQ@mail.gmail.com>
Subject: Re: schema help
From: Rita <rmorgan466@gmail.com>
To: Ian Varley <ivarley@salesforce.com>
Cc: user@hbase.apache.org
Content-Type: multipart/alternative; boundary=20cf300fb30736834e04ab5596b5

--20cf300fb30736834e04ab5596b5
Content-Type: text/plain; charset=ISO-8859-1

Thanks for your response.

30 million rows is the best case :-)

A couple of questions about using [fieldA][time] as my key:
  Would I have to insert in order?
  If not, how would HBase know to stop scanning before it reads the entire table?
  What would a query actually look like if my key was [fieldA][time]?
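
To make the last question concrete, here is my rough mental model, written as a small Python sketch rather than real HBase code. A sorted in-memory list stands in for the table; row_key, SortedTable, and the NUL separator are names and choices I made up for illustration, and I am assuming rows are kept sorted by key bytes and that the stop key is exclusive.

```python
import bisect
import struct

SEP = b"\x00"  # assumed separator; fieldA values must not contain it


def row_key(field_a, ts):
    # Big-endian 8-byte timestamp so that byte order matches numeric order.
    return field_a.encode("ascii") + SEP + struct.pack(">Q", ts)


class SortedTable:
    """Toy stand-in for a table whose rows are sorted by row key."""

    def __init__(self):
        self._keys = []   # row keys, always kept sorted
        self._rows = {}   # row key -> data

    def put(self, field_a, ts, data):
        key = row_key(field_a, ts)
        if key not in self._rows:
            # Each key is placed at its sorted position, whatever the put order.
            bisect.insort(self._keys, key)
        self._rows[key] = data

    def scan(self, field_a, start_ts, stop_ts):
        start = row_key(field_a, start_ts)  # inclusive
        stop = row_key(field_a, stop_ts)    # exclusive
        i = bisect.bisect_left(self._keys, start)  # seek to the first match
        out = []
        # Walk forward and stop at the first key at or past the stop key.
        while i < len(self._keys) and self._keys[i] < stop:
            key = self._keys[i]
            ts = struct.unpack(">Q", key[-8:])[0]
            out.append((ts, self._rows[key]))
            i += 1
        return out


table = SortedTable()
table.put("zCORE", 1314180700, "b")   # deliberately inserted out of order
table.put("aCORE", 1314180800, "x")
table.put("zCORE", 1314180693, "a")
rows = table.scan("zCORE", 1314180693, 1314280000)
```

If this is roughly right, insertion order does not matter because each put lands in its sorted position, and the scan stops at the first key at or past the stop key instead of reading the whole table.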

As a matter of fact, I can make 100% of my queries fit this form. I will
leave the other 5% out of my project/schema.


On Thu, Aug 25, 2011 at 10:13 AM, Ian Varley <ivarley@salesforce.com> wrote:

> Rita,
>
> There's no need to create separate tables here--the table is really just a
> "namespace" for keys. A better option would probably be having one table
> with "[fieldA][time]" (the two fields concatenated) as your row key. Then,
> you can seek directly to the start of your records in constant time, and
> then scan forward until you get to the end of the data (linear time in the
> size of data you expect to get back).
>
> The downside of this is that for the 5% of your queries that aren't in this
> form, you may have to do a full table scan. (Alternately, you could also
> maintain secondary indexes that help you get the data back with less than a
> full table scan; that would depend on the nature of the queries).
>
> In general, a good rule of thumb when designing a schema in HBase is, think
> first about how you'd ideally like to access the data. Then structure the
> data to match that access pattern. (This is obviously not ideal if you have
> lots of different access patterns, but then, that's what relational
> databases are for. Most commercial relational DBs wouldn't blink at doing
> analytical queries against 30 million rows.)
>
> Ian
>
> On Aug 25, 2011, at 9:03 AM, Rita wrote:
>
> Hello,
>
> I am trying to solve a time-related problem. I can certainly use OpenTSDB
> for this but was wondering if anyone had a clever way to create this type
> of schema.
>
> I have an inventory table,
>
> time (unix epoch), fieldA, fieldB, data
>
>
> There are about 30 million of these entries.
>
> 95% of my queries will look like this:
> show me where fieldA=zCORE from range [1314180693 to now]
>
> for fieldA, there is a possibility of 4000 unique items.
> for fieldB, there is a possibility of 2 unique items (bool).
>
> So, I was thinking of creating 4000*2 tables and placing the data like that
> so I can easily scan.
>
> Any thoughts about this? Will HBase freak out if I have 8000 tables?
>
> --
> --- Get your facts first, then you can distort them as you please.--
>
>
>


-- 
--- Get your facts first, then you can distort them as you please.--

--20cf300fb30736834e04ab5596b5--