hbase-user mailing list archives

From Sheng Chen <chensheng2...@gmail.com>
Subject Re: schema help
Date Mon, 29 Aug 2011 02:45:22 GMT
Thanks all.

The HFiles and key ranges I meant are all within one region.
If compactions don't keep up, a region can end up with many HFiles, each
covering most of the region's key range.
On a read, HBase then has to check every HFile that might hold the key
until it finds the right one.
Will a bloom filter solve this problem, or do I always need to trigger a
compaction when a region holds hundreds of HFiles?
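
(For reference, roughly what I mean in code: turning on a row-level bloom
filter for an existing family and then asking for a major compaction through
the Java admin API. The table/family names are placeholders and the exact
method signatures vary between HBase versions, so treat it as a sketch only.)

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.util.Bytes;

  public class BloomAndCompactSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HBaseAdmin admin = new HBaseAdmin(conf);

      // Reuse the current descriptor of family "d" in table "inventory"
      // so its other settings are left alone.
      HTableDescriptor desc = admin.getTableDescriptor(Bytes.toBytes("inventory"));
      HColumnDescriptor family = desc.getFamily(Bytes.toBytes("d"));
      family.setValue("BLOOMFILTER", "ROW");   // row-level bloom filter

      admin.disableTable("inventory");         // older releases need the table offline
      admin.modifyColumn("inventory", family);
      admin.enableTable("inventory");

      // Ask every region of the table to major-compact its HFiles.
      admin.majorCompact("inventory");
    }
  }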


Regards,
Sean


2011/8/27 Doug Meil <doug.meil@explorysmedical.com>

>
> +1 on everything said so far...
>
> Sean, you might also want to check this:
> http://hbase.apache.org/book.html#architecture
>
>
>
>
>
> On 8/26/11 2:50 PM, "lars hofhansl" <lhofhansl@yahoo.com> wrote:
>
> >In a nutshell, a change to HBase is performed like this:
> >1. The WAL entry is written and sync'ed to disk.
> >2. The memstore is updated (that's just a cache in memory).
> >3. When the memstore reaches a certain size, it is flushed to disk,
> >creating a new file.
> >4. When a certain number of files is reached, they are compacted
> >(combined into fewer files).
> >
> >
> >When you do a read, HBase scans the memstore and all relevant store files.
> >It does this in a way similar to a mergesort.
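> >
> >(A tiny standalone sketch, not HBase code, of that merge idea: several
> >already-sorted sources, think store files plus the memstore, are combined
> >through a priority queue so the rows come out in key order. All names here
> >are made up for illustration.)
> >
> >  import java.util.*;
> >
> >  public class MergeReadSketch {
> >
> >    static List<String> mergeSorted(List<List<String>> sources) {
> >      // Each heap entry is {sourceIndex, position}; entries are ordered by
> >      // the key they currently point at, so the smallest key is always on top.
> >      PriorityQueue<int[]> heap = new PriorityQueue<>(
> >          Comparator.comparing((int[] e) -> sources.get(e[0]).get(e[1])));
> >      for (int i = 0; i < sources.size(); i++) {
> >        if (!sources.get(i).isEmpty()) heap.add(new int[]{i, 0});
> >      }
> >      List<String> merged = new ArrayList<>();
> >      while (!heap.isEmpty()) {
> >        int[] top = heap.poll();
> >        merged.add(sources.get(top[0]).get(top[1]));
> >        if (top[1] + 1 < sources.get(top[0]).size()) {
> >          heap.add(new int[]{top[0], top[1] + 1});  // advance that source
> >        }
> >      }
> >      return merged;
> >    }
> >
> >    public static void main(String[] args) {
> >      List<List<String>> sources = Arrays.asList(
> >          Arrays.asList("row-a", "row-d", "row-k"),   // store file 1
> >          Arrays.asList("row-b", "row-c", "row-z"),   // store file 2
> >          Arrays.asList("row-e", "row-f"));           // memstore
> >      System.out.println(mergeSorted(sources));
> >      // -> [row-a, row-b, row-c, row-d, row-e, row-f, row-k, row-z]
> >    }
> >  }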
> >
> >-- Lars
> >
> >
> >
> >________________________________
> >From: Sheng Chen <chensheng2010@gmail.com>
> >To: user@hbase.apache.org
> >Sent: Thursday, August 25, 2011 11:08 PM
> >Subject: Re: schema help
> >
> >If rows are added with random keys and flushed periodically, is it
> >possible that every HFile holds almost the whole key range?
> >Will that affect random read performance before the compaction is done?
> >
> >Thanks.
> >
> >Sean
> >
> >2011/8/25 Ian Varley <ivarley@salesforce.com>
> >
> >> The rows don't need to be inserted in order; they're maintained in
> >> key-sorted order on disk by the architecture of HBase, which stores data
> >> sorted in memory and periodically flushes it to immutable files in HDFS
> >> (which are later compacted to make read access more efficient). HBase
> >> keeps track of which physical files might contain a given key range, and
> >> only reads the ones it needs to.
> >>
> >> To do a query through the java API, you could create a scanner with a
> >> startrow that is the concatenation of your value for fieldA and the start
> >> time, and an endrow that is the same fieldA value concatenated with the
> >> current time.
> >>
> >>
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html
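> >>
> >> A rough sketch of what that could look like with the Java client. The
> >> table name, column family, and exact key layout here are just assumptions
> >> (I stick a "|" between fieldA and an epoch time zero-padded to a fixed
> >> width so the keys sort as strings the way we want):
> >>
> >>   import java.io.IOException;
> >>   import org.apache.hadoop.conf.Configuration;
> >>   import org.apache.hadoop.hbase.HBaseConfiguration;
> >>   import org.apache.hadoop.hbase.client.HTable;
> >>   import org.apache.hadoop.hbase.client.Result;
> >>   import org.apache.hadoop.hbase.client.ResultScanner;
> >>   import org.apache.hadoop.hbase.client.Scan;
> >>   import org.apache.hadoop.hbase.util.Bytes;
> >>
> >>   public class RangeQuerySketch {
> >>     public static void main(String[] args) throws IOException {
> >>       Configuration conf = HBaseConfiguration.create();
> >>       HTable table = new HTable(conf, "inventory");
> >>
> >>       // Assumed row key layout: fieldA + "|" + zero-padded epoch seconds,
> >>       // so all rows for one fieldA sort together and by time within it.
> >>       String fieldA = "zCORE";
> >>       long from = 1314180693L;
> >>       long now = System.currentTimeMillis() / 1000L;
> >>
> >>       Scan scan = new Scan();
> >>       scan.setStartRow(Bytes.toBytes(fieldA + "|" + String.format("%010d", from)));
> >>       // the stop row is exclusive, so go one second past "now"
> >>       scan.setStopRow(Bytes.toBytes(fieldA + "|" + String.format("%010d", now + 1)));
> >>
> >>       ResultScanner scanner = table.getScanner(scan);
> >>       try {
> >>         for (Result r : scanner) {
> >>           System.out.println(Bytes.toString(r.getRow()));  // one matching row
> >>         }
> >>       } finally {
> >>         scanner.close();
> >>         table.close();
> >>       }
> >>     }
> >>   }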
> >>
> >> Ian
> >>
> >> On Aug 25, 2011, at 9:53 AM, Rita wrote:
> >>
> >> Thanks for your response.
> >>
> >> 30 million rows is the best case :-)
> >>
> >> A couple of questions about using [fieldA][time] as my key:
> >>  Would I have to insert in order?
> >>  If not, how would HBase know where to stop instead of scanning the entire table?
> >>  What would a query actually look like if my key were [fieldA][time]?
> >>
> >> As a matter of fact, I can do 100% of my queries this way; I will leave
> >> the 5% out of my project/schema.
> >>
> >>
> >> On Thu, Aug 25, 2011 at 10:13 AM, Ian Varley <ivarley@salesforce.com
> >> <mailto:ivarley@salesforce.com>> wrote:
> >> Rita,
> >>
> >> There's no need to create separate tables here--the table is really
> >>just a
> >> "namespace" for keys. A better option would probably be having one table
> >> with "[fieldA][time]" (the two fields concatenated) as your row key.
> >>Then,
> >> you can seek directly to the start of your records in constant time, and
> >> then scan forward until you get to the end of the data (linear time in
> >>the
> >> size of data you expect to get back).
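> >>
> >> As a sketch of what building such a key could look like (purely
> >> illustrative; the separator, padding, and family/qualifier names are made
> >> up, not anything you have to use):
> >>
> >>   import org.apache.hadoop.hbase.client.Put;
> >>   import org.apache.hadoop.hbase.util.Bytes;
> >>
> >>   public class RowKeySketch {
> >>     // fieldA, a separator, then epoch seconds zero-padded to 10 digits, so
> >>     // keys sort lexicographically in time order within one fieldA.
> >>     static byte[] rowKey(String fieldA, long epochSeconds) {
> >>       return Bytes.toBytes(fieldA + "|" + String.format("%010d", epochSeconds));
> >>     }
> >>
> >>     public static void main(String[] args) {
> >>       Put put = new Put(rowKey("zCORE", 1314180693L));
> >>       put.add(Bytes.toBytes("d"), Bytes.toBytes("fieldB"), Bytes.toBytes(true));
> >>       put.add(Bytes.toBytes("d"), Bytes.toBytes("data"), Bytes.toBytes("payload"));
> >>       // table.put(put);  // with an HTable opened on the same table
> >>     }
> >>   }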
> >>
> >> The downside of this is that for the 5% of your queries that aren't in
> >>this
> >> form, you may have to do a full table scan. (Alternately, you could also
> >> maintain secondary indexes that help you get the data back with less
> >>than a
> >> full table scan; that would depend on the nature of the queries).
> >>
> >> In general, a good rule of thumb when designing a schema in HBase is to
> >> think first about how you'd ideally like to access the data. Then
> >> structure the data to match that access pattern. (This is obviously not
> >> ideal if you have lots of different access patterns, but then, that's
> >> what relational databases are for. Most commercial relational DBs
> >> wouldn't blink at doing analytical queries against 30 million rows.)
> >>
> >> Ian
> >>
> >> On Aug 25, 2011, at 9:03 AM, Rita wrote:
> >>
> >> Hello,
> >>
> >> I am trying to solve a time-related problem. I can certainly use
> >> OpenTSDB for this, but I was wondering if anyone had a clever way to
> >> create this type of schema.
> >>
> >> I have an inventory table,
> >>
> >> time (unix epoch), fieldA, fieldB, data
> >>
> >>
> >> There are about 30 million of these entries.
> >>
> >> 95% of my queries will look like this:
> >> show me rows where fieldA=zCORE in the range [1314180693 to now]
> >>
> >> for fieldA, there is a possibility of 4000 unique items.
> >> for fieldB, there is a possibility of 2 unique items (bool).
> >>
> >> So, I was thinking of creating 4000*2 tables and placing the data like
> >> that so I can easily scan.
> >>
> >> Any thoughts about this? Will HBase freak out if I have 8000 tables?
> >>
> >>
> >>
> >>
> >>
> >>
> >> --
> >> --- Get your facts first, then you can distort them as you please.--
> >>
> >>
> >>
> >>
> >> --
> >> --- Get your facts first, then you can distort them as you please.--
> >>
>
>
