hbase-user mailing list archives

From Christian Schäfer <syrious3...@yahoo.de>
Subject Re: How to query by rowKey-infix
Date Fri, 03 Aug 2012 09:34:59 GMT
Hi Matt,

sure, I have this in mind as a last option (at least on a limited subset of the data).

Since we estimate some billions of rows per week, selective filtering needs to take place
on the server side.

But I agree that one could do the fine-grained filtering on the client side, on a handy
subset of the data, to avoid making the HBase schema & indexing (via coprocessors) too
complicated.

regards
Chris



----- Original Message -----
From: Matt Corgan <mcorgan@hotpads.com>
To: user@hbase.apache.org
CC:
Sent: Friday, 3 August 2012, 3:29
Subject: Re: How to query by rowKey-infix

Yeah - just thought I'd point it out, since people often have small tables
in their cluster alongside the big ones, and when generating reports you
sometimes don't care whether it finishes in 10 minutes or an hour.


On Thu, Aug 2, 2012 at 6:15 PM, Alex Baranau <alex.baranov.v@gmail.com> wrote:

> I think this is exactly what Christian is trying to (and should be trying
> to) avoid ;).
>
> I can't imagine a use case where you need to filter something, can do it
> with (at least) a server-side filter, and yet would want to do it on the
> client side... Doing filtering on the client side when you can do it on
> the server side just feels wrong, especially given that there's a lot of
> data in HBase (otherwise why would you use it?).
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
>
> On Thu, Aug 2, 2012 at 7:09 PM, Matt Corgan <mcorgan@hotpads.com> wrote:
>
> > Also Christian, don't forget you can read all the rows back to the
> > client and do the filtering there using whatever logic you like.  HBase
> > Filters can be thought of as an optimization (predicate push-down) over
> > client-side filtering.  Pulling all the rows over the network will be
> > slower, but I don't think we know enough about your data or speed
> > requirements to rule it out.
> >
> >
> > On Thu, Aug 2, 2012 at 3:57 PM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
> >
> > > Hi Christian!
> > >
> > > If we put secondary indexes aside and assume you are going with
> > > "heavy scans", you can try the following two things to make it much
> > > faster, if this is appropriate to your situation, of course.
> > >
> > > 1.
> > >
> > > > Is there a more elegant way to collect rows within time range X?
> > > > (Unfortunately, the date attribute is not equal to the timestamp
> > > > that is stored by HBase automatically.)
> > >
> > > Can you set the timestamp of the Puts to the one you have in the row
> > > key, instead of relying on the one that HBase sets automatically (the
> > > current ts)? If you can, this will improve reading speed a lot by
> > > letting you set a time range on the scanner. Depending on how you are
> > > writing your data, of course, but I assume that you mostly write data
> > > in a "time-increasing" manner.
> > >
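Alex's first suggestion can be sketched as follows. This is only an illustration: the key layout `userId-dateInMillis-sessionId` comes from the thread, while the fixed userId length and the commented HBase calls (the 0.90-era `Put.add(family, qualifier, ts, value)` and `Scan.setTimeRange(min, max)`) are assumptions about how one might wire it up.

```java
// Sketch: derive the Put/Scan timestamp from the dateInMillis infix of
// the row key instead of letting HBase assign the current time.
public class KeyTimestamp {

    // Extract the dateInMillis infix, assuming a fixed-length userId
    // followed by a '-' delimiter.
    static long dateMillis(String rowKey, int userIdLen) {
        int start = userIdLen + 1;             // skip "userId-"
        int end = rowKey.indexOf('-', start);  // up to the next delimiter
        return Long.parseLong(rowKey.substring(start, end));
    }

    public static void main(String[] args) {
        String rowKey = "aaaaa-1343865600000-3jh345j345kjh";
        long ts = dateMillis(rowKey, 5);
        System.out.println(ts);

        // With the HBase client, this timestamp would then be used roughly as:
        //   Put put = new Put(Bytes.toBytes(rowKey));
        //   put.add(family, qualifier, ts, value);     // explicit version ts
        // and on the read side:
        //   Scan scan = new Scan();
        //   scan.setTimeRange(rangeStartMillis, rangeEndMillis);
    }
}
```

With the cell timestamp equal to the row-key date, the scanner can skip whole store files whose time ranges fall outside the query window.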
> > > 2.
> > >
> > > If your userId has fixed length, or you can change it so that it has
> > > fixed length, then you can actually use something like a "wildcard"
> > > in the row key. There's a way in a Filter implementation to
> > > fast-forward to a record with a specific row key and thereby skip
> > > many records. This might be used as follows:
> > > * suppose your userId is 5 characters in length
> > > * suppose you are scanning for records with time between 2012-08-01
> > > and 2012-08-08
> > > * when you are scanning records and you encounter e.g. key
> > > "aaaaa_2012-08-09_3jh345j345kjh", where "aaaaa" is the user id, you
> > > can tell the scanner from your filter to fast-forward to key
> > > "aaaab_2012-08-01", because you know that all remaining records of
> > > user "aaaaa" don't fall into the interval you need (as the time for
> > > its records will be >= 2012-08-09).
> > >
> > > As of now, I believe you will have to implement your own custom
> > > filter to do that. Pointer:
> > > org.apache.hadoop.hbase.filter.Filter.ReturnCode.SEEK_NEXT_USING_HINT
> > > I believe I implemented a similar thing some time ago. If this idea
> > > works for you, I could look for the implementation and share it if it
> > > helps, or maybe even simply add it to the HBase codebase.
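A minimal sketch of the hint computation such a custom filter (one returning `ReturnCode.SEEK_NEXT_USING_HINT`) would perform, assuming the 5-character userId and "_" delimiter from Alex's example; the overflow case (a userId ending in the last possible character) is ignored for brevity:

```java
// Sketch of the fast-forward hint: given a fixed-length userId prefix and
// the scan window's start date, build the key the filter would ask the
// scanner to seek to (next userId + window start).
public class SeekHint {

    // Next key to seek to once the current user's remaining rows are
    // known to fall outside the time window: increment the last character
    // of the userId and append the window start.
    static String nextHint(String currentKey, int userIdLen, String windowStart) {
        String userId = currentKey.substring(0, userIdLen);
        char[] next = userId.toCharArray();
        next[userIdLen - 1]++;                  // "aaaaa" -> "aaaab"
        return new String(next) + "_" + windowStart;
    }

    public static void main(String[] args) {
        System.out.println(nextHint("aaaaa_2012-08-09_3jh345j345kjh", 5, "2012-08-01"));
        // -> aaaab_2012-08-01
    }
}
```

A real filter would return this hint from its seek-hint callback so the region server can skip all remaining rows of the current user in one jump.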
> > >
> > > Hope this helps,
> > >
> > > Alex Baranau
> > > ------
> > > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> > >
> > >
> > > On Thu, Aug 2, 2012 at 8:40 AM, Christian Schäfer
> > > <syrious3000@yahoo.de> wrote:
> > >
> > > >
> > > >
> > > > Excuse my double posting.
> > > > Here is the complete mail:
> > > >
> > > >
> > > > OK,
> > > >
> > > > at first I will try the scans.
> > > >
> > > > If that's too slow I will have to upgrade HBase (currently
> > > > 0.90.4-cdh3u2) to be able to use coprocessors.
> > > >
> > > >
> > > > Currently I'm stuck at the scans because they require two steps
> > > > (so maybe some kind of filter chaining is required):
> > > >
> > > >
> > > > The key:  userId-dateInMillis-sessionId
> > > >
> > > > First, I need to extract dateInMillis with a regex or substring
> > > > (using special delimiters for the date).
> > > >
> > > > Second, the extracted value must be parsed to a Long and set on a
> > > > RowFilter comparator like this:
> > > >
> > > > scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new
> > > > BinaryComparator(Bytes.toBytes((Long) dateInMillis))));
> > > >
> > > > How do I chain that?
> > > > Do I have to write a custom filter?
> > > > (I would like to avoid that because of the deployment effort.)
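Stock filters can be chained with `FilterList(Operator.MUST_PASS_ALL, ...)`, but extracting an infix and comparing it numerically is not something the shipped comparators do, so a custom filter is the usual answer. A hedged sketch of the core check that such a filter (e.g. a `FilterBase` subclass overriding `filterRowKey`) could delegate to, assuming userIds contain no '-':

```java
// Sketch of the two steps described above, factored into a plain helper:
// step 1 extracts the dateInMillis infix (between the first and second
// '-'); step 2 compares it against the wanted range.
public class InfixRangeCheck {

    static boolean inRange(String rowKey, long minMillis, long maxMillis) {
        int first = rowKey.indexOf('-');
        int second = rowKey.indexOf('-', first + 1);
        long millis = Long.parseLong(rowKey.substring(first + 1, second));
        return millis >= minMillis && millis < maxMillis;
    }

    public static void main(String[] args) {
        System.out.println(inRange("user1-1343865600000-sess9",
                1343800000000L, 1343900000000L));
        // -> true
    }
}
```

The filter class itself would still have to be deployed to the region servers' classpath, which is the deployment cost mentioned above.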
> > > >
> > > > regards
> > > > Chris
> > > >
> > > > ----- Original Message -----
> > > > From: Michael Segel <michael_segel@hotmail.com>
> > > > To: user@hbase.apache.org
> > > > CC:
> > > > Sent: Wednesday, 1 August 2012, 13:52
> > > > Subject: Re: How to query by rowKey-infix
> > > >
> > > > Actually, with coprocessors you can create a secondary index in
> > > > short order. Then your cost is going to be two fetches. Trying to
> > > > do a partial table scan will be more expensive.
> > > >
> > > > On Jul 31, 2012, at 12:41 PM, Matt Corgan <mcorgan@hotpads.com> wrote:
> > > >
> > > > > When deciding between a table scan vs a secondary index, you
> > > > > should try to estimate what percent of the underlying data blocks
> > > > > will be used in the query.  By default, each block is 64KB.
> > > > >
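Matt's estimate can be made concrete with a rough back-of-envelope calculation; the 64KB block size is HBase's default, while the per-user byte count below is an invented illustrative figure:

```java
// Back-of-envelope: how many users share one HFile block? If many do,
// nearly every block holds rows for some wanted user, so a plain table
// scan reads little that a secondary index could have skipped.
public class BlockEstimate {

    static long usersPerBlock(long bytesPerUser) {
        long blockSize = 64 * 1024;  // HBase default HFile block size
        return blockSize / bytesPerUser;
    }

    public static void main(String[] args) {
        // At an assumed 500 bytes/user, ~131 users share one 64KB block.
        System.out.println(usersPerBlock(500));
    }
}
```

At 1MB+ per user the ratio inverts (many blocks per user), which is the regime where the secondary index starts paying for itself.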
> > > > > If each user's data is small and you are fitting multiple users
> > > > > per block, then you're going to need all the blocks, so a table
> > > > > scan is better because it's simpler.  If each user has 1MB+ of
> > > > > data then you will want to pick out the individual blocks
> > > > > relevant to each date.  The secondary index will help you go
> > > > > directly to those sparse blocks, but with a cost in complexity,
> > > > > consistency, and extra denormalized data that knocks primary data
> > > > > out of your block cache.
> > > > >
> > > > > If latency is not a concern, I would start with the table scan.
> > > > > If that's too slow you add the secondary index, and if you still
> > > > > need it faster you do the primary key lookups in parallel as
> > > > > Jerry mentions.
> > > > >
> > > > > Matt
> > > > >
> > > > > On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam <chilinglam@gmail.com> wrote:
> > > > >
> > > > >> Hi Chris:
> > > > >>
> > > > >> I'm thinking about building a secondary index for primary key
> > > > >> lookup, then querying using the primary keys in parallel.
> > > > >>
> > > > >> I'm interested to see if there are other options too.
> > > > >>
> > > > >> Best Regards,
> > > > >>
> > > > >> Jerry
> > > > >>
> > > > >> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer
> > > > >> <syrious3000@yahoo.de> wrote:
> > > > >>
> > > > >>> Hello there,
> > > > >>>
> > > > >>> I designed a row key for queries that need best performance
> > > > >>> (~100 ms), which looks like this:
> > > > >>>
> > > > >>> userId-date-sessionId
> > > > >>>
> > > > >>> These queries (scans) are always based on a userId and
> > > > >>> sometimes additionally on a date, too. That's no problem with
> > > > >>> the key above.
> > > > >>>
> > > > >>> However, another kind of query is based on a given time range,
> > > > >>> where the leftmost userId is not given or known. In this case I
> > > > >>> need to get all rows covering the given time range, with their
> > > > >>> date, to create a daily report.
> > > > >>>
> > > > >>> As I can't set wildcards at the beginning of a left-based index
> > > > >>> for the scan, I only see the possibility of scanning the index
> > > > >>> of the whole table to collect the row keys that are inside the
> > > > >>> time range I'm interested in.
> > > > >>>
> > > > >>> Is there a more elegant way to collect rows within time range
> > > > >>> X? (Unfortunately, the date attribute is not equal to the
> > > > >>> timestamp that is stored by HBase automatically.)
> > > > >>>
> > > > >>> Could/should one maybe leverage some kind of row key caching to
> > > > >>> accelerate the collection process? Is that covered by the block
> > > > >>> cache?
> > > > >>>
> > > > >>> Thanks in advance for any advice.
> > > > >>>
> > > > >>> regards
> > > > >>> Chris
> > > > >>>
> > > > >>
> > > >
> > >
> > >
> > >
> > > --
> > > Alex Baranau
> > > ------
> > > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> > >
> >
>
>
>
> --
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
>

