From: anil gupta <anilgupta84@gmail.com>
Date: Wed, 22 Aug 2012 11:42:57 -0700
Subject: Re: How to query by rowKey-infix
To: user@hbase.apache.org, Christian Schäfer

Hi Christian,

I had similar requirements to yours. So far I have used timestamps for filtering the data, and I would say the performance is satisfactory. Here are the results of timestamp-based filtering: the table has 34 million records (average row size is 1.21 KB), and in 136 seconds I get the entire result of a query that matched 225 rows.

I am running an HBase 0.92, 8-node cluster on the VMware hypervisor. Each node has 3.2 GB of memory and 500 GB of HDFS space. Each hard drive in my set-up hosts 2 slave instances (2 VMs running DataNode, NodeManager, RegionServer). I have only allocated 1200 MB for the RegionServers. I haven't modified the block size of HDFS or HBase.
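The store-file skipping Anil relies on can be sketched outside HBase: each store file carries the min/max timestamp of its cells, so a scan restricted to a time range only needs to open files whose ranges overlap it. A minimal, self-contained sketch of that pruning logic (the `StoreFile` class and the numbers are illustrative stand-ins, not HBase's actual internals):

```java
import java.util.ArrayList;
import java.util.List;

public class TimeRangePruning {
    // Illustrative stand-in for the timestamp metadata a store file keeps.
    static class StoreFile {
        final String name;
        final long minTs, maxTs;
        StoreFile(String name, long minTs, long maxTs) {
            this.name = name; this.minTs = minTs; this.maxTs = maxTs;
        }
    }

    // Keep only the files whose [minTs, maxTs] overlaps the scan's [start, end).
    static List<StoreFile> filesToRead(List<StoreFile> files, long start, long end) {
        List<StoreFile> needed = new ArrayList<>();
        for (StoreFile f : files) {
            if (f.maxTs >= start && f.minTs < end) {
                needed.add(f);
            }
        }
        return needed;
    }

    public static void main(String[] args) {
        List<StoreFile> files = List.of(
            new StoreFile("hfile-1", 1000L, 2000L),
            new StoreFile("hfile-2", 2000L, 3000L),
            new StoreFile("hfile-3", 3000L, 4000L));
        // A scan restricted to [2500, 3500) never opens hfile-1.
        System.out.println(filesToRead(files, 2500L, 3500L).size());
    }
}
```

This only works when Put timestamps carry the event time from the row key, which is exactly the trick Alex suggests later in the thread.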
Considering the below-par hardware configuration of the cluster, I feel the performance is OK, and IMO it'll be better than a substring comparator on column values, since with a substring comparator filter you are essentially doing a FULL TABLE scan, whereas a timerange-based scan can *skip store files*.

On a side note, Alex created a JIRA for enhancing the current FuzzyRowFilter to also do range-based filtering. Here is the link: https://issues.apache.org/jira/browse/HBASE-6618 . You are more than welcome to chime in.

HTH,
Anil Gupta

On Thu, Aug 9, 2012 at 1:55 PM, Christian Schäfer wrote:

> Nice. Thanks, Alex, for sharing your experiences with that custom filter
> implementation.
>
> Currently I'm still using a key filter with a substring comparator.
> As soon as I have a good amount of test data, I will measure the
> performance of that naive substring filter in comparison to your fuzzy
> row filter.
>
> regards,
> Christian
>
>
> ________________________________
> From: Alex Baranau
> To: user@hbase.apache.org; Christian Schäfer
> Sent: 22:18 Thursday, 9 August 2012
> Subject: Re: How to query by rowKey-infix
>
> jfyi: documented FuzzyRowFilter usage here: http://bit.ly/OXVdbg. Will
> add documentation to the HBase book very soon [1]
>
> Alex Baranau
> ------
> Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>
> [1] https://issues.apache.org/jira/browse/HBASE-6526
>
> On Fri, Aug 3, 2012 at 6:14 PM, Alex Baranau wrote:
>
> > Good!
> >
> > Submitted an initial patch of the fuzzy row key filter at
> > https://issues.apache.org/jira/browse/HBASE-6509. You can just copy the
> > filter class and include it in your code and use it in your setup as any
> > other custom filter (no need to patch HBase).
> >
> > Please let me know if you try it out (or post your comments at
> > HBASE-6509).
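The core idea behind the fuzzy row key filter discussed here is a byte-wise mask over the row key: a 0 mask byte means that position must match the pattern exactly, a non-zero byte marks a wildcard position. A self-contained sketch of just that matching rule (not the actual HBase implementation; key layout and names are illustrative):

```java
public class FuzzyMatchSketch {
    // mask[i] == 0: row[i] must equal pattern[i]; mask[i] != 0: any byte accepted.
    static boolean fuzzyMatch(byte[] row, byte[] pattern, byte[] mask) {
        if (row.length < pattern.length) return false;
        for (int i = 0; i < pattern.length; i++) {
            if (mask[i] == 0 && row[i] != pattern[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Key layout userId(4)_date(10): wildcard the userId, fix the date.
        byte[] pattern = "????_2012-08-01".getBytes();
        byte[] mask = new byte[pattern.length];
        for (int i = 0; i < 4; i++) mask[i] = 1; // first 4 bytes: any userId
        System.out.println(fuzzyMatch("anna_2012-08-01_s1".getBytes(), pattern, mask));
        System.out.println(fuzzyMatch("anna_2012-08-02_s1".getBytes(), pattern, mask));
    }
}
```

This is exactly the "wildcard in the middle of a left-anchored key" shape the thread's userId-date-sessionId design needs; the real filter additionally fast-forwards the scanner past non-matching ranges.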
> >
> > Alex Baranau
> > ------
> > Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> >
> > On Fri, Aug 3, 2012 at 5:23 AM, Christian Schäfer wrote:
> >
> >> Hi Alex,
> >>
> >> thanks a lot for the hint about setting the timestamp of the put.
> >> I didn't know that this was possible, but it solves the problem
> >> (the first test was successful).
> >> So I'm really glad that I don't need to apply a filter to extract the
> >> time and so on for every row.
> >>
> >> Nevertheless, I would like to see your custom filter implementation.
> >> It would be nice if you could provide it, helping me to get a bit into it.
> >>
> >> And yes, that helped :)
> >>
> >> regards
> >> Chris
> >>
> >>
> >> ________________________________
> >> From: Alex Baranau
> >> To: user@hbase.apache.org; Christian Schäfer
> >> Sent: 0:57 Friday, 3 August 2012
> >>
> >> Subject: Re: How to query by rowKey-infix
> >>
> >> Hi Christian!
> >>
> >> If we put off secondary indexes and assume you are going with "heavy
> >> scans", you can try the two following things to make it much faster,
> >> if that is appropriate to your situation.
> >>
> >> 1.
> >>
> >>> Is there a more elegant way to collect rows within time range X?
> >>> (Unfortunately, the date attribute is not equal to the timestamp that
> >>> is stored by hbase automatically.)
> >>
> >> Can you set the timestamp of the Puts to the one you have in the row
> >> key, instead of relying on the one that HBase sets automatically (the
> >> current ts)? If you can, this will improve reading speed a lot, by
> >> setting a time range on the scanner. It depends on how you are writing
> >> your data, of course, but I assume that you mostly write data in a
> >> "time-increasing" manner.
> >>
> >> 2.
> >>
> >> If your userId has fixed length, or you can change it so that it has
> >> fixed length, then you can actually use something like a "wildcard"
> >> in the row key.
> >> There's a way in a Filter implementation to fast-forward to the record
> >> with a specific row key and thereby skip many records. This might be
> >> used as follows:
> >> * suppose your userId is 5 characters in length
> >> * suppose you are scanning for records with time between 2012-08-01
> >>   and 2012-08-08
> >> * when you are scanning records and you encounter e.g. the key
> >>   "aaaaa_2012-08-09_3jh345j345kjh", where "aaaaa" is the user id, you
> >>   can tell the scanner from your filter to fast-forward to the key
> >>   "aaaab_2012-08-01". You know that all remaining records of user
> >>   "aaaaa" don't fall into the interval you need (as the time for its
> >>   records will be >= 2012-08-09).
> >>
> >> As of now, I believe you will have to implement a custom filter to do
> >> that. Pointer:
> >> org.apache.hadoop.hbase.filter.Filter.ReturnCode.SEEK_NEXT_USING_HINT
> >> I believe I implemented a similar thing some time ago. If this idea
> >> works for you, I could look for the implementation and share it, if
> >> that helps. Or maybe even simply add it to the HBase codebase.
> >>
> >> Hope this helps,
> >>
> >> Alex Baranau
> >> ------
> >> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch
> >> - Solr
> >>
> >>
> >> On Thu, Aug 2, 2012 at 8:40 AM, Christian Schäfer wrote:
> >>
> >>> Excuse my double posting.
> >>> Here is the complete mail:
> >>>
> >>> OK,
> >>>
> >>> at first I will try the scans.
> >>>
> >>> If that's too slow I will have to upgrade hbase (currently
> >>> 0.90.4-cdh3u2) to be able to use coprocessors.
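The fast-forward hint Alex describes boils down to computing the next key to seek to: increment the fixed-length userId by one and append the range's start date. A self-contained sketch of just that hint computation (a plain function, not the `Filter` subclass you would wrap it in; it assumes lowercase a-z userIds, purely for illustration):

```java
public class SeekHintSketch {
    // Given a key "userId_date_sessionId" with a fixed-length userId and a
    // scan range starting at startDate, compute the key to fast-forward to:
    // the lexicographically next userId, followed by the range start.
    static String nextHintKey(String currentKey, int userIdLen, String startDate) {
        char[] userId = currentKey.substring(0, userIdLen).toCharArray();
        // Increment the userId like a base-26 number (carry on 'z' overflow).
        for (int i = userIdLen - 1; i >= 0; i--) {
            if (userId[i] != 'z') { userId[i]++; break; }
            userId[i] = 'a'; // carry into the next position to the left
        }
        return new String(userId) + "_" + startDate;
    }

    public static void main(String[] args) {
        // All remaining "aaaaa" rows lie past the range end, so skip ahead.
        System.out.println(nextHintKey("aaaaa_2012-08-09_3jh345j345kjh", 5, "2012-08-01"));
    }
}
```

In the real filter, this key would be returned from the seek-hint callback after signalling `ReturnCode.SEEK_NEXT_USING_HINT`, letting the scanner jump over the rest of the current user's out-of-range rows.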
> >>>
> >>> Currently I'm stuck at the scans because the query requires two steps
> >>> (therefore maybe some kind of filter chaining is required).
> >>>
> >>> The key: userId-dateInMillis-sessionId
> >>>
> >>> At first I need to extract dateInMillis with a regex or substring
> >>> (using special delimiters for the date).
> >>>
> >>> Second, the extracted value must be parsed to Long and passed to a
> >>> RowFilter comparator like this:
> >>>
> >>> scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new
> >>> BinaryComparator(Bytes.toBytes((Long)dateInMillis))));
> >>>
> >>> How to chain that?
> >>> Do I have to write a custom filter?
> >>> (I would like to avoid that due to deployment.)
> >>>
> >>> regards
> >>> Chris
> >>>
> >>>
> >>> ----- Original Message -----
> >>> From: Michael Segel
> >>> To: user@hbase.apache.org
> >>> CC:
> >>> Sent: 13:52 Wednesday, 1 August 2012
> >>> Subject: Re: How to query by rowKey-infix
> >>>
> >>> Actually, with coprocessors you can create a secondary index in short
> >>> order. Then your cost is going to be 2 fetches. Trying to do a partial
> >>> table scan will be more expensive.
> >>>
> >>> On Jul 31, 2012, at 12:41 PM, Matt Corgan wrote:
> >>>
> >>>> When deciding between a table scan vs a secondary index, you should
> >>>> try to estimate what percent of the underlying data blocks will be
> >>>> used in the query. By default, each block is 64KB.
> >>>>
> >>>> If each user's data is small and you are fitting multiple users per
> >>>> block, then you're going to need all the blocks, so a tablescan is
> >>>> better because it's simpler. If each user has 1MB+ of data, then you
> >>>> will want to pick out the individual blocks relevant to each date.
> >>>> The secondary index will help you go directly to those sparse blocks,
> >>>> but with a cost in complexity, consistency, and extra denormalized
> >>>> data that knocks primary data out of your block cache.
> >>>>
> >>>> If latency is not a concern, I would start with the table scan.
> >>>> If that's too slow, you add the secondary index, and if you still
> >>>> need it faster you do the primary key lookups in parallel, as Jerry
> >>>> mentions.
> >>>>
> >>>> Matt
> >>>>
> >>>> On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam wrote:
> >>>>
> >>>>> Hi Chris:
> >>>>>
> >>>>> I'm thinking about building a secondary index for primary key
> >>>>> lookup, then querying using the primary keys in parallel.
> >>>>>
> >>>>> I'm interested to see if there are other options too.
> >>>>>
> >>>>> Best Regards,
> >>>>>
> >>>>> Jerry
> >>>>>
> >>>>> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <
> >>>>> syrious3000@yahoo.de> wrote:
> >>>>>
> >>>>>> Hello there,
> >>>>>>
> >>>>>> I designed a row key for queries that need best performance
> >>>>>> (~100 ms), which looks like this:
> >>>>>>
> >>>>>> userId-date-sessionId
> >>>>>>
> >>>>>> These queries (scans) are always based on a userId and sometimes
> >>>>>> additionally on a date, too.
> >>>>>> That's no problem with the key above.
> >>>>>>
> >>>>>> However, another kind of query shall be based on a given time range
> >>>>>> where the leftmost userId is not given or known.
> >>>>>> In this case I need to get all rows covering the given time range
> >>>>>> with their date to create a daily report.
> >>>>>>
> >>>>>> As I can't set wildcards at the beginning of a left-based index for
> >>>>>> the scan, I only see the possibility of scanning the index of the
> >>>>>> whole table to collect the rowKeys that are inside the time range
> >>>>>> I'm interested in.
> >>>>>>
> >>>>>> Is there a more elegant way to collect rows within time range X?
> >>>>>> (Unfortunately, the date attribute is not equal to the timestamp
> >>>>>> that is stored by hbase automatically.)
> >>>>>>
> >>>>>> Could/should one maybe leverage some kind of row key caching to
> >>>>>> accelerate the collection process?
> >>>>>> Is that covered by the block cache?
> >>>>>>
> >>>>>> Thanks in advance for any advice.
> >>>>>>
> >>>>>> regards
> >>>>>> Chris
> >>>>>>
> >>>>>
> >>>
> >>
> >>
> >> --
> >> Alex Baranau
> >> ------
> >> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch
> >> - Solr
> >
> >
> > --
> > Alex Baranau
> > ------
> > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch
> > - Solr
>

--
Thanks & Regards,
Anil Gupta
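Matt's table-scan-vs-secondary-index rule of thumb from earlier in the thread reduces to simple arithmetic on the 64 KB default block size. A crude, self-contained reading of that heuristic (the threshold and row sizes are illustrative, not a tuning recommendation):

```java
public class BlockEstimateSketch {
    static final long BLOCK_SIZE = 64 * 1024; // HBase default data block size

    // Matt's heuristic, crudely: if several rows fit in one block, a
    // selective query still touches most blocks anyway, so a plain table
    // scan is the simpler choice; if a single user's data spans many
    // blocks, an index that jumps straight to them pays off.
    static String recommend(long avgBytesPerUser) {
        double usersPerBlock = (double) BLOCK_SIZE / avgBytesPerUser;
        return usersPerBlock >= 1.0 ? "table scan" : "secondary index";
    }

    public static void main(String[] args) {
        System.out.println(recommend(1240L));    // ~1.21 KB rows, as in Anil's table
        System.out.println(recommend(2097152L)); // 2 MB per user, Matt's "1MB+" case
    }
}
```

The real decision also weighs the cache pollution and consistency costs Matt lists, which this one-liner deliberately ignores.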