hbase-user mailing list archives

From Flavio Pompermaier <pomperma...@okkam.it>
Subject Re: Help in designing row key
Date Wed, 03 Jul 2013 09:14:10 GMT
Yeah, I was thinking of adding a normalization step so that I can use
FuzzyRowFilter, but it is not clear to me whether integers must also be
normalized.
Let me explain better. Suppose that I follow your advice and produce keys
like:
 - 1|1|somehash|sometimestamp
 - 55|555|somehash|sometimestamp

Would they match the same pattern, or do I have to normalize them to the
following?
 - 001|001|somehash|sometimestamp
 - 055|555|somehash|sometimestamp
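
To make the question concrete, here is a minimal sketch in plain Java (no
HBase dependency) of how padding affects field positions, which is what
matters for position-based matching of the kind FuzzyRowFilter does. The
class and method names and the widths (3 digits for the integers, 13 digits
for a millisecond timestamp) are assumptions for illustration only.

```java
import java.util.Locale;

// Hypothetical helper, not part of HBase.
class PaddedKeyDemo {
    // Unpadded key: the width of each integer field depends on its value.
    static String rawKey(int source, int type, String hash, long timestamp) {
        return source + "|" + type + "|" + hash + "|" + timestamp;
    }

    // Zero-padded key: assumed widths are 3 digits for source and type,
    // 13 digits for the timestamp, so every field starts at a fixed offset.
    static String paddedKey(int source, int type, String hash, long timestamp) {
        return String.format(Locale.ROOT, "%03d|%03d|%s|%013d",
                source, type, hash, timestamp);
    }

    public static void main(String[] args) {
        String hash = "be9173589a7471a7179e928adc1a86f7";
        long ts = 1372837702753L;

        // Offset of the hash field differs between unpadded keys: 4 and 7.
        System.out.println(rawKey(1, 1, hash, ts).indexOf(hash));
        System.out.println(rawKey(55, 555, hash, ts).indexOf(hash));

        // With padding the hash always starts at offset 8.
        System.out.println(paddedKey(1, 1, hash, ts).indexOf(hash));
        System.out.println(paddedKey(55, 555, hash, ts).indexOf(hash));
    }
}
```

With the unpadded form, a single byte mask cannot line up with the same
field in both keys, because the fields sit at different offsets.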

Moreover, I noticed that you used dots ('.') to separate the fields instead
of pipes ('|'). Is there a reason for that (performance, for example), or is
it just your favourite separator?

Best,
Flavio


On Wed, Jul 3, 2013 at 10:12 AM, Mike Axiak <mike@axiak.net> wrote:

> I'm not sure if you're eliding this fact or not, but you'd be much
> better off if you used a fixed-width format for your keys. So in your
> example, you'd have:
>
> PATTERN: source(4-byte-int).type(4-byte-int or smaller).fixed 128-bit
> hash.8-byte timestamp
>
> Example: \x00\x00\x00\x01\x00\x00\x02\x03....
>
> The advantage of this is not only that it's significantly less data
> (remember your key is stored on each KeyValue), but also you can now
> use FuzzyRowFilter and other techniques to quickly perform scans. The
> disadvantage is that you have to normalize the source-> integer but I
> find I can either store that in an enum or cache it for a long time so
> it's not a big issue.
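
A minimal sketch of the fixed-width layout described above, in plain Java.
The class name, the use of MD5 as the 128-bit hash, and the idea of hashing
the name are assumptions for illustration; the message itself only gives the
field widths.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of HBase.
class FixedWidthKey {
    // 4-byte source + 4-byte type + 16-byte hash + 8-byte timestamp.
    static final int KEY_LENGTH = 4 + 4 + 16 + 8;

    static byte[] rowKey(int sourceId, int typeId, String name, long timestamp) {
        byte[] hash;
        try {
            // Assumption: MD5 is used only because it yields exactly 128 bits.
            hash = MessageDigest.getInstance("MD5")
                    .digest(name.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
        // ByteBuffer writes big-endian by default, so every field has a
        // fixed width and a fixed offset within the key.
        return ByteBuffer.allocate(KEY_LENGTH)
                .putInt(sourceId)
                .putInt(typeId)
                .put(hash)
                .putLong(timestamp)
                .array();
    }

    public static void main(String[] args) {
        // source = 1, type = 0x0203: the key starts with
        // \x00\x00\x00\x01\x00\x00\x02\x03, as in the example above.
        byte[] key = rowKey(1, 0x0203, "oven", 1372837702753L);
        System.out.println(key.length);
    }
}
```

Every key is 32 bytes regardless of the values, compared with the variable
length of the string form.
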
>
> -Mike
>
> On Wed, Jul 3, 2013 at 4:05 AM, Flavio Pompermaier <pompermaier@okkam.it>
> wrote:
> > Thank you very much for the great support!
> > This is how I thought to design my key:
> >
> > PATTERN: source|type|qualifier|hash(name)|timestamp
> > EXAMPLE:
> > google|appliance|oven|be9173589a7471a7179e928adc1a86f7|1372837702753
> >
> > Do you think my key could be good for my scope (my search will be
> > essentially by source or source|type)?
> > Another point is that initially I will not have so many sources, so I
> > will probably have only google|* but in the next phases there could be
> > more sources.
> >
> > Best,
> > Flavio
> >
> > On Tue, Jul 2, 2013 at 7:53 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >
> >> For #1, yes - the client receives less data after filtering.
> >>
> >> For #2, please take a look at TestMultiVersions
> >> (./src/test/java/org/apache/hadoop/hbase/TestMultiVersions.java in 0.94)
> >> for time range:
> >>
> >>     scan = new Scan();
> >>
> >>     scan.setTimeRange(1000L, Long.MAX_VALUE);
> >>
> >> For row key selection, you need a filter. Take a look at
> >> FuzzyRowFilter.java
> >>
> >> Cheers
> >>
> >> On Tue, Jul 2, 2013 at 10:35 AM, Flavio Pompermaier <pompermaier@okkam.it>
> >> wrote:
> >>
> >> > Thanks for the reply! I have two more questions:
> >> >
> >> > 1) Is it true that filtering on timestamps doesn't affect performance?
> >> > 2) Could you send me a little snippet showing how you would do such a
> >> > filter (by row key + timestamp)? For example, get all rows whose key
> >> > starts with 'someid-' and whose timestamp is greater than some timestamp?
> >> >
> >> > Best,
> >> > Flavio
> >> >
> >> >
> >> > On Tue, Jul 2, 2013 at 6:25 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >> >
> >> > > bq. Using timestamp in row-keys is discouraged
> >> > >
> >> > > The above is true.
> >> > > Prefixing row key with timestamp would create hot region.
> >> > >
> >> > > bq. should I filter by a simpler row-key plus a filter on timestamp?
> >> > >
> >> > > You can do the above.
> >> > >
> >> > > On Tue, Jul 2, 2013 at 9:13 AM, Flavio Pompermaier <pompermaier@okkam.it>
> >> > > wrote:
> >> > >
> >> > > > Hi everybody,
> >> > > >
> >> > > > in my use case I have to perform batch analysis skipping old data.
> >> > > > For example, I want to process all rows created after a certain
> >> > > > timestamp, passed as a parameter.
> >> > > >
> >> > > > What is the most effective way to do this?
> >> > > > Should I design my row-key to embed the timestamp?
> >> > > > Or is just filtering by the timestamp of the row fast as well? Or
> >> > > > what else?
> >> > > >
> >> > > > Initially I was thinking to compose my key as:
> >> > > > timestamp|source|title|type
> >> > > >
> >> > > > but:
> >> > > >
> >> > > > 1) Using timestamp in row-keys is discouraged
> >> > > > 2) If this design is ok, I still have problems filtering by
> >> > > > timestamp with this approach, because I cannot find a way to filter
> >> > > > numerically (instead of alphanumerically/by string). Example:
> >> > > > 1372776400441|something has a lower timestamp than
> >> > > > 1372778470913|somethingelse, but I cannot select all rows whose key
> >> > > > is "numerically" greater than 1372776400441. Is it possible to
> >> > > > overcome this issue?
> >> > > > 3) If this design is not ok, should I filter by a simpler row-key
> >> > > > plus a filter on timestamp? Or what else?
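
On the numeric-versus-string point in 2): a minimal sketch in plain Java,
with hypothetical names, of how a fixed-width big-endian encoding of a
non-negative long sorts in numeric order under unsigned byte-wise
comparison (the order HBase uses for row keys), while decimal strings sort
numerically only when they have the same number of digits.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of HBase.
class TimestampOrder {
    // Encode a timestamp as 8 big-endian bytes (ByteBuffer's default order).
    static byte[] encode(long timestamp) {
        return ByteBuffer.allocate(Long.BYTES).putLong(timestamp).array();
    }

    // Lexicographic comparison treating each byte as unsigned.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // As strings, "999" sorts after "1000" because '9' > '1'.
        System.out.println("999".compareTo("1000") > 0);

        // As fixed-width bytes, 999 sorts before 1000.
        System.out.println(compareUnsigned(encode(999L), encode(1000L)) < 0);

        // The two timestamps from the example keep their numeric order.
        System.out.println(compareUnsigned(
                encode(1372776400441L), encode(1372778470913L)) < 0);
    }
}
```

This only holds for non-negative values; a negative long has its sign bit
set and would sort after every positive value.
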
> >> > > >
> >> > > > Best,
> >> > > > Flavio
> >> > > >
> >> > >
> >> >
> >>
>
