hbase-user mailing list archives

From Flavio Pompermaier <pomperma...@okkam.it>
Subject Re: Help in designing row key
Date Wed, 03 Jul 2013 08:05:16 GMT
Thank you very much for the great support!
This is how I plan to design my key:

PATTERN: source|type|qualifier|hash(name)|timestamp
EXAMPLE:
google|appliance|oven|be9173589a7471a7179e928adc1a86f7|1372837702753
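
A minimal sketch of how such a composite key could be assembled in plain Java. This assumes the 32-hex-character hash in the example is MD5; `buildRowKey` is a hypothetical helper for illustration, not part of any HBase API:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RowKeyBuilder {

    // Hypothetical helper, not HBase API: joins the key components with
    // '|' and hashes the name (MD5 assumed here, yielding 32 hex chars).
    static String buildRowKey(String source, String type, String qualifier,
                              String name, long timestamp)
            throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(name.getBytes(StandardCharsets.UTF_8));
        String hash = String.format("%032x", new BigInteger(1, digest));
        return source + "|" + type + "|" + qualifier + "|" + hash + "|" + timestamp;
    }

    public static void main(String[] args) throws Exception {
        // Components from the example above; the resulting key has the
        // shape source|type|qualifier|hash(name)|timestamp.
        System.out.println(
                buildRowKey("google", "appliance", "oven", "oven", 1372837702753L));
    }
}
```

Keeping the hash fixed-width (32 hex chars) means every field after it starts at a predictable byte offset, which matters later for fuzzy/prefix matching on the key.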

Do you think this key would work for my use case (my searches will be
essentially by source or source|type)?
Another point is that initially I will not have many sources, so I will
probably have only google|*, but in later phases there could be more
sources.

Best,
Flavio
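
Since the searches are by source or source|type prefix, the scan range can be derived lexicographically: the stop row is the prefix with its last character incremented. A plain-Java sketch of that idea (`prefixStopRow` is a hypothetical helper; in HBase itself a PrefixFilter, or start/stop rows on the Scan, achieves the same):

```java
import java.util.Arrays;
import java.util.TreeSet;

public class PrefixScanRange {

    // Hypothetical helper: smallest string strictly greater than every
    // string starting with `prefix`, obtained by incrementing the last
    // character. (Assumes the last char is not the maximum char value.)
    static String prefixStopRow(String prefix) {
        char[] chars = prefix.toCharArray();
        chars[chars.length - 1]++;
        return new String(chars);
    }

    public static void main(String[] args) {
        // TreeSet stands in for HBase's lexicographically sorted row keys.
        TreeSet<String> rows = new TreeSet<>(Arrays.asList(
                "google|appliance|oven|aaa|1372837702753",
                "google|phone|nexus|bbb|1372837702754",
                "yahoo|appliance|fridge|ccc|1372837702755"));
        // All rows with source "google": the half-open range [start, stop).
        System.out.println(rows.subSet("google|", prefixStopRow("google|")));
        // → [google|appliance|oven|aaa|1372837702753, google|phone|nexus|bbb|1372837702754]
    }
}
```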

On Tue, Jul 2, 2013 at 7:53 PM, Ted Yu <yuzhihong@gmail.com> wrote:

> For #1, yes - the client receives less data after filtering.
>
> For #2, please take a look at TestMultiVersions
> (./src/test/java/org/apache/hadoop/hbase/TestMultiVersions.java in 0.94)
> for time range:
>
>     scan = new Scan();
>     scan.setTimeRange(1000L, Long.MAX_VALUE);
>
> For row key selection, you need a filter. Take a look at
> FuzzyRowFilter.java
>
> Cheers
>
> On Tue, Jul 2, 2013 at 10:35 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>
> > Thanks for the reply! I thus have two more questions:
> >
> > 1) Is it true that filtering on timestamps doesn't affect performance?
> > 2) Could you send me a little snippet of how you would do such a filter
> > (by row key + timestamp)? For example, get all rows whose key starts with
> > 'someid-' and whose timestamp is greater than some given timestamp?
> >
> > Best,
> > Flavio
> >
> >
> > On Tue, Jul 2, 2013 at 6:25 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >
> > > bq. Using timestamp in row-keys is discouraged
> > >
> > > The above is true.
> > > Prefixing row key with timestamp would create hot region.
> > >
> > > bq. should I filter by a simpler row-key plus a filter on timestamp?
> > >
> > > You can do the above.
> > >
> > > On Tue, Jul 2, 2013 at 9:13 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
> > >
> > > > Hi everybody,
> > > >
> > > > In my use case I have to perform batch analysis, skipping old data.
> > > > For example, I want to process all rows created after a certain
> > > > timestamp, passed as a parameter.
> > > >
> > > > What is the most effective way to do this?
> > > > Should I design my row key to embed the timestamp?
> > > > Or is filtering by the row's timestamp just as fast? Or what else?
> > > >
> > > > Initially I was thinking to compose my key as:
> > > > timestamp|source|title|type
> > > >
> > > > but:
> > > >
> > > > 1) Using a timestamp in row keys is discouraged
> > > > 2) If this design is ok, I still have problems filtering by
> > > > timestamp, because I cannot find a way to filter numerically
> > > > (instead of alphanumerically/by string). Example:
> > > > 1372776400441|something has a smaller timestamp than
> > > > 1372778470913|somethingelse, but I cannot filter all rows whose key
> > > > is "numerically" greater than 1372776400441. Is it possible to
> > > > overcome this issue?
> > > > 3) If this design is not ok, should I use a simpler row key plus a
> > > > filter on timestamp? Or what else?
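
On point 2: since epoch-millisecond timestamps are currently all 13 digits, lexicographic and numeric order coincide as long as the timestamp field has a fixed width (zero-padded if a value could ever be shorter). A plain-Java sketch of that idea, with a sorted set standing in for HBase's key ordering; note this only addresses the ordering problem, and, as the replies note, a leading timestamp still risks a hot region:

```java
import java.util.Arrays;
import java.util.TreeSet;

public class TimestampOrdering {

    // Zero-pads an epoch-millis timestamp to 13 digits so that
    // lexicographic (byte-wise) order matches numeric order.
    // (13 digits covers epoch millis from 2001 through 2286.)
    static String encodeTs(long ts) {
        return String.format("%013d", ts);
    }

    public static void main(String[] args) {
        // TreeSet stands in for HBase's lexicographically sorted row keys.
        TreeSet<String> rows = new TreeSet<>(Arrays.asList(
                encodeTs(1372776400441L) + "|something",
                encodeTs(1372778470913L) + "|somethingelse",
                encodeTs(999999999999L) + "|old"));
        // Every row at or after a given timestamp becomes a simple
        // lexicographic tail scan starting from the encoded timestamp.
        System.out.println(rows.tailSet(encodeTs(1372776400441L)));
        // → [1372776400441|something, 1372778470913|somethingelse]
    }
}
```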
> > > >
> > > > Best,
> > > > Flavio
> > > >
> > >
> >
>
