hbase-dev mailing list archives

From: tobe <tobeg3oo...@gmail.com>
Subject: Re: Should scan check the limitation of the number of versions?
Date: Tue, 26 Aug 2014 01:13:06 GMT
@lars I have set {KEEP_DELETED_CELLS => 'false'} for that table. I get the
same result until I manually run `flush`. You can try the commands I gave;
it reproduces 100% of the time.


On Tue, Aug 26, 2014 at 2:20 AM, lars hofhansl <larsh@apache.org> wrote:

> Queries over past time ranges only work correctly when KEEP_DELETED_CELLS is
> enabled for the column families involved.
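
A minimal sketch of enabling KEEP_DELETED_CELLS from the Java client at table
creation (a 1.x-era client API is assumed; the class name and the table/family
names are placeholders for illustration, not something from this thread):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.KeepDeletedCells;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class CreateTableKeepingDeletedCells {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                HTableDescriptor table = new HTableDescriptor(TableName.valueOf("table"));
                HColumnDescriptor cf = new HColumnDescriptor("cf");
                cf.setMaxVersions(1);
                // Keep deleted cells around so that queries over past time
                // ranges can still see them, per the note above.
                cf.setKeepDeletedCells(KeepDeletedCells.TRUE);
                table.addFamily(cf);
                admin.createTable(table);
            }
        }
    }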
>
>
> ________________________________
>  From: tobe <tobeg3oogle@gmail.com>
> To: hbase-dev <dev@hbase.apache.org>
> Cc: "user@hbase.apache.org" <user@hbase.apache.org>
> Sent: Monday, August 25, 2014 4:32 AM
> Subject: Re: Should scan check the limitation of the number of versions?
>
>
> I haven't read the code deeply, but I have an idea (not sure whether it's
> right or not). When we scan the columns, we skip the cells that don't match
> (for example, deleted ones). Could we use a counter to record how many
> versions of a column we have already seen? For each cell we add one, and once
> it reaches the configured maximum number of versions the remaining, older
> cells are skipped as well. But we would have to consider MVCC and other
> details, which seems more complex.
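
To make the counter idea concrete, here is a standalone, hypothetical sketch.
It is not the real scanner/ScanQueryMatcher code and it ignores deletes, TTL
and MVCC; the Cell class and all names are simplified stand-ins:

    import java.util.ArrayList;
    import java.util.List;

    public class VersionLimitSketch {

        static class Cell {
            final String qualifier;
            final long timestamp;
            final String value;
            Cell(String qualifier, long timestamp, String value) {
                this.qualifier = qualifier;
                this.timestamp = timestamp;
                this.value = value;
            }
        }

        // Cells are assumed to arrive the way HBase stores them: grouped by
        // column, newest timestamp first. Only the newest maxVersions cells of
        // each column are eligible; among those, only cells inside the
        // half-open range [minTs, maxTs) are returned.
        static List<Cell> get(List<Cell> cells, long minTs, long maxTs, int maxVersions) {
            List<Cell> out = new ArrayList<Cell>();
            String currentColumn = null;
            int seen = 0;                        // the counter proposed above
            for (Cell c : cells) {
                if (!c.qualifier.equals(currentColumn)) {
                    currentColumn = c.qualifier; // new column: reset the counter
                    seen = 0;
                }
                seen++;
                if (seen > maxVersions) {
                    // Beyond the VERSIONS limit: logically dead, even though a
                    // flush/compaction may not have removed it from disk yet.
                    continue;
                }
                if (c.timestamp >= minTs && c.timestamp < maxTs) {
                    out.add(c);
                }
            }
            return out;
        }
    }

With the reproduction below (two cells on cf:a at timestamps 200 and 100 and a
VERSIONS limit of 1), this returns nothing for the range [0, 150) both before
and after a flush, which is the consistent behaviour being asked for.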
>
> On Mon, Aug 25, 2014 at 5:54 PM, tobe <tobeg3oogle@gmail.com> wrote:
>
> > So far, I have found two problems related to this.
> >
> > Firstly, HBASE-11675 <https://issues.apache.org/jira/browse/HBASE-11675>.
> > It's a little tricky and rarely happens, but it requires users to be careful
> > about compaction, which occurs on the server side: they may get different
> > results before and after a major compaction.
> >
> > Secondly, if you put a value with timestamp 100 and then put another value
> > on the same column with timestamp 200, with the number of versions set to
> > 1, a plain get of this column returns the latest value, with timestamp 200,
> > which is right. But if I get with a time range from 0 to 150, I may get the
> > first value with timestamp 100 before compaction happens, and after
> > compaction happens I will never get that value again, even if I run the
> > same command.
> >
> > It's easy to reproduce; follow these steps:
> > hbase(main):001:0> create "table", "cf"
> > hbase(main):003:0> put "table", "row1", "cf:a", "value1", 100
> > hbase(main):003:0> put "table", "row1", "cf:a", "value1", 200
> > hbase(main):026:0> get "table", "row1", {TIMERANGE => [0, 150]}  # before flush
> >    row1      column=cf:a, timestamp=100, value=value1
> > hbase(main):060:0> flush "table"
> > hbase(main):082:0> get "table", "row1", {TIMERANGE => [0, 150]}  # after flush
> >    0 row(s) in 0.0050 seconds
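
The same reproduction through the Java client, as a hedged sketch (a 1.x-era
client API is assumed and the class name is a placeholder; the table is the
one created in the shell session above):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimeRangeRepro {
        public static void main(String[] args) throws Exception {
            byte[] row = Bytes.toBytes("row1");
            byte[] cf = Bytes.toBytes("cf");
            byte[] q = Bytes.toBytes("a");
            TableName name = TableName.valueOf("table");

            try (Connection conn =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(name);
                 Admin admin = conn.getAdmin()) {

                // Two puts on the same cell with explicit timestamps 100 and 200.
                Put p1 = new Put(row);
                p1.addColumn(cf, q, 100L, Bytes.toBytes("value1"));
                table.put(p1);
                Put p2 = new Put(row);
                p2.addColumn(cf, q, 200L, Bytes.toBytes("value1"));
                table.put(p2);

                // Time-range get over [0, 150): while both cells are still in
                // the memstore this returns the timestamp-100 cell.
                Get before = new Get(row);
                before.setTimeRange(0L, 150L);
                System.out.println("before flush: " + table.get(before));

                admin.flush(name);
                Thread.sleep(3000);   // crude wait for the asynchronous flush

                // The same get after the flush comes back empty, as in the
                // shell session above.
                Get after = new Get(row);
                after.setTimeRange(0L, 150L);
                System.out.println("after flush:  " + table.get(after));
            }
        }
    }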
> >
> > I think the reason is that we have three mechanisms that remove data:
> > deletes, TTL and versions. Whenever we get or scan data, we check the
> > delete markers and the TTL to make sure such cells are not returned to
> > users, but we never check the versions limit; the output relies on
> > compaction to clean up the out-of-date versions. Is it possible to add
> > this check within scan (get is implemented as a scan)?
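
A raw scan makes the last point visible from the client side: it returns every
cell still physically present, including delete markers and versions beyond the
VERSIONS limit. Run against the repro above, it should show the timestamp-100
cell before the flush and, if the flush drops the excess version as the repro
suggests, no longer show it afterwards. Again only an illustrative sketch with
an assumed 1.x-era client API, not a fix:

    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RawScanSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("table"))) {
                Scan scan = new Scan(Bytes.toBytes("row1"));
                scan.setRaw(true);     // return all physical cells, incl. delete markers
                scan.setMaxVersions(); // do not collapse to the newest version
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result r : scanner) {
                        for (Cell cell : r.rawCells()) {
                            System.out.println(cell.getTimestamp() + " -> "
                                    + Bytes.toString(CellUtil.cloneValue(cell)));
                        }
                    }
                }
            }
        }
    }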
> >
>
