Date: Tue, 17 Jan 2012 14:14:04 -0800
Subject: Re: Delete client API.
From: "M. C. Srivas"
To: lars hofhansl
Cc: "dev@hbase.apache.org"

On Tue, Jan 17, 2012 at 10:07 AM, lars hofhansl wrote:

> Yeah, it's confusing if one expects it to work like in a relational
> database.
> You can do even worse. If you accidentally place a delete in the future,
> all current inserts will be hidden until the next major compaction. :)
> I got confused about this myself just recently (see my mail on the
> dev-list).
>
> In the end this is a pretty powerful feature and core to how HBase works
> (not saying that it is not confusing, though).
>
> If one keeps the following two points in mind, it makes more sense:
> 1. Delete just sets a tombstone marker at a specific TS (marking
>    everything with that TS or older as deleted).
> 2. Everything is versioned; if no version is specified, the current time
>    (at the regionserver) is used.
>
> In your example1 below, t3 > 6, hence the insert is hidden.
> In example2, both the delete and insert TS are 6, hence the insert is hidden.

Let's consider my Example 2 a little longer. Sequence of events:

1. ins val1 with TS=6 set by client
2. del entire row at TS=6 set by client
3. ins val2 with TS=6 set by client
4. read row

The row returns nothing, even though the insert at step 3 happened after the
delete at step 2. (Step 2 masks even future inserts.)

Now, the same sequence with a compaction thrown in the middle:

1. ins val1 with TS=6 set by client
2. del entire row at TS=6 set by client
3. ---- table is compacted ----
4. ins val2 with TS=6 set by client
5. read row

The row returns val2.
(The delete at step 2 was lost in the compaction.)

So we get different results depending on whether an internal reorganization
(like a compaction) happened or not.

If we want both sequences to behave exactly the same, we first need to choose
what the proper (and deterministic) behavior is:

A. If we think the first sequence is correct, then the delete at step 2 needs
   to be preserved forever.

or,

B. If we think the second sequence is correct (i.e., a read always produces
   the same result independent of compaction), then each record needs a
   second "internal TS" field that lets the RS (region server) determine the
   real order of events, instead of relying on the client-settable TS field.

My opinion: we should do B. It is normal for someone to write code that says
"if old exists, delete it; add new". A subsequent read should always reliably
return "new". The current approach of relying on a client-settable TS field to
determine causal order results in quirky behavior, and quirky is not good.

> Look at these two examples:
>
> 1. insert Val1 at real time t1
> 2. delete at real time t2 > t1
> 3. insert Val2 at real time t3 > t2
>
> 1. insert Val1 with TS=1 at real time t1
> 2. delete with TS=2 at real time t2 > t1
> 3. insert Val2 with TS=3 at real time t3 > t2
>
> In both cases Val2 is visible.
>
> If your code sets its own timestamps, you'd better know what you're
> doing :)
>
> Note that my examples below are confusing even if you know how deletion in
> HBase works.
> You have to look at Delete.java to figure out what is happening.
> OK, since there were no objections in two days, I will commit my
> proposed change in HBASE-5205.
>
> -- Lars
>
> ________________________________
> From: M. C. Srivas
> To: dev@hbase.apache.org; lars hofhansl
> Sent: Tuesday, January 17, 2012 8:13 AM
> Subject: Re: Delete client API.
>
> Delete seems to be confusing in general.
> Here are some examples that make me scratch my head (the key is the same
> in all the examples):
>
> Example1:
> ----------------
> 1. insert Val3 with TS=3 at real time t1
> 2. insert Val5 with TS=5 at real time t2 > t1
> 3. delete at real time t3 > t2
> 4. insert Val6 with TS=6 at real time t4 > t3
>
> What does a read return? (I would expect Val6, since it was done last.)
> But depending on whether a compaction happened between steps 3 and 4,
> I get either Val6 or nothing.
>
> Example 2:
> -----------------
> 1. insert Val3 with TS=3 at real time t1
> 2. insert Val5 with TS=5 at real time t2 > t1
> 3. delete with TS=6 at real time t3 > t2
> 4. insert Val6 with TS=6 at real time t4 > t3
>
> Note the difference: this time, in step 3, a TS was specified by the
> client.
>
> What does a read return? Again, I expect Val6 to be returned. But
> depending on what's going on, I seem to get either Val5 or Val6.
>
> On Sun, Jan 15, 2012 at 7:21 PM, lars hofhansl wrote:
>
> > There are some confusing parts about the Delete client API:
> > 1. Calling deleteFamily removes all prior column or columns markers
> >    without checking the TS.
> > 2. delete{Column|Columns|Family} do not use the timestamp passed to
> >    Delete at construction time, but instead default to LATEST_TIMESTAMP.
> >
> >      Delete d = new Delete(R, T);
> >      d.deleteFamily(CF);
> >
> > does not do what you expect (it won't use T for the family delete, but
> > rather the current time).
> >
> > Neither does
> >
> >      d.deleteColumns(CF, C1, T2);
> >      d.deleteFamily(CF, T1); // T1 < T2
> >
> > (the columns marker will be removed).
> >
> > #1 prevents Delete from adding a family marker for F at time T1 and a
> > column/columns marker for columns of F at T2, even if T2 > T1.
> > #2 is just unexpected and different from what Put does.
> >
> > In HBASE-5205 I propose a simple patch to fix this.
> >
> > Since this is a (slight) API change, please provide feedback.
> >
> > Thanks.
> >
> > -- Lars
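The compaction-dependent result in the two sequences discussed in this thread (insert val1 at TS=6, delete the row at TS=6, insert val2 at TS=6, then read) can be reproduced with a toy model. This is a hypothetical, single-column simplification, not HBase code: ToyRow, DeleteOrderDemo, optionB and the seq field are invented for illustration, and seq merely stands in for the server-assigned "internal TS" proposed in option B. With optionB off, a tombstone masks every put whose TS is at or below its own, so the answer changes once a major compaction discards the tombstone. With optionB on, a tombstone masks only puts written before it, so compaction no longer changes what a read returns.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one cell's history. Hypothetical names; NOT HBase's implementation.
class ToyRow {
    // One stored record: either a put or a whole-row delete marker (tombstone).
    static final class Entry {
        final String value;      // null for tombstones
        final long ts;           // client-settable timestamp
        final long seq;          // server-assigned write order (stand-in for option B's "internal TS")
        final boolean tombstone;

        Entry(String value, long ts, long seq, boolean tombstone) {
            this.value = value;
            this.ts = ts;
            this.seq = seq;
            this.tombstone = tombstone;
        }
    }

    private final List<Entry> entries = new ArrayList<>();
    private final boolean optionB; // false: mask by TS only; true: also require earlier write order
    private long nextSeq = 1;

    ToyRow(boolean optionB) {
        this.optionB = optionB;
    }

    void put(String value, long ts) {
        entries.add(new Entry(value, ts, nextSeq++, false));
    }

    void deleteRow(long ts) {
        entries.add(new Entry(null, ts, nextSeq++, true));
    }

    // A tombstone at TS T masks every put with TS <= T. Under option B it
    // only masks such puts if they were written before the tombstone.
    private boolean isMasked(Entry put) {
        for (Entry e : entries) {
            if (!e.tombstone || put.ts > e.ts) continue;
            if (!optionB || put.seq < e.seq) return true;
        }
        return false;
    }

    // Returns the visible value with the highest TS (later write breaks ties), or null.
    String read() {
        Entry best = null;
        for (Entry e : entries) {
            if (e.tombstone || isMasked(e)) continue;
            if (best == null || e.ts > best.ts || (e.ts == best.ts && e.seq > best.seq)) {
                best = e;
            }
        }
        return best == null ? null : best.value;
    }

    // Major compaction: keep only unmasked puts; tombstones are discarded.
    void majorCompact() {
        List<Entry> kept = new ArrayList<>();
        for (Entry e : entries) {
            if (!e.tombstone && !isMasked(e)) kept.add(e);
        }
        entries.clear();
        entries.addAll(kept);
    }
}

class DeleteOrderDemo {
    // The restated Example 2 sequence, optionally with a major compaction after the delete.
    static String run(boolean optionB, boolean compact) {
        ToyRow row = new ToyRow(optionB);
        row.put("val1", 6);
        row.deleteRow(6);
        if (compact) row.majorCompact();
        row.put("val2", 6);
        return row.read();
    }

    public static void main(String[] args) {
        System.out.println("TS only,  no compaction:   " + run(false, false));
        System.out.println("TS only,  with compaction: " + run(false, true));
        System.out.println("option B, no compaction:   " + run(true, false));
        System.out.println("option B, with compaction: " + run(true, true));
    }
}
```

Running main prints null for the TS-only model without compaction and val2 in the other three cases, mirroring the "nothing vs. val2" discrepancy described above and showing that, in this toy model, an order field the client cannot set makes the read independent of compaction.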