Subject: Re: Secondary indexes suggestions
From: Andrew Purtell
To: user@hbase.apache.org, lars hofhansl
Date: Mon, 13 Aug 2012 20:11:44 -0700

Please pardon me while I ramble; this started off as a short response and is now... lengthy.
I've also seen Megastore-inspired secondary index implementations that clone the data from the primary table into the secondary table, sorted by the indexed attribute. In Megastore this was configurable on a per-index-table basis: "Accessing entity data through indexes is normally a two-step process: first the index is read to find matching primary keys, then these keys are used to fetch entities. We provide a way to denormalize portions of entity data directly into index entries. By adding the STORING clause to an index [...]"

A naive implementation of this for HBase will require consistency checking of the index table(s), because the denormalized data can easily become stale if a client (or coprocessor) fails mid-write, or if the index update is significantly delayed relative to the primary table update. A non-naive implementation will need some Paxos-ish commit protocol that is difficult to implement correctly and doubly difficult to make perform well. Without that extra layer, the index is assured to always be slightly out of date. The lag can increase substantially if index region(s) are in transition when the primary table write happens, forcing the client (or coprocessor) to wait before updating the index. You could also do this in reverse and update the index table first. Either the client would have to do this, as Lars says, or a background MapReduce-based process might be employed, or both.

Without denormalization you have the possibility of dangling pointers in the index tables, or data in the primary table that is not fully indexed. These cases would have to be found and fixed; the secondary index could always be in some slight state of disrepair.
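To make the dual-write pattern and its failure mode concrete, here is a minimal sketch. This is not HBase API: TreeMaps stand in for tables (sorted by row key), and all names are illustrative. The index row key is the indexed value plus the primary key, which is what gives you range scans over the attribute.

```java
import java.util.TreeMap;

// In-memory sketch of the naive dual-write secondary index pattern.
// TreeMap stands in for an HBase table (sorted by row key).
public class NaiveSecondaryIndex {
    // primary row key -> indexed attribute value
    static final TreeMap<String, String> primary = new TreeMap<>();
    // index row key ("<value>\0<primaryKey>") -> denormalized data
    static final TreeMap<String, String> index = new TreeMap<>();

    static String indexKey(String value, String primaryKey) {
        // Sorting the index table by attribute value first allows range
        // scans over the indexed attribute; the primary key suffix keeps
        // index rows unique when values collide.
        return value + "\0" + primaryKey;
    }

    static void put(String primaryKey, String value) {
        primary.put(primaryKey, value);
        // A client crash between these two writes leaves the index stale:
        // the primary row exists but no index entry points at it.
        index.put(indexKey(value, primaryKey), value); // STORING-style denormalization
    }

    public static void main(String[] args) {
        put("user42", "smith");
        put("user7", "jones");
        // Look up primary keys whose attribute value is "smith" with a
        // prefix scan over the index table.
        for (String k : index.subMap("smith\0", "smith\1").keySet()) {
            System.out.println(k.split("\0")[1]); // prints user42
        }
    }
}
```

In real HBase the two puts go to different tables and possibly different RegionServers, which is exactly why the consistency checking discussed above becomes necessary.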
CCIndex (https://github.com/Jia-Liu/CCIndex) was a scheme along these lines. It also reduced the replication factor of the denormalized index tables to soften the storage impact of the data duplication, and patched HBase core to regenerate an index table from the primary table if one of the index HFiles became corrupt. That is a questionable idea in my opinion, but it does raise the interesting question of whether HBase should support trapping HFile IOEs so this sort of thing could be built as a coprocessor.

A secondary indexing coprocessor could force the colocation of regions of a primary table with the regions of the index tables that map back to them. Cross-region transactions are possible within a single RegionServer. A MasterObserver could control region placement. A RegionObserver on the region of the primary table could transact with those on the regions of the index tables. A WALObserver could group the update to the primary table and indexes into a single WAL entry. Should the RegionServer crash mid-transaction, all updates would be replayed from the WAL, maintaining at all times the consistency of the index(es) with respect to the primary table.

But I see a number of challenges with this. Foremost, your availability concerns are no longer limited to the regions of the primary table possibly being in transition: updates to the primary table would need to block until all relevant index regions are migrated over to where the primary region is resident. It may be worth trying to do something like this, but evicting regions to make room for colocating index table regions with primary table regions could get out of hand. After a couple of RegionServers fail, perhaps in quick succession, would the cluster converge back to full availability? That would have to be extensively tested.

The above is a fair amount of (over)engineering for the case where the client should be oblivious to how secondary indexing is done on the cluster.
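The single-WAL-entry idea can be illustrated with a toy simulation. Again, this is not HBase code; the classes and names are made up for the sketch. The point is only that if the primary and index mutations are logged together as one entry before being applied, a crash mid-apply loses nothing that replay cannot restore, and replay is safe because the index updates are idempotent (a point Lars also makes in the quoted reply below).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy simulation of grouping primary + index mutations into one WAL entry.
// Illustrative only; not HBase internals.
public class GroupedWalSketch {
    record Mutation(String table, String key, String value) {}

    // One WAL entry per transaction: the primary put and its index puts together.
    static final List<List<Mutation>> wal = new ArrayList<>();
    static final Map<String, Map<String, String>> tables = new HashMap<>();

    static void logAndApply(List<Mutation> txn, boolean crashMidApply) {
        wal.add(txn);                             // durable log first
        for (int i = 0; i < txn.size(); i++) {
            if (crashMidApply && i == 1) return;  // simulate RegionServer crash
            apply(txn.get(i));
        }
    }

    static void apply(Mutation m) {
        tables.computeIfAbsent(m.table(), t -> new HashMap<>()).put(m.key(), m.value());
    }

    // Crash recovery: reapply every logged mutation. Safe because a put of
    // the same key/value is idempotent.
    static void replay() {
        for (List<Mutation> txn : wal)
            txn.forEach(GroupedWalSketch::apply);
    }
}
```

After a simulated crash that applied the primary write but not the index write, calling replay() brings the index back in step with the primary table, which is the consistency property the WALObserver grouping is meant to buy.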
If that is not a design constraint, then HBase 0.94+ has limited cross-row atomicity within a single region. So if you are able to construct primary record keys and index record keys such that they will all fall within the keyspace of a single region, then this can be done today: the client can send them up as a group packed into a single RPC and be assured of server-side atomic commit. However, doing such keyspace engineering while also aiming for efficient queries could be a big challenge.

On Mon, Aug 13, 2012 at 5:42 PM, lars hofhansl wrote:
> Secondary indexes are only simple when you ignore concurrent updates and failing clients.
> A client could manage to write the index first and then fail in the main row (that can be handled by always rechecking the main row and always scanning all versions of the index rows, which is hard/expensive in a scan).
> You can also have a WAL, which you check upon each read and reapply all outstanding changes. (2ndary index updates are nice in that they are idempotent.)
>
> Similarly there are other scenarios that make this hard, and that is the reason why HBase doesn't have them.
> We've been thinking about primitives to add to HBase to make building/using of 2ndary indexes easier/feasible.
>
> Should indexes be global (i.e. it is up to a client or coprocessor to gather the matches and requery the actual rows)? Or local (which means a query needs to farm out many queries in parallel to all index sites)?
> Both have pros and cons.
>
> I think the key of the Fuzzy filter is that it can actually seek ahead (using the HBase Filter seek hints), which has the potential to be far more efficient than a full scan.
> In fact, local indexes would probably be implemented that way: you always scan the main table and use the index information to seek ahead.
>
> Just my $0.02, though. :)
>
> -- Lars

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)