Subject: Re: Secondary indexes suggestions
From: Andrew Purtell
To: user@hbase.apache.org, lars hofhansl
Date: Mon, 13 Aug 2012 20:11:44 -0700

Please pardon me while I ramble; this started off as a short response and is now... lengthy.
I've also seen Megastore-inspired secondary index implementations that clone the data from the primary table into the secondary table, sorted by the indexed attribute. In Megastore this was configurable on a per-index-table basis: "Accessing entity data through indexes is normally a two-step process: first the index is read to find matching primary keys, then these keys are used to fetch entities. We provide a way to denormalize portions of entity data directly into index entries. By adding the STORING clause to an index [...]"

A naive implementation of this for HBase will require consistency checking of the index table(s), because the denormalized data can easily become stale if a client (or coprocessor) fails mid-write, or if the index update is significantly delayed relative to the primary table update. A non-naive implementation will need some Paxos-ish commit protocol that is difficult to implement correctly and doubly difficult to make perform well. Without that extra layer, the index is assured to always be slightly out of date. The lag can increase substantially if index region(s) are in transition when the primary table write happens, forcing the client (or coprocessor) to wait before updating the index. You could also do this in reverse and update the index table first. Either the client would have to do this, as Lars says, or a background MapReduce-based process might be employed, or both.

Without denormalization you have the possibility of dangling pointers in the index tables, or data in the primary table that is not fully indexed. These cases would have to be found and fixed; the secondary index could always be in some slight state of disrepair.
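To make the dual-write pattern and its failure mode concrete, here is a minimal sketch. This is not HBase API: TreeMaps stand in for tables (sorted by row key), and all names are illustrative. The index row key is the indexed value plus the primary key, which is what gives you range scans over the attribute.

```java
import java.util.TreeMap;

// In-memory sketch of the naive dual-write secondary index pattern.
// TreeMap stands in for an HBase table (sorted by row key).
public class NaiveSecondaryIndex {
    // primary row key -> indexed attribute value
    static final TreeMap<String, String> primary = new TreeMap<>();
    // index row key ("<value>\0<primaryKey>") -> denormalized data
    static final TreeMap<String, String> index = new TreeMap<>();

    static String indexKey(String value, String primaryKey) {
        // Sorting the index table by attribute value first allows range
        // scans over the indexed attribute; the primary key suffix keeps
        // index rows unique when values collide.
        return value + "\0" + primaryKey;
    }

    static void put(String primaryKey, String value) {
        primary.put(primaryKey, value);
        // A client crash between these two writes leaves the index stale:
        // the primary row exists but no index entry points at it.
        index.put(indexKey(value, primaryKey), value); // STORING-style denormalization
    }

    public static void main(String[] args) {
        put("user42", "smith");
        put("user7", "jones");
        // Look up primary keys whose attribute value is "smith" with a
        // prefix scan over the index table.
        for (String k : index.subMap("smith\0", "smith\1").keySet()) {
            System.out.println(k.split("\0")[1]); // prints user42
        }
    }
}
```

In real HBase the two puts go to different tables and possibly different RegionServers, which is exactly why the consistency checking discussed above becomes necessary.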
CCIndex (https://github.com/Jia-Liu/CCIndex) was a scheme along these lines. It also reduced the replication factor of the denormalized index tables to soften the storage impact of the data duplication, and patched HBase core to regenerate an index table from the primary table if one of the index HFiles became corrupt. That is a questionable idea in my opinion, but it does raise the interesting question of whether HBase should support trapping HFile IOEs so this sort of thing could be built as a coprocessor.

A secondary indexing coprocessor could force the colocation of regions of a primary table with the regions of the index tables that map back to them. Cross-region transactions are possible within a single RegionServer. A MasterObserver could control region placement. A RegionObserver on the region of the primary table could transact with those on the regions of the index tables. A WALObserver could group the update to the primary table and indexes into a single WAL entry. Should the RegionServer crash mid-transaction, all updates would be replayed from the WAL, maintaining at all times the consistency of the index(es) with respect to the primary table.

But I see a number of challenges with this. Foremost, your availability concerns are no longer limited to the regions of the primary table possibly being in transition: updates to the primary table would need to block until all relevant index regions are migrated over to where the primary region is resident. It may be worth trying to do something like this, but evicting regions to make room for colocating index table regions with primary table regions could get out of hand. After a couple of RegionServers fail, perhaps in quick succession, would the cluster converge back to full availability? That would have to be extensively tested.

The above is a fair amount of (over)engineering for the case where the client should be oblivious to how secondary indexing is done on the cluster.
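The single-WAL-entry idea can be illustrated with a toy simulation. Again, this is not HBase code; the classes and names are made up for the sketch. The point is only that if the primary and index mutations are logged together as one entry before being applied, a crash mid-apply loses nothing that replay cannot restore, and replay is safe because the index updates are idempotent (a point Lars also makes in the quoted reply below).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy simulation of grouping primary + index mutations into one WAL entry.
// Illustrative only; not HBase internals.
public class GroupedWalSketch {
    record Mutation(String table, String key, String value) {}

    // One WAL entry per transaction: the primary put and its index puts together.
    static final List<List<Mutation>> wal = new ArrayList<>();
    static final Map<String, Map<String, String>> tables = new HashMap<>();

    static void logAndApply(List<Mutation> txn, boolean crashMidApply) {
        wal.add(txn);                             // durable log first
        for (int i = 0; i < txn.size(); i++) {
            if (crashMidApply && i == 1) return;  // simulate RegionServer crash
            apply(txn.get(i));
        }
    }

    static void apply(Mutation m) {
        tables.computeIfAbsent(m.table(), t -> new HashMap<>()).put(m.key(), m.value());
    }

    // Crash recovery: reapply every logged mutation. Safe because a put of
    // the same key/value is idempotent.
    static void replay() {
        for (List<Mutation> txn : wal)
            txn.forEach(GroupedWalSketch::apply);
    }
}
```

After a simulated crash that applied the primary write but not the index write, calling replay() brings the index back in step with the primary table, which is the consistency property the WALObserver grouping is meant to buy.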
If that is not a design constraint, then HBase 0.94+ has limited cross-row atomicity within a single region. So if you are able to construct primary record keys and index record keys such that they will all fall within the keyspace of a single region, then this can be done today: the client can send them up as a group packed into a single RPC and be assured of server-side atomic commit. However, doing such keyspace engineering while also aiming for efficient queries could be a big challenge.

On Mon, Aug 13, 2012 at 5:42 PM, lars hofhansl wrote:
> Secondary indexes are only simple when you ignore concurrent updates and failing clients.
> A client could manage to write the index first and then fail in the main row (that can be handled by always rechecking the main row and always scanning all versions of the index rows, which is hard/expensive in a scan).
> You can also have a WAL, which you check upon each read and reapply all outstanding changes. (2ndary index updates are nice in that they are idempotent.)
>
> Similarly there are other scenarios that make this hard, and that is the reason why HBase doesn't have them.
> We've been thinking about primitives to add to HBase to make building/using of 2ndary indexes easier/feasible.
>
> Should indexes be global (i.e. it is up to a client or coprocessor to gather the matches and requery the actual rows)? Or local (which means a query needs to farm out many queries in parallel to all index sites)?
> Both have pros and cons.
>
> I think the key of the Fuzzy filter is that it can actually seek ahead (using the HBase Filter seek hints), which has the potential to be far more efficient than a full scan.
> In fact, local indexes would probably be implemented that way: you always scan the main table and use the index information to seek ahead.
>
> Just my $0.02, though. :)
>
> -- Lars

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)