phoenix-dev mailing list archives

From James Taylor <>
Subject Re: Secondary Index internals
Date Tue, 05 Jan 2016 03:00:17 GMT
Hi Nick,
Happy New Year to you too!

I'd be happy to give you an overview of secondary indexes. When you declare
a table with IMMUTABLE_ROWS=true, you're telling Phoenix that you have
write-once/append-only data (i.e. a row will never be updated in-place).
This makes the incremental maintenance of a secondary index much easier
because you simply need to generate the correct index row based on your
data row (note that with Phoenix, index table rows are 1:1 with data table
rows and index tables are just regular old HBase tables with a different
row key structure). This logic is encapsulated in
With mutable tables you need to do this too, but you also need to delete
the index row for the current value, which complicates things.
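To make that 1:1 mapping concrete, here's a rough sketch in plain Java (not Phoenix's actual code; the class and method names are invented) of deriving an index row key from an immutable data row:

```java
// Hypothetical sketch: with IMMUTABLE_ROWS=true, each data row
// deterministically yields exactly one index row, so incremental
// maintenance is just "generate the index row and Put it".
public class ImmutableIndexSketch {
    // Index row key = indexed column value, then the data row's pk.
    // The pk suffix is what keeps index rows 1:1 with data rows.
    static byte[] indexRowKey(byte[] indexedValue, byte[] dataPk) {
        byte[] key = new byte[indexedValue.length + dataPk.length];
        System.arraycopy(indexedValue, 0, key, 0, indexedValue.length);
        System.arraycopy(dataPk, 0, key, indexedValue.length, dataPk.length);
        return key;
    }

    public static void main(String[] args) {
        byte[] key = indexRowKey("v1".getBytes(), "row7".getBytes());
        // The same data row always produces the same index row key.
        System.out.println(new String(key)); // prints "v1row7"
    }
}
```

Because the derivation is a pure function of the data row, re-generating the index row is always safe, which is what makes client-side maintenance viable.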

Because indexes over immutable data don't require additional processing
during incremental maintenance, we can maintain the index on the client
side by simply generating the index row when the data row is upserted (see
the addRowMutations method). The batch of Puts for
the index updates gets sent over after the batch of Puts for the data table
(see the loop in the send method). There's an optimization for local indexes,
however. Since we know that the index data resides on the same region
server as the data table, we can generate the index row through a
coprocessor on the server side instead of sending it over the wire.

Deletes for write-once data are handled differently, by generating the
equivalent SQL statement against the index table that was run against the
data table. This is done because we don't necessarily have
the information required on the client side when we issue a delete of a
data row. For example, if a column represented by a KeyValue in HBase is
indexed, when the row is deleted, we don't know what the value of the
KeyValue was.
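A hedged sketch of that idea (simplified string rewriting, not Phoenix's actual implementation; the table names are invented): since the client doesn't know the indexed value, it can't compute the index row key to delete, but it can run the logically equivalent statement against the index table, whose row key carries the data pk columns.

```java
// Hypothetical sketch: translate a DELETE against the data table into
// the equivalent DELETE against the index table. Phoenix does something
// far more principled; this just illustrates the "equivalent SQL" idea.
public class DeleteRewriteSketch {
    // Naive rewrite: point the same statement at the index table. The
    // index row key includes the data pk, so the predicate still applies.
    static String rewriteForIndex(String deleteSql, String dataTable, String indexTable) {
        return deleteSql.replace(dataTable, indexTable);
    }

    public static void main(String[] args) {
        String sql = "DELETE FROM my_table WHERE pk = 'row7'";
        System.out.println(rewriteForIndex(sql, "my_table", "my_index"));
        // prints "DELETE FROM my_index WHERE pk = 'row7'"
    }
}
```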

In terms of data formats, the index row key is defined by the columns
you're indexing. We also tack on any pk column values from the data table
that aren't indexed (that's how we ensure the 1:1 relationship between data
rows and index rows). We have to translate the data type of the indexed
data table column sometimes (see IndexUtil.getIndexColumnDataType()). The
reason is that for fixed width types (such as INTEGER and BIGINT), we don't
have a representation for null. We need to be able to represent null
because in the index, the column value appears in the row key and we
need null values to appear before any other value. The way we do this is by
translating from INTEGER/BIGINT to DECIMAL for the index table. You can see
this if you examine the schema of an index table and compare it against the
schema of a data table (you can do this easily in SQuirrel).
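A toy illustration of why the fixed-width type has to be translated (this is not Phoenix's DECIMAL encoding; the variable-width scheme below is invented to show the ordering property): a fixed 4-byte INTEGER has no spare encoding left over for null, whereas a variable-width encoding can reserve the empty byte array for null, which lexicographically sorts before everything else.

```java
// Hypothetical sketch: a variable-width encoding for nullable integers
// where null sorts first in unsigned byte order, as an index row key needs.
public class NullOrderingSketch {
    // null -> empty byte array; otherwise a marker byte followed by the
    // value with its sign bit flipped so signed order matches byte order.
    static byte[] encode(Integer v) {
        if (v == null) return new byte[0];
        int u = v ^ 0x80000000; // flip sign bit for correct byte ordering
        return new byte[] { 1,
            (byte) (u >>> 24), (byte) (u >>> 16), (byte) (u >>> 8), (byte) u };
    }

    // Unsigned lexicographic comparison, i.e. HBase row key order.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (c != 0) return c;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // null sorts before any value, and numeric order is preserved.
        System.out.println(compare(encode(null), encode(Integer.MIN_VALUE)) < 0); // true
        System.out.println(compare(encode(-5), encode(3)) < 0);                   // true
    }
}
```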

Failure scenarios are interesting to think about. For example, what happens
if a RS hosting an index region goes down right after you've updated the
data region? For immutable indexes, you're basically SOL (though the story
improves with transactions - more on that later). At this point, your index
and data table are out-of-sync. Since your data is immutable, you can
safely replay your commit until it succeeds. If you wanted to, you could
even force the upper bound of your queries to be "frozen" at the timestamp
at which you know the index is valid. Another option would be to mark the
index as disabled (so it's no longer used), and rebuild it in the
background. These are all techniques an application can use to deal
with failures, but Phoenix doesn't take any action (other than throwing
whatever exception occurred at write time so the client knows about it).
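The "safely replay your commit" option can be sketched as a simple application-side retry loop (invented names, not a Phoenix API): because immutable rows are write-once, re-sending the same batch is idempotent, so retrying until success cannot corrupt anything.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of replaying a commit until both the data and
// index writes succeed, as an application might do after a write failure.
public class ReplaySketch {
    static void commitWithRetry(BooleanSupplier commit, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (commit.getAsBoolean()) return;   // both batches landed
            try { Thread.sleep(50L * attempt); } // simple linear backoff
            catch (InterruptedException e) { Thread.currentThread().interrupt(); return; }
        }
        // Out of retries: index may be out of sync; the application could
        // now disable the index and rebuild it in the background instead.
        throw new IllegalStateException("commit failed after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        int[] tries = {0};
        // Simulated commit that fails twice (e.g. index RS down) then succeeds.
        commitWithRetry(() -> ++tries[0] >= 3, 5);
        System.out.println(tries[0]); // prints 3
    }
}
```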

For mutable indexes, we do a bit more, but it's far from perfect as well.
By default, if a write failure occurs, we mark the index as disabled and
have a background task that tries to catch the index up once the region can
be reached again (followed by enabling the index again). We also write the
index updates with the data updates in the WAL, so if the RS goes down,
both the data updates and index updates get reapplied when it comes back up.
There's also some work underway (PHOENIX-2221) to fail writes to data
tables if the index table cannot be written to.

Another consideration, depending on your use case, is write skew. It's
possible that a batch of mutations may hit the data table before they hit
the index table (based on how the updates occur as described above). Until
both batches are processed, the index and data table are not *exactly* in sync.

With support for transactions, secondary index updates can be transactional
wrt data updates, in which case there's no scenario where your data table
and index table get out of sync (including the write skew case). For
write-once/append-only data, this works well because there's not a lot of
overhead for transactions (since by definition, no write conflicts are
possible, so no conflict detection is necessary).

HTH. Please let me know if you'd like more info as you get into the code.


On Mon, Jan 4, 2016 at 9:43 AM, Nick Dimiduk <> wrote:

> Happy New Year everyone!
> I'd like to come up to speed on our secondary index implementations. I've
> combed through the doc page [0] and experimented with tweaking index
> definitions and examining changes to the explain plan output. Now I'd like
> to get deeper into index formats, maintenance strategies, &c. I'm mostly
> interested in the IMMUTABLE_ROWS=true details. Do any of you have a doc,
> deck, or video I can watch to go the next step?
> Thanks a lot,
> -n
> [0]:
