accumulo-user mailing list archives

From Adam Fuchs <afu...@apache.org>
Subject Re: Scans during Compaction
Date Mon, 23 Feb 2015 17:48:43 GMT
Dylan,

The effect of a major compaction is never seen in queries before the major
compaction completes. At the end of the major compaction there is a
multi-phase commit which eventually replaces all of the old files with the
new file. At that point the major compaction will have completely processed
the given tablet's data (although other tablets may not be synchronized).
For long-running non-isolated queries (more than a second or so) the
iterator tree is occasionally rebuilt and re-seeked. When it is rebuilt it
will use whatever is the latest file set, which will include the results of
a completed major compaction.
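
As a rough illustration of the visibility rule described above, here is a toy model in plain Java. It is not Accumulo code: the class `FileSetSwapModel`, its methods, and the file names are all hypothetical, and the multi-phase commit is collapsed into a single atomic swap. It only shows that a scan keeps reading the file set it captured until its iterator tree is rebuilt.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Toy model of one tablet's file set. Hypothetical; not Accumulo's implementation. */
final class FileSetSwapModel {
    // Current file set of the tablet; replaced as a whole when a compaction commits.
    private final AtomicReference<List<String>> files;

    FileSetSwapModel(List<String> initialFiles) {
        files = new AtomicReference<>(
                Collections.unmodifiableList(new ArrayList<>(initialFiles)));
    }

    /** Stand-in for the commit at the end of a major compaction: old files out, new file in. */
    void commitMajorCompaction(String compactedFile) {
        files.set(Collections.singletonList(compactedFile));
    }

    /** Starts a scan, which captures the file set that is current right now. */
    Scan startScan() {
        return new Scan();
    }

    /** A scan reads from the file set it captured when its iterator tree was last built. */
    final class Scan {
        private List<String> snapshot = files.get();

        List<String> visibleFiles() {
            return snapshot;
        }

        /** Models the occasional rebuild and re-seek: pick up the latest file set. */
        void rebuild() {
            snapshot = files.get();
        }
    }
}
```

In this model a scan started before `commitMajorCompaction` still reports the old files afterwards, and reports only the compacted file once `rebuild()` has been called.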

In your case #1 that's a tricky guarantee to make across a whole tablet,
but it can be made one row at a time by using an isolated iterator.

To make your case #2 work, you probably will have to implement some
higher-level logic to only start your query after the major compaction has
completed, using an external mechanism to track the completion of your
transformation.
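
One possible shape for that higher-level logic, sketched with only the Java standard library. The class `CompactThenQuery` and the method `queryAfter` are hypothetical names. The `Runnable` stands in for whatever call blocks until the compaction is done (for example the blocking form of compact() mentioned in the original question), and the `Supplier` stands in for the scan.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Hypothetical helper: run the query only after the compaction step has returned. */
final class CompactThenQuery {
    private CompactThenQuery() {
    }

    /**
     * Runs blockingCompaction on a background thread, then runs query.
     * If the compaction step throws, the returned future completes
     * exceptionally and the query is never started.
     */
    static <T> CompletableFuture<T> queryAfter(Runnable blockingCompaction,
                                               Supplier<T> query) {
        return CompletableFuture
                .runAsync(blockingCompaction)
                .thenApply(ignored -> query.get());
    }
}
```

Because the caller gets a future back, it is free to do other work and read the scan results later. This gives the "only see transformed entries" guarantee at the cost of waiting for the whole compaction, so it does not address the low-latency streaming behaviour asked about in case #2.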

Adam


On Mon, Feb 23, 2015 at 12:35 PM, Dylan Hutchison <dhutchis@stevens.edu>
wrote:

> Hello all,
>
> When I initiate a full major compaction (with flushing turned on) manually via
> the Accumulo API
> <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/client/admin/TableOperations.html#compact(java.lang.String,%20org.apache.hadoop.io.Text,%20org.apache.hadoop.io.Text,%20java.util.List,%20boolean,%20boolean)>,
> how does the table appear to
>
>    1. clients that started scanning the table before the major compaction
>    began;
>    2. clients that start scanning during the major compaction?
>
> I'm interested in the case where there is an iterator attached to the full
> major compaction that modifies entries (respecting sorted order of entries).
>
> The best possible answer for my use case, with case #2 more important than
> case #1 and *low latency* more important than high throughput, is that
>
>    1. clients that started scanning before the compaction began would not
>    see entries altered by the compaction-time iterator;
>    2. clients that start scanning during the major compaction stream back
>    entries as they finish processing from the major compaction, such that the
>    clients *only* see entries that have passed through the
>    compaction-time iterator.
>
> How accurate are these descriptions?  If #2 really were as I would like it
> to be, then a scan on the range (-inf,+inf) started after compaction would
> "monitor compaction progress," such that the first entry batch transmits to
> the scanner as soon as it is available from the major compaction, and the
> scanner finishes (receives all entries) exactly when the compaction
> finishes.  If this is not possible, I may make something to that effect by
> calling the blocking version of compact().
>
> Bonus: how does cancelCompaction()
> <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/client/admin/TableOperations.html#cancelCompaction(java.lang.String)>
> affect clients scanning in case #1 and case #2?
>
> Regards,
> Dylan Hutchison
>
