accumulo-notifications mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dylan Hutchison (JIRA)" <>
Subject [jira] [Created] (ACCUMULO-3978) Enable safe and efficient Scans within tserver iterators
Date Fri, 28 Aug 2015 01:07:45 GMT
Dylan Hutchison created ACCUMULO-3978:

             Summary: Enable safe and efficient Scans within tserver iterators
                 Key: ACCUMULO-3978
             Project: Accumulo
          Issue Type: New Feature
            Reporter: Dylan Hutchison

Operations such as joining two tables require scanning more than one table.  Naive solutions
use a client to scan two tables from Accumulo and write the result back to Accumulo.  Performing
the join inside Accumulo eliminates needless network traffic and gains significant performance.

One way to do a join inside Accumulo is to scan the first table with a scan-time iterator,
which opens another Scanner to the second table.  The approach works well most of the time.
 In rare cases when tablet servers are constrained for resources and clients aggressively
use many scan threads, scans-within-scans lead to deadlock.  See ACCUMULO-3975.

We can gain both stability (eliminating rare but possible deadlock) and performance by supporting
scans-within-scans more directly than opening up a new Scanner.  For the sake of argument,
suppose we create a new method on the {{IteratorEnvironment}} interface called {{getInstanceScanner}}
(and respectively BatchScanner), which returns a Scanner implementation connected to the scanned
Instance with the same user and authorizations as the user that initiated the original scan.

This new Scanner can now take advantage of the fact that it is already inside a tablet server.
 If it is used to scan a tablet that is in the same tablet server, then it can open the appropriate
RFiles and in-memory maps and read them by direct method call, rather than go through a Thrift
serialize-transmit-deserialize path from one port of the tablet server to another.  If it
is used to scan a tablet in another tablet server, then it can use Thrift as usual.

The above scheme always avoids deadlock as long as a user restricts himself to scans-within-scans.
 To prevent scans-within-scans-within-scans from deadlocking, we could store the original
scan ID with scans, or spawn a new thread in a tablet server whenever it goes into a waiting
state as the result of opening a new Scanner and waiting to receive entries.  Or maybe some
re-entrant work could do the trick.  Or we don't support it at all.

Further optimizations and ideas are possible.  For example we may use scans from one tablet
server to another as an opportunity to replicate data between tablet server.  Let's use good
design to choose a simple and reliable idea.

Remember that if the scan-inside-a-scan is to the same table as the original scan, an SKVI
{{deepCopy}} may solve the problem.  There may be some mechanisms in deepCopy we can reuse
or adapt.

This message was sent by Atlassian JIRA

View raw message