accumulo-notifications mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dylan Hutchison (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (ACCUMULO-3978) Enable safe and efficient Scans within tserver iterators
Date Fri, 28 Aug 2015 01:11:46 GMT

     [ https://issues.apache.org/jira/browse/ACCUMULO-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Dylan Hutchison updated ACCUMULO-3978:
--------------------------------------
    Component/s: tserver

> Enable safe and efficient Scans within tserver iterators
> --------------------------------------------------------
>
>                 Key: ACCUMULO-3978
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3978
>             Project: Accumulo
>          Issue Type: New Feature
>          Components: tserver
>            Reporter: Dylan Hutchison
>
> Operations such as joining two tables require scanning more than one table.  Naive solutions
use a client to scan two tables from Accumulo and write the result back to Accumulo.  Performing
the join inside Accumulo eliminates needless network traffic and gains significant performance.
 
> One way to do a join inside Accumulo is to scan the first table with a scan-time iterator,
which opens another Scanner to the second table.  The approach works well most of the time.
 In rare cases when tablet servers are constrained for resources and clients aggressively
use many scan threads, scans-within-scans lead to deadlock.  See ACCUMULO-3975.
> We can gain both stability (eliminating rare but possible deadlock) and performance by
supporting scans-within-scans more directly than opening up a new Scanner.  For the sake of
argument, suppose we create a new method on the {{IteratorEnvironment}} interface called {{getInstanceScanner}}
(and respectively BatchScanner), which returns a Scanner implementation connected to the scanned
Instance with the same user and authorizations as the user that initiated the original scan.
 
> This new Scanner can now take advantage of the fact that it is already inside a tablet
server.  If it is used to scan a tablet that is in the same tablet server, then it can open
the appropriate RFiles and in-memory maps and read them by direct method call, rather than
go through a Thrift serialize-transmit-deserialize path from one port of the tablet server
to another.  If it is used to scan a tablet in another tablet server, then it can use Thrift
as usual.
> The above scheme always avoids deadlock as long as a user restricts himself to scans-within-scans.
 To prevent scans-within-scans-within-scans from deadlocking, we could store the original
scan ID with scans, or spawn a new thread in a tablet server whenever it goes into a waiting
state as the result of opening a new Scanner and waiting to receive entries.  Or maybe some
re-entrant work could do the trick.  Or we don't support it at all.
> Further optimizations and ideas are possible.  For example we may use scans from one
tablet server to another as an opportunity to replicate data between tablet server.  Let's
use good design to choose a simple and reliable idea.
> Remember that if the scan-inside-a-scan is to the same table as the original scan, an
SKVI {{deepCopy}} may solve the problem.  There may be some mechanisms in deepCopy we can
reuse or adapt.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message