accumulo-user mailing list archives

From Sean Busbey <>
Subject Re: OfflineScanner
Date Thu, 19 Feb 2015 15:59:04 GMT
Hi Marc!

Yep, you can do this using the optional "setOfflineTableScan" on
AccumuloInputFormat[1]. It still requires that the table be offline.

There's a good example of programmatically creating an offline clone if you
look at the MR job we use to verify the "Continuous Ingest" integration
test [2]:

      Random random = new Random();
      // Unique clone name: source table name plus 16 hex digits
      clone = opts.getTableName() + "_" + String.format("%016x", (random.nextLong() & 0x7fffffffffffffffL));
      conn = opts.getConnector();
      // Clone with flush=true, no property overrides, no excluded properties
      conn.tableOperations().clone(opts.getTableName(), clone, true, new HashMap<String,String>(), new HashSet<String>());
      ranges = conn.tableOperations().splitRangeByTablets(opts.getTableName(), new Range(), opts.maxMaps);
      // Take the clone offline before pointing the input format at it
      conn.tableOperations().offline(clone);
      AccumuloInputFormat.setInputTableName(job, clone);
      AccumuloInputFormat.setOfflineTableScan(job, true);
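
If the clone-naming line looks cryptic, here is a minimal standalone sketch of just that step. It needs nothing from Accumulo; the class name CloneNameSketch, the method name cloneName, and the table name "ci" are made up for illustration and are not part of the job above.

```java
import java.util.Random;

public class CloneNameSketch {
    // Builds "<table>_<16 hex digits>", following the same expression as the snippet above.
    public static String cloneName(String table, Random random) {
        // Masking with 0x7fffffffffffffffL clears the sign bit, so the value is non-negative.
        long suffix = random.nextLong() & 0x7fffffffffffffffL;
        // %016x zero-pads the hex form to exactly 16 characters.
        return table + "_" + String.format("%016x", suffix);
    }

    public static void main(String[] args) {
        // Prints something shaped like ci_<16 hex digits>; the digits differ on every run.
        System.out.println(cloneName("ci", new Random()));
    }
}
```

A random suffix keeps concurrent verification runs from colliding on the clone name without any coordination between them.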

[1]: * <>*
[2]: * <>*
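
On the cleanup Josh mentions further down the thread (deleting the clone when you're done), one way to make sure it always happens is a try/finally around the job. What follows is only a rough sketch under assumptions: TableOps and CloneWork are hypothetical stand-in interfaces so the example runs without a cluster. In a real job those three calls would be the clone, offline and delete methods on conn.tableOperations(), which also throw checked exceptions that are left out here.

```java
import java.util.ArrayList;
import java.util.List;

public class OfflineCloneLifecycle {
    /** Hypothetical stand-in for the three table operations discussed in this thread. */
    public interface TableOps {
        void cloneTable(String source, String clone);
        void offline(String table);
        void delete(String table);
    }

    /** Hypothetical callback for whatever reads the offline clone, e.g. the MR job. */
    public interface CloneWork {
        void run(String cloneTable);
    }

    /** Clone, take the clone offline, run the work, and always delete the clone. */
    public static void withOfflineClone(TableOps ops, String source, String clone, CloneWork work) {
        ops.cloneTable(source, clone);
        try {
            ops.offline(clone);
            work.run(clone);
        } finally {
            // Runs even if offline() or the work throws, so the clone is not left behind.
            ops.delete(clone);
        }
    }

    /** A fake TableOps that only records calls, used for the demonstration below. */
    public static TableOps recordingOps(final List<String> log) {
        return new TableOps() {
            public void cloneTable(String source, String clone) { log.add("clone " + source + " -> " + clone); }
            public void offline(String table) { log.add("offline " + table); }
            public void delete(String table) { log.add("delete " + table); }
        };
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        withOfflineClone(recordingOps(log), "ci", "ci_clone", t -> log.add("work " + t));
        System.out.println(log);
    }
}
```

The source table stays online the whole time; only the clone is taken offline and later removed.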

On Thu, Feb 19, 2015 at 9:47 AM, Marc Reichman <> wrote:

> Apologies for hijacking this, but is there any way to use an offline table
> clone with MapReduce and AccumuloInputFormat? That read speed increase
> sounds very appealing.
> On Thu, Feb 19, 2015 at 9:27 AM, Josh Elser <> wrote:
>> Typically, if you're using the OfflineScanner, you'd clone the table you
>> want to read and then take the clone offline. It's a simple (and fast)
>> solution that doesn't interrupt the availability of the table.
>> Doing the read offline will definitely be faster (maybe 20%; I'm not
>> entirely sure of an accurate number or how it scales with nodes). The pain
>> would be the extra work in creating the clone, taking the clone offline, and
>> eventually deleting the clone when you're done with it. A little more work,
>> but manageable.
>> Ara Ebrahimi wrote:
>>> Hi,
>>> I’m trying to optimize a connector we’ve written for Presto. In some
>>> cases we need to perform full table scans. This happens across all the
>>> nodes but each node is assigned to process only a sharded subset of data.
>>> Each shard is hosted by only 1 RFile. I’m looking at the
>>> AbstractInputFormat and OfflineIterator and it seems like the code is not
>>> that hard to use for this case. Is there any drawback? It seems that if the
>>> table is offline, OfflineIterator is used, which apparently reads the
>>> RFiles directly without any RPC, so I think it should be significantly
>>> faster. Is that so? Is there any drawback to using this while the table is
>>> not offline but no other app is messing with the table?
>>> Thanks,
>>> Ara.

