accumulo-user mailing list archives

From: Keith Turner <ke...@deenlo.com>
Subject: Re: AccumuloInputFormat and data locality for jobs that don't need keys sorted
Date: Tue, 02 Aug 2016 15:56:03 GMT

If you are not aware of it, something else to consider is the
setOfflineTableScan[1] option.  This can support much faster reads of
data.  In my experience this is usually only useful for map-only jobs
like the one you are running.  In a full map/reduce job, the sort can
make a speedup in the map read rate irrelevant.
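
For reference, a rough configuration fragment showing one way to use
this option, following the approach the InputFormatBase javadoc
suggests: clone the table, take the clone offline, and point the job
at the clone.  This is only a sketch against the 1.7 mapred API.  The
class name, the method parameters, and the "_offline_clone" suffix are
placeholders I made up for illustration, and it needs the Accumulo and
Hadoop jars plus a running instance, so treat it as untested.

```java
import java.util.Collections;

import org.apache.accumulo.core.client.ClientConfiguration;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.mapred.AccumuloInputFormat;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical helper class; all names and arguments are illustrative.
public class OfflineScanSetup {

    public static JobConf configure(String instance, String zookeepers,
                                    String user, String password,
                                    String table) throws Exception {
        ClientConfiguration clientConf = ClientConfiguration.loadDefault()
                .withInstance(instance)
                .withZkHosts(zookeepers);
        PasswordToken token = new PasswordToken(password);
        Connector conn = new ZooKeeperInstance(clientConf).getConnector(user, token);

        // Clone the live table (flushing it first) and take the clone offline,
        // so the original table stays online for other clients.
        String clone = table + "_offline_clone"; // placeholder name
        conn.tableOperations().clone(table, clone, true,
                Collections.<String, String>emptyMap(),
                Collections.<String>emptySet());
        conn.tableOperations().offline(clone, true); // wait until offline

        JobConf job = new JobConf();
        AccumuloInputFormat.setConnectorInfo(job, user, token);
        AccumuloInputFormat.setZooKeeperInstance(job, clientConf);
        AccumuloInputFormat.setInputTableName(job, clone);
        AccumuloInputFormat.setScanAuthorizations(job, Authorizations.EMPTY);
        // Read the table's files directly from HDFS instead of going
        // through the tablet servers.  The job fails if the table is online.
        AccumuloInputFormat.setOfflineTableScan(job, true);
        return job;
    }
}
```

Per the javadoc, the user running the job needs read access to the
Accumulo directory in HDFS, and any iterators configured on the table
must be on the mapper's classpath.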

You still may not get locality if tablets have multiple files, because
a merged read of the files is done in the mapper.  Offline map reduce
in Accumulo attempts to run mappers at the last location where a
tablet compacted some of its files.  Even w/o locality you still avoid
the cost of de-serializing, re-serializing, transmitting, and
de-serializing data in the tserver+client.
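
To illustrate what "merged read" means here: each file of a tablet is
sorted on its own, so the reader has to interleave them into one
sorted stream, and every file may live on a different set of
datanodes.  Below is a toy sketch of that idea using plain strings as
keys.  It is not Accumulo's actual implementation (which merges
key/value pairs from RFiles through its iterator stack); the class and
method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Toy k-way merge: not Accumulo code, names are illustrative only.
class MergedRead {

    // One cursor per sorted "file": its current head key and the remaining keys.
    private static final class Cursor {
        String head;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.head = it.next();
        }
    }

    /** Merges individually sorted key lists into a single sorted list. */
    static List<String> merge(List<List<String>> sortedFiles) {
        // The heap always exposes the cursor with the smallest head key.
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        for (List<String> file : sortedFiles) {
            Iterator<String> it = file.iterator();
            if (it.hasNext()) {
                heap.add(new Cursor(it));
            }
        }

        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            out.add(c.head);
            // Advance the cursor and put it back if its file has more keys.
            if (c.rest.hasNext()) {
                c.head = c.rest.next();
                heap.add(c);
            }
        }
        return out;
    }
}
```

Merging the two lists [a, d] and [b, c, e] yields [a, b, c, d, e].
The point is that a single mapper touches every file of the tablet, so
it can only be local to all of them if their blocks happen to sit on
the same node.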

[1]: http://accumulo.apache.org/1.7/apidocs/org/apache/accumulo/core/client/mapred/InputFormatBase.html#setOfflineTableScan%28org.apache.hadoop.mapred.JobConf,%20boolean%29

On Mon, Aug 1, 2016 at 7:55 PM, Mario Pastorelli
<mario.pastorelli@teralytics.ch> wrote:
> I would like to use an Accumulo table as input for a Spark job. Let me
> clarify that my job doesn't need keys sorted and Accumulo is purely used to
> filter the input data thanks to its index on the keys. The data that I need
> to process in Spark is still a small portion of the full dataset.
> I know that Accumulo provides the AccumuloInputFormat but in my tests almost
> no task has data locality when I use this input format which leads to poor
> performance. I'm not sure why this happens but my guess is that the
> AccumuloInputFormat creates one task per range.
> I wonder if there is a way to tell the AccumuloInputFormat to split each
> range into the sub-ranges local to each tablet server so that each task in
> Spark will read only data from the same machine where it is running.
>
> Thanks for the help,
> Mario
