accumulo-user mailing list archives

From Marc Reichman <mreich...@pixelforensics.com>
Subject Re: AccumuloInputFormat and data locality for jobs that don't need keys sorted
Date Tue, 02 Aug 2016 13:30:23 GMT
Hi Mario,

In my experience, the performance/locality benefit of
AccumuloInputFormat/AccumuloRowInputFormat tends to shrink as you add more
ranges. In some cases I've found there's an efficiency curve you can
experiment with, where it's sometimes faster to scan more broadly and throw
out the unwanted data locally than to use many ranges. I've been using
Accumulo with classic MapReduce, YARN MapReduce, and Spark for a while, and
this has held true on all of those platforms.
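
A rough sketch of that trade-off, in plain Python rather than against the
Accumulo API. Everything here is an assumption made for illustration: keys
are integers, ranges are half-open (start, end) pairs, and the names
coalesce_ranges, make_filter, scan and the max_gap knob are hypothetical.
The idea is to merge nearby ranges into fewer, wider ones, scan those, and
discard the extra keys on the client side.

```python
from bisect import bisect_right


def coalesce_ranges(ranges, max_gap):
    """Merge half-open (start, end) ranges whose gap is <= max_gap.

    Returns fewer, wider ranges that still cover every input range.
    """
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: widen it instead of
            # adding another range (and, by assumption, another task).
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


def make_filter(ranges):
    """Build a predicate: does a key fall inside any originally requested range?"""
    exact = coalesce_ranges(ranges, 0)  # sorted and non-overlapping
    starts = [s for s, _ in exact]

    def wanted(key):
        i = bisect_right(starts, key) - 1
        return i >= 0 and key < exact[i][1]

    return wanted


def scan(table, ranges):
    """Toy stand-in for a range scan over a table of sorted keys."""
    return [k for k in table if any(s <= k < e for s, e in ranges)]


table = range(100)
requested = [(0, 3), (5, 8), (10, 12), (60, 64)]

wide = coalesce_ranges(requested, max_gap=5)  # fewer ranges to scan
wanted = make_filter(requested)
result = [k for k in scan(table, wide) if wanted(k)]  # drop extras locally
```

Raising max_gap means fewer ranges but more discarded data, so the useful
value is something to measure on the actual table rather than assume.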

Good luck!

Marc

On Mon, Aug 1, 2016 at 6:55 PM, Mario Pastorelli <
mario.pastorelli@teralytics.ch> wrote:

> I would like to use an Accumulo table as input for a Spark job. Let me
> clarify that my job doesn't need keys sorted and Accumulo is purely used to
> filter the input data thanks to its index on the keys. The data that I
> need to process in Spark is still a small portion of the full dataset.
> I know that Accumulo provides the AccumuloInputFormat but in my tests
> almost no task has data locality when I use this input format which leads
> to poor performance. I'm not sure why this happens but my guess is that
> the AccumuloInputFormat creates one task per range.
> I wonder if there is a way to tell the AccumuloInputFormat to split
> each range into the sub-ranges local to each tablet server so that each
> task in Spark will read only data from the same machine where it is
> running.
>
> Thanks for the help,
> Mario
>
> --
> Mario Pastorelli | TERALYTICS
>
> *software engineer*
>
> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
> phone: +41794381682
> email: mario.pastorelli@teralytics.ch
> www.teralytics.net
>
> Company registration number: CH-020.3.037.709-7 | Trade register Canton
> Zurich
> Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz, Yann
> de Vries
>
