accumulo-user mailing list archives

From: James Hughes
Subject: Re: AccumuloInputFormat and data locality for jobs that don't need keys sorted
Date: Tue, 02 Aug 2016 15:19:30 GMT

Hi Mario,

If I recall, the AccumuloInputFormat wants to have a task per range by
default.  For geospatial indexing (or other non-trivial cases), a query
plan can generate a bunch of ranges.  Eugene Cheipesh wrote in about a year
and a half ago about this same issue as it came up in GeoTrellis:

From that discussion it sounded like a JIRA ticket and maybe some work was
underway.  I remember one of the GeoMesans looking at what Eugene had done
and implementing a similar idea in our GeoMesaInputFormat.  Since cloud
configurations can vary quite a bit, we left some light hooks to allow for

Overall, I think this is likely to be something one has to play with in
terms of particular use case and cloud.
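
To make the idea concrete, here is a minimal sketch of grouping query
ranges by tablet, so that each task reads from one tablet and can be
scheduled next to the tablet server hosting it. It is plain Python, not
Accumulo API: the function names, the split points, and the server names
are hypothetical, and ranges and tablets are modeled as half-open
[start, end) row intervals, which simplifies Accumulo's real boundary
inclusivity.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def split_ranges_by_tablet(ranges, splits):
    """Clip each [start, end) row range at tablet boundaries.

    `splits` is the sorted list of split points; in this simplified model
    tablet i covers [splits[i-1], splits[i]), with the first and last
    tablets open-ended. Returns {tablet_index: [(start, end), ...]}.
    """
    by_tablet = defaultdict(list)
    for start, end in ranges:
        if start >= end:
            continue  # skip empty ranges
        first = bisect_right(splits, start)  # tablet containing `start`
        last = bisect_left(splits, end)      # last tablet with rows < `end`
        for t in range(first, last + 1):
            lo = start if t == first else splits[t - 1]
            hi = end if t == last else splits[t]
            by_tablet[t].append((lo, hi))
    return dict(by_tablet)


def plan_tasks(ranges, splits, locations):
    """Return one (preferred_server, tablet_index, ranges) task per tablet.

    `locations` maps a tablet index to the (hypothetical) server hosting it.
    """
    by_tablet = split_ranges_by_tablet(ranges, splits)
    return [(locations[t], t, by_tablet[t]) for t in sorted(by_tablet)]


if __name__ == "__main__":
    splits = ["g", "p"]                      # three tablets
    locations = {0: "ts1", 1: "ts2", 2: "ts1"}
    query = [("a", "c"), ("e", "h"), ("k", "m"), ("q", "z")]
    for server, tablet, rs in plan_tasks(query, splits, locations):
        print(server, tablet, rs)
```

In this toy setup four query ranges become three tasks, and the range
("e", "h"), which straddles a split point, is cut in two so that neither
half has to be read from a remote server. How coarsely to group (per
tablet, per server, or some cap on ranges per task) is exactly the kind
of knob that depends on the use case and cluster.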



Links in case they help:

On Tue, Aug 2, 2016 at 9:30 AM, Marc Reichman wrote:

> Hi Mario,
> In my experience, the performance/locality benefit of
> AccumuloInputFormat/AccumuloRowInputFormat tends to shrink if you
> add a lot of ranges. In some cases, I've found there's an efficiency curve
> you can experiment with, where it's sometimes faster to scan more broadly
> and throw out data locally than to use many ranges. I've been using
> Accumulo with classic MapReduce, YARN MapReduce, and Spark for a while,
> and this has held true on all of those platforms.
> Good luck!
> Marc
> On Mon, Aug 1, 2016 at 6:55 PM, Mario Pastorelli wrote:
>> I would like to use an Accumulo table as input for a Spark job. Let me
>> clarify that my job doesn't need keys sorted and Accumulo is purely used to
>> filter the input data thanks to its index on the keys. The data that I
>> need to process in Spark is still a small portion of the full dataset.
>> I know that Accumulo provides the AccumuloInputFormat, but in my tests
>> almost no task has data locality when I use this input format, which leads
>> to poor performance. I'm not sure why this happens, but my guess is that
>> the AccumuloInputFormat creates one task per range.
>> I wonder if there is a way to tell the AccumuloInputFormat to split
>> each range into the sub-ranges local to each tablet server, so that each
>> task in Spark will read only data from the same machine where it is
>> running.
>> Thanks for the help,
>> Mario
