accumulo-user mailing list archives

From Marc Reichman
Subject Re: Feedback about techniques for tuning batch scanning for my problem
Date Thu, 19 May 2016 15:53:57 GMT
Hi Mario,

Not sure where this plays into your data integrity, but have you looked
into these settings in hdfs-site.xml?

These make for a somewhat dramatic increase in HDFS read performance if
the data is distributed well enough around the cluster.

I can't speak as much to the scanner params, but you may look into these as


On Thu, May 19, 2016 at 10:08 AM, Mario Pastorelli wrote:

> Hey people,
> I'm trying to tune query performance a bit to see how fast it can go,
> and I thought it would be great to have comments from the community. The
> problem that I'm trying to solve in Accumulo is the following: we want to
> store the entities that have been in a certain location on a certain day.
> The location is a Long and the entity id is a Long. I want to be able to
> scan ~1M rows in a few seconds, possibly less than one. Right now, I'm
> doing the following things:
>    1. I'm using a sharding byte at the start of the rowId to keep the
>    data in the same range distributed in the cluster
>    2. all the records are encoded; a single record is composed of
>       1. rowId: 1 shard byte + 3 bytes for the day
>       2. column family: 8 bytes for the long corresponding to the hash of
>       the location
>       3. column qualifier: 8 bytes corresponding to the identifier of the
>       entity
>       4. value: 2 bytes for some additional information
>    3. I use a batch scanner because I don't need sorting and it's faster
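
For reference, the record layout in the list above could be sketched roughly
as follows. This is only an illustration, not the actual encoder from the
thread: the big-endian byte order, the "days since 1970-01-01" day encoding,
the choice of shard from the entity id, and the names NUM_SHARDS and
encode_entry are all assumptions.

```python
import struct
from datetime import date

NUM_SHARDS = 16  # assumption: the shard count is not stated in the thread


def encode_entry(day, location_hash, entity_id, extra, num_shards=NUM_SHARDS):
    """Return (row, column_family, column_qualifier, value) as bytes."""
    # Assumption: the shard is derived from the entity id.
    shard = entity_id % num_shards
    # Assumption: the day is stored as days since the Unix epoch in 3 bytes.
    days = (day - date(1970, 1, 1)).days
    row = bytes([shard]) + days.to_bytes(3, "big")
    column_family = struct.pack(">q", location_hash)   # 8-byte signed long
    column_qualifier = struct.pack(">q", entity_id)    # 8-byte signed long
    value = struct.pack(">H", extra)                   # 2 bytes of extra info
    return row, column_family, column_qualifier, value


row, cf, cq, value = encode_entry(date(2016, 5, 19), 42, 7, 1)
entry_size = len(row) + len(cf) + len(cq) + len(value)
```

With this layout, entry_size comes out to the 22 bytes per entry mentioned
later in the message.
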
> As expected, it takes a few seconds to scan 1M rows but now I'm wondering if
> I can improve it. My ideas are the following:
>    1. set table.compaction.major.ratio to 1 because I don't care about
>    the ingestion performance and this should improve the query performance
>    2. pre-split tables to match the number of servers and then use a shard
>    byte as the first byte of the rowId. This should improve both writing and
>    reading the data because both should work in parallel, from what I understand
>    3. enable bloom filter on the table
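
To make the pre-splitting idea concrete, here is a small sketch of how the
split points and the per-shard row ids for one day might be computed. It
assumes shard values 0..N-1 in the first row byte and a 3-byte big-endian day;
split_points and rows_for_day are hypothetical helper names. In a real client
the resulting byte strings would be handed to something like
TableOperations.addSplits and turned into the Ranges given to
BatchScanner.setRanges.

```python
def split_points(num_shards):
    """One split per shard boundary, so each shard can land in its own tablet."""
    if not 1 <= num_shards <= 256:
        raise ValueError("the shard must fit in a single byte")
    # Shard 0 starts at the beginning of the table, so it needs no split point.
    return [bytes([shard]) for shard in range(1, num_shards)]


def rows_for_day(days_since_epoch, num_shards):
    """Row ids to look up for one day: one row per shard."""
    day_bytes = days_since_epoch.to_bytes(3, "big")
    return [bytes([shard]) + day_bytes for shard in range(num_shards)]


splits = split_points(4)
rows = rows_for_day(1, 2)
```

Because every query for a day touches one row in each shard, the batch
scanner can read from all tablets in parallel, which is the effect the list
above is aiming for.
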
> Do you think those ideas make sense? Furthermore, I have two questions:
>    1. considering that a single entry is only 22 bytes but I'm going to
>    scan ~1M records per query, do you think I should change the BatchScanner
>    buffers somehow?
>    2. anything else to improve the scan speed? Again, I don't care about
>    the ingestion time
> Thanks for the help!
> --
> Mario Pastorelli | TERALYTICS
> *software engineer*
