accumulo-user mailing list archives

From David Medinets <david.medin...@gmail.com>
Subject Re: Feedback about techniques for tuning batch scanning for my problem
Date Thu, 19 May 2016 16:21:45 GMT
Have you tuned thread counts?
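If not, the number of query threads is fixed when the BatchScanner is created.
A minimal sketch against the 1.x client API (the class name, table name,
authorizations, and thread count below are placeholders, not from your setup):

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.security.Authorizations;

public class ScanThreads {
  // Opens a BatchScanner with an explicit number of query threads; raising
  // this is often the first knob to try for wide, unsorted scans.
  static BatchScanner open(Connector conn, String table) throws TableNotFoundException {
    int numQueryThreads = 16;  // placeholder value; tune against your cluster
    return conn.createBatchScanner(table, Authorizations.EMPTY, numQueryThreads);
  }
}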
On May 19, 2016 11:08 AM, "Mario Pastorelli" <mario.pastorelli@teralytics.ch>
wrote:

> Hey people,
> I'm trying to tune the query performance a bit to see how fast it can go,
> and I thought it would be great to get comments from the community. The
> problem I'm trying to solve in Accumulo is the following: we want to store
> the entities that have been in a certain location on a certain day. The
> location is a Long and the entity id is a Long. I want to be able to scan
> ~1M rows in a few seconds, possibly less than one. Right now, I'm doing
> the following things:
>
>    1. I'm using a sharding byte at the start of the rowId to keep the
>    data for the same range distributed across the cluster
>    2. all the records are encoded; a single record is composed of (a
>       sketch of this encoding follows the list):
>       1. rowId: 1 shard byte + 3 bytes for the day
>       2. column family: 8 bytes for the long corresponding to the hash of
>       the location
>       3. column qualifier: 8 bytes corresponding to the identifier of the
>       entity
>       4. value: 2 bytes for some additional information
>    3. I use a batch scanner because I don't need sorting and it's faster
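>
> For concreteness, this is roughly how one record is built (a sketch only;
> the class and method names are placeholders):
>
> import java.nio.ByteBuffer;
> import org.apache.accumulo.core.data.Mutation;
>
> public class RecordEncoder {
>   // Row: 1 shard byte + 3 day bytes; CF: 8-byte location hash;
>   // CQ: 8-byte entity id; value: 2 bytes of additional info.
>   static Mutation encode(byte shard, int dayOrdinal, long locationHash,
>       long entityId, short extra) {
>     byte[] row = new byte[4];
>     row[0] = shard;
>     row[1] = (byte) (dayOrdinal >>> 16);
>     row[2] = (byte) (dayOrdinal >>> 8);
>     row[3] = (byte) dayOrdinal;
>     byte[] cf = ByteBuffer.allocate(8).putLong(locationHash).array();
>     byte[] cq = ByteBuffer.allocate(8).putLong(entityId).array();
>     byte[] val = ByteBuffer.allocate(2).putShort(extra).array();
>     Mutation m = new Mutation(row);
>     m.put(cf, cq, val);
>     return m;
>   }
> }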
>
> As expected, it takes a few seconds to scan 1M rows, but now I'm wondering
> if I can improve on that. My ideas are the following:
>
>    1. set table.compaction.major.ratio to 1, because I don't care about
>    the ingestion performance and this should improve the query performance
>    2. pre-split the table to match the number of servers and then use a
>    shard byte as the first byte of the rowId. As I understand it, this
>    should improve both writing and reading because both can then work in
>    parallel
>    3. enable the bloom filter on the table (a sketch of these settings
>    follows the list)
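>
> Via the Java API, ideas 1-3 would look roughly like this (a sketch only;
> the table name, split points, and whether these values actually help are
> exactly what I'm unsure about):
>
> import java.util.SortedSet;
> import java.util.TreeSet;
> import org.apache.accumulo.core.client.Connector;
> import org.apache.accumulo.core.client.admin.TableOperations;
> import org.apache.hadoop.io.Text;
>
> public class TableTuning {
>   static void apply(Connector conn, String table, int numShards) throws Exception {
>     TableOperations ops = conn.tableOperations();
>     ops.setProperty(table, "table.compaction.major.ratio", "1");  // idea 1
>     SortedSet<Text> splits = new TreeSet<Text>();                 // idea 2
>     for (int shard = 1; shard < numShards; shard++) {
>       splits.add(new Text(new byte[] { (byte) shard }));
>     }
>     ops.addSplits(table, splits);
>     ops.setProperty(table, "table.bloom.enabled", "true");        // idea 3
>   }
> }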
>
> Do you think those ideas make sense? Furthermore, I have two questions:
>
>    1. considering that a single entry is only 22 bytes but I'm going to
>    scan ~1M records per query, do you think I should change the
>    BatchScanner buffers somehow? (see the note after this list)
>    2. is there anything else I can do to improve the scan speed? Again, I
>    don't care about the ingestion time
>
> Thanks for the help!
>
> --
> Mario Pastorelli | TERALYTICS
>
> *software engineer*
>
> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
> phone: +41794381682
> email: mario.pastorelli@teralytics.ch
> www.teralytics.net
>
> Company registration number: CH-020.3.037.709-7 | Trade register Canton
> Zurich
> Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz, Yann
> de Vries
>
> This e-mail message contains confidential information which is for the
> sole attention and use of the intended recipient. Please notify us at once
> if you think that it may not be intended for you and delete it immediately.
>
