accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: Feedback about techniques for tuning batch scanning for my problem
Date Mon, 23 May 2016 11:07:58 GMT
Hi Mario,

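If you have a finite number of locations, you could also try configuring a
locality group for each location. Since the location is your column family,
a scan that fetches a single location could then skip the data stored in
the other groups, which would prune out a significant amount of data.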

I also wonder if you might have better performance by making the row a
concatenation of your location and entity identifier. I think actual
performance of this compared to what you have now would depend on the
number of entities and locations per day. This might be something you can
experiment with. Instead of adding some shard bit, you could reduce the
split threshold to get more parallelism.
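
A minimal sketch of what such a concatenated row could look like. The layout
is an assumption, not something fixed in this thread: the 3-byte day is kept
as the leading prefix, followed by the 8-byte location and the 8-byte entity
id, all big-endian. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class RowKeySketch {
    // Assumed width of the day field, taken from the 3-byte day in the original layout.
    static final int DAY_BYTES = 3;

    /** Writes the low 3 bytes of the day, big-endian, into the buffer. */
    private static void putDay(ByteBuffer buf, int day) {
        if (day < 0 || day > 0xFFFFFF) {
            throw new IllegalArgumentException("day does not fit in 3 bytes: " + day);
        }
        buf.put((byte) (day >>> 16)).put((byte) (day >>> 8)).put((byte) day);
    }

    /** Row = day (3 bytes) + location (8 bytes) + entity id (8 bytes). */
    static byte[] encodeRow(int day, long location, long entity) {
        ByteBuffer buf = ByteBuffer.allocate(DAY_BYTES + 2 * Long.BYTES);
        putDay(buf, day);
        buf.putLong(location).putLong(entity); // ByteBuffer is big-endian by default
        return buf.array();
    }

    /** Prefix shared by every row for one location on one day. */
    static byte[] scanPrefix(int day, long location) {
        ByteBuffer buf = ByteBuffer.allocate(DAY_BYTES + Long.BYTES);
        putDay(buf, day);
        buf.putLong(location);
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] row = encodeRow(16944, 42L, 7L);
        byte[] prefix = scanPrefix(16944, 42L);
        System.out.println("row length: " + row.length);
        System.out.println("row starts with prefix: "
            + Arrays.equals(Arrays.copyOf(row, prefix.length), prefix));
    }
}
```

With this layout, all entities for one location on one day share a prefix,
so they sit in one contiguous range of rows rather than being spread over
column families inside a per-day row. Whether that is faster would still
depend on the data, as noted above.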

I don't have the manual in front of me, but there is a property which
controls the server-side batch size (how much data a server collects
before it's sent back to your BatchScanner). If your client does a lot of
processing, you could lower that buffer to receive smaller batches of data
more frequently.
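
To get a rough feel for the trade-off, here is a back-of-the-envelope sketch.
It assumes the property in question is table.scan.max.memory with a default
of 512K (worth checking against the manual for your version), and it uses the
22 bytes per entry from your message. The server accounts for entry size in
its own way and each tablet server fills its own buffer, so treat the numbers
as an order of magnitude only. The class and method names are hypothetical.

```java
class ScanBatchEstimate {
    /** Number of entries of the given size that fit in one server-side buffer (at least 1). */
    static long entriesPerBatch(long bufferBytes, long entryBytes) {
        return Math.max(1, bufferBytes / entryBytes);
    }

    /** Number of batches needed to return totalEntries, rounded up. */
    static long batchesNeeded(long totalEntries, long bufferBytes, long entryBytes) {
        long perBatch = entriesPerBatch(bufferBytes, entryBytes);
        return (totalEntries + perBatch - 1) / perBatch;
    }

    public static void main(String[] args) {
        long totalEntries = 1_000_000L; // ~1M entries per query, from the original message
        long entryBytes = 22L;          // raw entry size, from the original message
        for (long kb : new long[] {512, 64}) {
            long bufferBytes = kb * 1024;
            System.out.println(kb + "K buffer: ~"
                + entriesPerBatch(bufferBytes, entryBytes) + " entries per batch, ~"
                + batchesNeeded(totalEntries, bufferBytes, entryBytes) + " batches");
        }
    }
}
```

Under these assumptions a 512K buffer means a few dozen large batches for 1M
entries, while 64K means a few hundred small ones that reach the client
sooner.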

On May 19, 2016 11:08 AM, "Mario Pastorelli" <mario.pastorelli@teralytics.ch>
wrote:

> Hey people,
> I'm trying to tune the query performance a bit to see how fast it can go,
> and I thought it would be great to have comments from the community. The
> problem that I'm trying to solve in Accumulo is the following: we want to
> store the entities that have been in a certain location on a certain day.
> The location is a Long and the entity id is a Long. I want to be able to
> scan ~1M rows in a few seconds, possibly less than one. Right now, I'm
> doing the following things:
>
>    1. I'm using a sharding byte at the start of the rowId to keep the
>    data in the same range distributed across the cluster
>    2. all the records are encoded; a single record is composed of
>       1. rowId: 1 shard byte + 3 bytes for the day
>       2. column family: 8 bytes for the long corresponding to the hash of
>       the location
>       3. column qualifier: 8 bytes corresponding to the identifier of the
>       entity
>       4. value: 2 bytes for some additional information
>    3. I use a batch scanner because I don't need sorting and it's faster
>
> As expected, it takes a few seconds to scan 1M rows, but now I'm wondering
> if I can improve it. My ideas are the following:
>
>    1. set table.compaction.major.ratio to 1 because I don't care about
>    the ingestion performance and this should improve the query performance
>    2. pre-split tables to match the number of servers and then use a shard
>    byte as the first byte of the rowId. This should improve both writing and
>    reading the data because both should work in parallel, from what I
>    understand
>    3. enable bloom filters on the table
>
> Do you think those ideas make sense? Furthermore, I have two questions:
>
>    1. considering that a single entry is only 22 bytes but I'm going to
>    scan ~1M records per query, do you think I should change the BatchScanner
>    buffers somehow?
>    2. anything else to improve the scan speed? Again, I don't care about
>    the ingestion time
>
> Thanks for the help!
>
> --
> Mario Pastorelli | TERALYTICS
>
> *software engineer*
>
> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
> phone: +41794381682
> email: mario.pastorelli@teralytics.ch
> www.teralytics.net
>
> Company registration number: CH-020.3.037.709-7 | Trade register Canton
> Zurich
> Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz, Yann
> de Vries
>
> This e-mail message contains confidential information which is for the
> sole attention and use of the intended recipient. Please notify us at once
> if you think that it may not be intended for you and delete it immediately.
>
