accumulo-user mailing list archives

From David Medinets <david.medin...@gmail.com>
Subject Re: Feedback about techniques for tuning batch scanning for my problem
Date Fri, 20 May 2016 13:49:57 GMT
How many cores do your servers have? There are several thread counts you can
change. Even one extra thread per server adds up at some point if you have
enough servers in the cluster.

On Fri, May 20, 2016 at 2:54 AM, Mario Pastorelli <
mario.pastorelli@teralytics.ch> wrote:

> Do you mean the number of BatchScanner threads? I've made it configurable,
> and I usually use 1 or 2 threads per tablet server. Going higher doesn't
> seem to improve performance.
>
> On Thu, May 19, 2016 at 6:21 PM, David Medinets <david.medinets@gmail.com>
> wrote:
>
>> Have you tuned thread counts?
>> On May 19, 2016 11:08 AM, "Mario Pastorelli" <
>> mario.pastorelli@teralytics.ch> wrote:
>>
>>> Hey people,
>>> I'm trying to tune query performance a bit to see how fast it can go,
>>> and I thought it would be great to get comments from the community. The
>>> problem I'm trying to solve in Accumulo is the following: we want to
>>> store the entities that have been in a certain location on a certain day.
>>> The location is a Long and the entity id is a Long. I want to be able to
>>> scan ~1M rows in a few seconds, possibly in less than one. Right now, I'm
>>> doing the following things:
>>>
>>>    1. I'm using a sharding byte at the start of the rowId to keep the
>>>    data in the same range distributed across the cluster
>>>    2. all the records are encoded; a single record is composed of
>>>       1. rowId: 1 shard byte + 3 bytes for the day
>>>       2. column family: 8 bytes for the long corresponding to the hash
>>>       of the location
>>>       3. column qualifier: 8 bytes corresponding to the identifier of
>>>       the entity
>>>       4. value: 2 bytes for some additional information
>>>    3. I use a batch scanner because I don't need sorting and it's faster
>>>
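A minimal sketch of the record layout described in the list above, using only the Java standard library. The message does not say how the day or the shard is computed, so two things here are assumptions: the day is encoded as days since the Unix epoch, and the hypothetical `shardFor` helper derives the shard from the entity id. The class and method names are illustrative, not part of Accumulo.

```java
import java.nio.ByteBuffer;
import java.time.LocalDate;

class RecordLayout {
    // Row id: 1 shard byte followed by the day as a 3-byte big-endian integer.
    static byte[] rowId(int shard, int day) {
        if (shard < 0 || shard > 0xFF) {
            throw new IllegalArgumentException("shard must fit in one byte");
        }
        if (day < 0 || day > 0xFFFFFF) {
            throw new IllegalArgumentException("day must fit in three bytes");
        }
        return new byte[] {
            (byte) shard, (byte) (day >>> 16), (byte) (day >>> 8), (byte) day
        };
    }

    // 8-byte big-endian encoding, used for the location hash and the entity id.
    static byte[] longBytes(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    // ASSUMPTION: shard chosen from the entity id; the original scheme is not stated.
    static int shardFor(long entityId, int numShards) {
        return (int) Math.floorMod(entityId, (long) numShards);
    }

    public static void main(String[] args) {
        // ASSUMPTION: the day is counted as days since 1970-01-01.
        int day = (int) LocalDate.of(2016, 5, 19).toEpochDay();
        long locationHash = 42L;   // placeholder value
        long entityId = 1234567L;  // placeholder value

        byte[] row = rowId(shardFor(entityId, 16), day);
        byte[] family = longBytes(locationHash);
        byte[] qualifier = longBytes(entityId);
        byte[] value = new byte[2];

        // 4 + 8 + 8 + 2 = 22, the entry size quoted later in the message.
        int total = row.length + family.length + qualifier.length + value.length;
        System.out.println("bytes per entry: " + total);
    }
}
```

In a real client these byte arrays would be wrapped in the key and value types of the Accumulo API before being written; that step is left out so the sketch stays self-contained.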
>>> As expected, it takes a few seconds to scan 1M rows, but now I'm
>>> wondering if I can improve it. My ideas are the following:
>>>
>>>    1. set table.compaction.major.ratio to 1 because I don't care about
>>>    the ingestion performance and this should improve the query performance
>>>    2. pre-split tables to match the number of servers and then use a
>>>    shard byte as the first byte of the rowId. This should improve both
>>>    writing and reading the data because, from what I understand, both
>>>    should then work in parallel
>>>    3. enable bloom filters on the table
>>>
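The pre-splitting idea can be sketched in plain Java as follows. With one shard byte at the front of the rowId, one-byte split points 1..N-1 put each shard in its own tablet, and a query for one day needs one row per shard. The names `ShardSplits`, `splitPoints`, and `rowsForDay` are hypothetical, and the shard count is an assumed parameter rather than something stated in the thread.

```java
import java.util.ArrayList;
import java.util.List;

class ShardSplits {
    // One-byte split points 0x01..(numShards-1). Rows sort as unsigned bytes,
    // so every row starting with shard byte s sorts after the split point
    // {s} and before {s+1}, giving one tablet per shard.
    static List<byte[]> splitPoints(int numShards) {
        if (numShards < 1 || numShards > 256) {
            throw new IllegalArgumentException("numShards must be in 1..256");
        }
        List<byte[]> splits = new ArrayList<>();
        for (int s = 1; s < numShards; s++) {
            splits.add(new byte[] {(byte) s});
        }
        return splits;
    }

    // The rows to scan for one day: the same 3-byte day under every shard byte.
    static List<byte[]> rowsForDay(int day, int numShards) {
        if (day < 0 || day > 0xFFFFFF) {
            throw new IllegalArgumentException("day must fit in three bytes");
        }
        List<byte[]> rows = new ArrayList<>();
        for (int s = 0; s < numShards; s++) {
            rows.add(new byte[] {
                (byte) s, (byte) (day >>> 16), (byte) (day >>> 8), (byte) day
            });
        }
        return rows;
    }

    public static void main(String[] args) {
        int numShards = 8; // ASSUMPTION: e.g. one shard per tablet server
        System.out.println("split points: " + splitPoints(numShards).size());
        System.out.println("rows per day query: " + rowsForDay(16940, numShards).size());
    }
}
```

With the Accumulo client, the split points would typically be passed to `TableOperations.addSplits` and each row turned into a `Range` for the BatchScanner. The table properties mentioned in the list (the major compaction ratio and bloom filters) are set separately, for example through `TableOperations.setProperty`.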
>>> Do you think those ideas make sense? Furthermore, I have two questions:
>>>
>>>    1. considering that a single entry is only 22 bytes but I'm going to
>>>    scan ~1M records per query, do you think I should change the BatchScanner
>>>    buffers somehow?
>>>    2. anything else to improve the scan speed? Again, I don't care
>>>    about the ingestion time
>>>
>>> Thanks for the help!
>>>
>>> --
>>> Mario Pastorelli | TERALYTICS
>>>
>>> *software engineer*
>>>
>>> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
>>> phone: +41794381682
>>> email: mario.pastorelli@teralytics.ch
>>> www.teralytics.net
>>>
>>> Company registration number: CH-020.3.037.709-7 | Trade register Canton
>>> Zurich
>>> Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz,
>>> Yann de Vries
>>>
>>> This e-mail message contains confidential information which is for the
>>> sole attention and use of the intended recipient. Please notify us at once
>>> if you think that it may not be intended for you and delete it immediately.
>>>
>>
>
>
