accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: Feedback about techniques for tuning batch scanning for my problem
Date Mon, 23 May 2016 11:09:40 GMT
This probably isn't a big issue unless you're running into stability issues
with Accumulo. They're both designed to scale horizontally. Unless you have
a reason that they can't be colocated, it's fine.
On May 21, 2016 2:29 PM, "David Medinets" <david.medinets@gmail.com> wrote:

> Why are Accumulo and Spark sharing the same machines? Does Spark give you
> any kind of data locality that Accumulo does? Could it be better to use the
> full amount of memory for each?
> On May 21, 2016 1:15 PM, "Mario Pastorelli" <
> mario.pastorelli@teralytics.ch> wrote:
>
>> Currently, setting the number of threads to either the number of servers or
>> the number of cores yields similar performance when scanning with
>> BatchScanner. Thanks for the advice, I will try using half of the cores of
>> each machine in the cluster.
>>
>> Anything else?
>>
>> On Sat, May 21, 2016 at 5:03 AM, David Medinets <david.medinets@gmail.com> wrote:
>>
>>> It's been a few years, so I don't remember the specific property names.
>>> To start, set one thread count to the number of servers times the number of
>>> cores. Multiply by 0.5 if Spark is as active as Accumulo. Look in
>>> Property.java for the property names.
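
The sizing rule above can be sketched as a small calculation. This is a minimal sketch, not Accumulo code: the class and method names are hypothetical, the cluster size of 10 servers is an assumed value (the thread does not state it), and the 0.5 factor reflects the reading that the count is halved when Spark is as busy as Accumulo.

```java
final class ScanThreadSizing {

    /**
     * Rule of thumb from the thread: servers x cores, scaled by the share of
     * each machine that Accumulo gets (0.5 when Spark is equally active).
     * Always returns at least 1.
     */
    static int suggestedThreads(int tabletServers, int coresPerServer, double accumuloShare) {
        if (tabletServers < 1 || coresPerServer < 1) {
            throw new IllegalArgumentException("servers and cores must be positive");
        }
        if (accumuloShare <= 0.0 || accumuloShare > 1.0) {
            throw new IllegalArgumentException("share must be in (0, 1]");
        }
        long totalCores = (long) tabletServers * coresPerServer;
        return (int) Math.max(1, Math.round(totalCores * accumuloShare));
    }

    public static void main(String[] args) {
        int servers = 10; // assumed cluster size, not given in the thread
        int cores = 32;   // cores per machine, as stated in the thread
        System.out.println(suggestedThreads(servers, cores, 0.5)); // prints 160
    }
}
```

If the BatchScanner is the thread count in question, the result would be the `numQueryThreads` argument of `Connector.createBatchScanner(tableName, authorizations, numQueryThreads)` in the 1.x client API. Treat it as a starting point to measure from, not as a target.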
>>>
>>> On Fri, May 20, 2016 at 10:09 AM, Mario Pastorelli <
>>> mario.pastorelli@teralytics.ch> wrote:
>>>
>>>> Machines have 32 cores shared between Accumulo and Spark. Each machine
>>>> has 5 disks that host HDFS and that Accumulo can use. How many
>>>> threads should I use?
>>>>
>>>> On Fri, May 20, 2016 at 3:49 PM, David Medinets <
>>>> david.medinets@gmail.com> wrote:
>>>>
>>>>> How many cores are on your servers? There are several thread counts
>>>>> you can change. Even +1 thread per server counts at some point if you
>>>>> have enough servers in the cluster.
>>>>>
>>>>> On Fri, May 20, 2016 at 2:54 AM, Mario Pastorelli <
>>>>> mario.pastorelli@teralytics.ch> wrote:
>>>>>
>>>>>> You mean the BatchScanner number of threads? I've made it a parameter
>>>>>> and usually I use 1 or 2 threads per tablet server. Going higher doesn't
>>>>>> seem to do anything for performance.
>>>>>>
>>>>>> On Thu, May 19, 2016 at 6:21 PM, David Medinets <
>>>>>> david.medinets@gmail.com> wrote:
>>>>>>
>>>>>>> Have you tuned thread counts?
>>>>>>> On May 19, 2016 11:08 AM, "Mario Pastorelli" <
>>>>>>> mario.pastorelli@teralytics.ch> wrote:
>>>>>>>
>>>>>>>> Hey people,
>>>>>>>> I'm trying to tune query performance a bit to see how fast it can go,
>>>>>>>> and I thought it would be great to have comments from the community.
>>>>>>>> The problem that I'm trying to solve in Accumulo is the following: we
>>>>>>>> want to store the entities that have been in a certain location on a
>>>>>>>> certain day. The location is a Long and the entity id is a Long. I want
>>>>>>>> to be able to scan ~1M rows in a few seconds, possibly less than one.
>>>>>>>> Right now, I'm doing the following things:
>>>>>>>>
>>>>>>>>    1. I'm using a sharding byte at the start of the rowId to keep
>>>>>>>>    the data in the same range distributed across the cluster
>>>>>>>>    2. all the records are encoded; a single record is composed of
>>>>>>>>       1. rowId: 1 shard byte + 3 bytes for the day
>>>>>>>>       2. column family: 8 bytes for the long corresponding to the
>>>>>>>>       hash of the location
>>>>>>>>       3. column qualifier: 8 bytes corresponding to the identifier
>>>>>>>>       of the entity
>>>>>>>>       4. value: 2 bytes for some additional information
>>>>>>>>    3. I use a batch scanner because I don't need sorting and it's
>>>>>>>>    faster
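
The record layout described above can be sketched with plain byte arrays. This is a minimal sketch under stated assumptions, not Mario's actual encoder: the class and method names are hypothetical, the day is assumed to be a non-negative day number stored big-endian in 3 bytes, and the shard is assumed to be derived from a hash of the entity id (the thread says neither how the day is packed nor how the shard byte is chosen).

```java
import java.nio.ByteBuffer;

final class LocationRecordCodec {

    /** rowId: 1 shard byte followed by the day as a 3-byte big-endian number (assumed layout). */
    static byte[] rowId(byte shard, int day) {
        if (day < 0 || day > 0xFFFFFF) {
            throw new IllegalArgumentException("day must fit in 3 bytes");
        }
        return new byte[] {shard, (byte) (day >>> 16), (byte) (day >>> 8), (byte) day};
    }

    /** 8-byte big-endian encoding, used for both the location hash and the entity id. */
    static byte[] longBytes(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    /** value: 2 bytes of additional information. */
    static byte[] value(short info) {
        return ByteBuffer.allocate(Short.BYTES).putShort(info).array();
    }

    /** Assumed shard choice: hash of the entity id modulo the number of shards (1..256). */
    static byte shardFor(long entityId, int numShards) {
        if (numShards < 1 || numShards > 256) {
            throw new IllegalArgumentException("numShards must be in 1..256");
        }
        return (byte) Math.floorMod(Long.hashCode(entityId), numShards);
    }

    public static void main(String[] args) {
        long locationHash = 42L; // hypothetical sample values
        long entityId = 7L;
        byte[] row = rowId(shardFor(entityId, 16), 16942);
        byte[] family = longBytes(locationHash);
        byte[] qualifier = longBytes(entityId);
        byte[] val = value((short) 1);
        // 4 + 8 + 8 + 2 bytes per entry
        System.out.println(row.length + family.length + qualifier.length + val.length); // prints 22
    }
}
```

The four arrays correspond to the row, column family, column qualifier and value of one Accumulo entry, and the 22-byte total matches the entry size mentioned later in the message.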
>>>>>>>>
>>>>>>>> As expected, it takes a few seconds to scan 1M rows, but now I'm
>>>>>>>> wondering if I can improve it. My ideas are the following:
>>>>>>>>
>>>>>>>>    1. set table.compaction.major.ratio to 1 because I don't care
>>>>>>>>    about the ingestion performance and this should improve the
>>>>>>>>    query performance
>>>>>>>>    2. pre-split tables to match the number of servers and then use
>>>>>>>>    a shard byte as the first byte of the rowId. This should improve
>>>>>>>>    both writing and reading the data because both should work in
>>>>>>>>    parallel, from what I understand
>>>>>>>>    3. enable bloom filters on the table
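
The pre-splitting idea in item 2 can be sketched as generating one split point per shard boundary. This is a minimal sketch with hypothetical names, assuming one tablet per shard byte value and shard bytes numbered 0 to numShards - 1; an Accumulo split point is the last row of a tablet, so the one-byte rows 1, 2, ... separate the shards.

```java
import java.util.ArrayList;
import java.util.List;

final class ShardSplits {

    /**
     * One-byte split points 1 .. numShards-1. Every row whose first byte is
     * shard s (followed by the day bytes) sorts after split s and before
     * split s+1, so each shard lands in its own tablet.
     */
    static List<byte[]> splitPoints(int numShards) {
        if (numShards < 1 || numShards > 256) {
            throw new IllegalArgumentException("numShards must be in 1..256");
        }
        List<byte[]> splits = new ArrayList<>();
        for (int shard = 1; shard < numShards; shard++) {
            splits.add(new byte[] {(byte) shard});
        }
        return splits;
    }

    public static void main(String[] args) {
        int numShards = 8; // assumed: one shard per tablet server
        System.out.println(splitPoints(numShards).size()); // prints 7
    }
}
```

With the Accumulo client, the split points would be wrapped in `Text` objects and passed to `TableOperations.addSplits(tableName, splits)`. Items 1 and 3 would map to `TableOperations.setProperty` calls for `table.compaction.major.ratio` and `table.bloom.enabled`. Check the property names against the version in use.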
>>>>>>>>
>>>>>>>> Do you think those ideas make sense? Furthermore, I have two
>>>>>>>> questions:
>>>>>>>>
>>>>>>>>    1. considering that a single entry is only 22 bytes but I'm
>>>>>>>>    going to scan ~1M records per query, do you think I should
>>>>>>>>    change the BatchScanner buffers somehow?
>>>>>>>>    2. anything else to improve the scan speed? Again, I don't care
>>>>>>>>    about the ingestion time
>>>>>>>>
>>>>>>>> Thanks for the help!
>>>>>>>>
>>>>>>>> --
>>>>>>>> Mario Pastorelli | TERALYTICS
>>>>>>>>
>>>>>>>> *software engineer*
>>>>>>>>
>>>>>>>> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
>>>>>>>> phone: +41794381682
>>>>>>>> email: mario.pastorelli@teralytics.ch
>>>>>>>> www.teralytics.net
>>>>>>>>
>>>>>>>> Company registration number: CH-020.3.037.709-7 | Trade register
>>>>>>>> Canton Zurich
>>>>>>>> Board of directors: Georg Polzer, Luciano Franceschina, Mark
>>>>>>>> Schmitz, Yann de Vries
>>>>>>>>
>>>>>>>> This e-mail message contains confidential information which is for
>>>>>>>> the sole attention and use of the intended recipient. Please notify
>>>>>>>> us at once if you think that it may not be intended for you and
>>>>>>>> delete it immediately.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
