accumulo-user mailing list archives

From Marc Reichman <mreich...@pixelforensics.com>
Subject Re: Feedback about techniques for tuning batch scanning for my problem
Date Mon, 23 May 2016 13:12:39 GMT
I've been successful with this same model, HDFS and TServers on the same
host, to take advantage of those short-circuit read settings. They make a
major difference if your calculation problem is read-I/O bound, which was the
case for my MapReduce/Spark applications. Depending on my row count or
precomputed table splits, I have seen anywhere from 5% to 62% improvement in
overall job execution time.

On Mon, May 23, 2016 at 6:09 AM, Josh Elser <josh.elser@gmail.com> wrote:

> This probably isn't a big issue unless you're running into stability
> issues with Accumulo. They're both designed to scale horizontally. Unless
> you have a reason that they can't be colocated, it's fine.
> On May 21, 2016 2:29 PM, "David Medinets" <david.medinets@gmail.com>
> wrote:
>
>> Why are Accumulo and Spark sharing the machines? Does Spark give you
>> any kind of data locality that Accumulo does? Could it be better to give
>> each the full amount of memory?
>> On May 21, 2016 1:15 PM, "Mario Pastorelli" <
>> mario.pastorelli@teralytics.ch> wrote:
>>
>>> Currently, setting the number of threads to either the number of servers
>>> or the number of cores yields similar performance for scanning with
>>> BatchScanner. Thanks for the advice, I will try to use half of the cores
>>> of each machine in the cluster.
>>>
>>> Anything else?
>>>
>>> On Sat, May 21, 2016 at 5:03 AM, David Medinets <
>>> david.medinets@gmail.com> wrote:
>>>
>>>> It's been a few years so I don't remember the specific property names.
>>>> Set one thread count to the number of servers times the number of cores to
>>>> start. Multiply by .5 if Spark is as active as Accumulo. Look in
>>>> Property.java for the property names.
>>>>
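The starting point described above (servers times cores, halved when Spark is about as busy as Accumulo) could be sketched as follows. The class and method names are hypothetical, the 10-server cluster size is made up for illustration, and the result is only a first guess to benchmark from, not an Accumulo setting.

```java
/** Starting-point heuristic for a scan thread count (a sketch, not an Accumulo API). */
final class ScanThreadHeuristic {

    /**
     * @param servers        number of tablet servers
     * @param coresPerServer cores per machine
     * @param accumuloShare  fraction of each machine left to Accumulo (assumption:
     *                       0.5 when Spark is about as busy, 1.0 when Accumulo runs alone)
     */
    static int initialThreads(int servers, int coresPerServer, double accumuloShare) {
        if (servers <= 0 || coresPerServer <= 0 || accumuloShare <= 0.0 || accumuloShare > 1.0) {
            throw new IllegalArgumentException("need positive counts and a share in (0, 1]");
        }
        int threads = (int) Math.floor(servers * coresPerServer * accumuloShare);
        return Math.max(1, threads); // never go below one thread
    }

    public static void main(String[] args) {
        // 32 cores per machine is from the thread; 10 servers is an invented cluster size.
        System.out.println(initialThreads(10, 32, 0.5)); // prints 160
    }
}
```

A number like this is what would be handed to `Connector.createBatchScanner` as `numQueryThreads`. Whether it helps has to be measured, since elsewhere in this thread going beyond 1 or 2 threads per tablet server made no visible difference.
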
>>>> On Fri, May 20, 2016 at 10:09 AM, Mario Pastorelli <
>>>> mario.pastorelli@teralytics.ch> wrote:
>>>>
>>>>> The machines have 32 cores shared between Accumulo and Spark. Each machine
>>>>> has 5 disks that hold HDFS and that Accumulo can use. How many threads
>>>>> should I use?
>>>>>
>>>>> On Fri, May 20, 2016 at 3:49 PM, David Medinets <
>>>>> david.medinets@gmail.com> wrote:
>>>>>
>>>>>> How many cores are on your servers? There are several thread counts
>>>>>> you can change. Even +1 thread per server counts at some point if you
>>>>>> have enough servers in the cluster.
>>>>>>
>>>>>> On Fri, May 20, 2016 at 2:54 AM, Mario Pastorelli <
>>>>>> mario.pastorelli@teralytics.ch> wrote:
>>>>>>
>>>>>>> You mean the BatchScanner number of threads? I've made it parametric
>>>>>>> and usually I use 1 or 2 threads per tablet server. Going up doesn't
>>>>>>> seem to do anything for the performance.
>>>>>>>
>>>>>>> On Thu, May 19, 2016 at 6:21 PM, David Medinets <
>>>>>>> david.medinets@gmail.com> wrote:
>>>>>>>
>>>>>>>> Have you tuned thread counts?
>>>>>>>> On May 19, 2016 11:08 AM, "Mario Pastorelli" <
>>>>>>>> mario.pastorelli@teralytics.ch> wrote:
>>>>>>>>
>>>>>>>>> Hey people,
>>>>>>>>> I'm trying to tune the query performance a bit to see how fast it
>>>>>>>>> can go, and I thought it would be great to have comments from the
>>>>>>>>> community. The problem that I'm trying to solve in Accumulo is the
>>>>>>>>> following: we want to store the entities that have been in a certain
>>>>>>>>> location on a certain day. The location is a Long and the entity id
>>>>>>>>> is a Long. I want to be able to scan ~1M rows in a few seconds,
>>>>>>>>> possibly less than one. Right now, I'm doing the following things:
>>>>>>>>>
>>>>>>>>>    1. I'm using a sharding byte at the start of the rowId to keep
>>>>>>>>>    the data in the same range distributed in the cluster
>>>>>>>>>    2. all the records are encoded; a single record is composed of
>>>>>>>>>       1. rowId: 1 shard byte + 3 bytes for the day
>>>>>>>>>       2. column family: 8 bytes for the long corresponding to the
>>>>>>>>>       hash of the location
>>>>>>>>>       3. column qualifier: 8 bytes corresponding to the identifier
>>>>>>>>>       of the entity
>>>>>>>>>       4. value: 2 bytes for some additional information
>>>>>>>>>    3. I use a batch scanner because I don't need sorting and it's
>>>>>>>>>    faster
>>>>>>>>>
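The record layout listed above can be sketched with plain `java.nio`. This is only an illustration: the class and method names are hypothetical, and the 3-byte day is assumed to be an unsigned big-endian day number, which the message does not specify.

```java
import java.nio.ByteBuffer;

/** Encodes the 22-byte record layout from the list above (illustrative sketch). */
final class LocationRecord {

    /** rowId: 1 shard byte followed by the day as 3 big-endian bytes (assumed encoding). */
    static byte[] rowId(byte shard, int day) {
        if (day < 0 || day > 0xFFFFFF) {
            throw new IllegalArgumentException("day must fit in 3 bytes");
        }
        return new byte[] { shard, (byte) (day >>> 16), (byte) (day >>> 8), (byte) day };
    }

    /** Column family (location hash) and column qualifier (entity id): 8 big-endian bytes. */
    static byte[] longBytes(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    /** Value: 2 bytes of additional information. */
    static byte[] value(short extra) {
        return ByteBuffer.allocate(Short.BYTES).putShort(extra).array();
    }

    public static void main(String[] args) {
        byte[] row = rowId((byte) 3, 16942);             // shard and day number are made up
        byte[] family = longBytes(0x1122334455667788L);  // made-up location hash
        byte[] qualifier = longBytes(42L);               // made-up entity id
        byte[] val = value((short) 7);
        // 4 + 8 + 8 + 2 bytes, matching the 22-byte entry size quoted in the message
        System.out.println(row.length + family.length + qualifier.length + val.length); // prints 22
    }
}
```

Big-endian encoding keeps non-negative values in numeric order under byte-wise comparison, which matters for the day in the rowId if ranges of days are ever scanned.
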
>>>>>>>>> As expected, it takes a few seconds to scan 1M rows, but now I'm
>>>>>>>>> wondering if I can improve it. My ideas are the following:
>>>>>>>>>
>>>>>>>>>    1. set table.compaction.major.ratio to 1 because I don't care
>>>>>>>>>    about the ingestion performance and this should improve the query
>>>>>>>>>    performance
>>>>>>>>>    2. pre-split tables to match the number of servers and then
>>>>>>>>>    use a shard byte as the first byte of the rowId. This should
>>>>>>>>>    improve both writing and reading the data because, from what I
>>>>>>>>>    understood, both should work in parallel
>>>>>>>>>    3. enable bloom filters on the table
>>>>>>>>>
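For the pre-splitting idea above, one way to derive split points from the shard byte is sketched below. It assumes shard values run from 0 to numShards - 1, which the message does not state, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Computes one-byte split points so that each shard prefix gets its own tablet (sketch). */
final class ShardSplits {

    /**
     * Returns numShards - 1 split points: the single bytes 1 .. numShards - 1.
     * Rows starting with shard byte s sort after the split point {s} and before
     * {s + 1}, so each shard falls into a separate tablet.
     */
    static List<byte[]> splitPoints(int numShards) {
        if (numShards < 1 || numShards > 256) {
            throw new IllegalArgumentException("numShards must be in [1, 256]");
        }
        List<byte[]> splits = new ArrayList<>();
        for (int shard = 1; shard < numShards; shard++) {
            splits.add(new byte[] { (byte) shard });
        }
        return splits;
    }

    public static void main(String[] args) {
        // 8 shards is an invented example; the thread suggests matching the server count.
        System.out.println(splitPoints(8).size()); // prints 7
    }
}
```

In a real client, each split point would be wrapped in a Hadoop `Text` and passed to `TableOperations.addSplits`. That step is left out here so the sketch runs without the Accumulo and Hadoop libraries.
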
>>>>>>>>> Do you think those ideas make sense? Furthermore, I have two
>>>>>>>>> questions:
>>>>>>>>>
>>>>>>>>>    1. considering that a single entry is only 22 bytes but I'm
>>>>>>>>>    going to scan ~1M records per query, do you think I should change
>>>>>>>>>    the BatchScanner buffers somehow?
>>>>>>>>>    2. anything else to improve the scan speed? Again, I don't
>>>>>>>>>    care about the ingestion time
>>>>>>>>>
>>>>>>>>> Thanks for the help!
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Mario Pastorelli | TERALYTICS
>>>>>>>>>
>>>>>>>>> *software engineer*
>>>>>>>>>
>>>>>>>>> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
>>>>>>>>> phone: +41794381682
>>>>>>>>> email: mario.pastorelli@teralytics.ch
>>>>>>>>> www.teralytics.net
>>>>>>>>>
>>>>>>>>> Company registration number: CH-020.3.037.709-7 | Trade register
>>>>>>>>> Canton Zurich
>>>>>>>>> Board of directors: Georg Polzer, Luciano Franceschina, Mark
>>>>>>>>> Schmitz, Yann de Vries
>>>>>>>>>
>>>>>>>>> This e-mail message contains confidential information which is for
>>>>>>>>> the sole attention and use of the intended recipient. Please notify
>>>>>>>>> us at once if you think that it may not be intended for you and
>>>>>>>>> delete it immediately.
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
