accumulo-user mailing list archives

From Marc Reichman <mreich...@pixelforensics.com>
Subject Re: Feedback about techniques for tuning batch scanning for my problem
Date Thu, 19 May 2016 15:53:57 GMT
Hi Mario,

Not sure where this plays into your data integrity, but have you looked
into these settings in hdfs-site.xml?
dfs.client.read.shortcircuit
dfs.client.read.shortcircuit.skip.checksum
dfs.domain.socket.path

These make for a somewhat dramatic increase in HDFS read performance,
provided the data is distributed well enough around the cluster.
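
For reference, the hdfs-site.xml entries would look something like this
(a sketch; the socket path below is just a common default and may differ
on your install):

    <property>
      <name>dfs.client.read.shortcircuit</name>
      <value>true</value>
    </property>
    <!-- skipping checksums is the part that trades integrity checking for speed -->
    <property>
      <name>dfs.client.read.shortcircuit.skip.checksum</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.domain.socket.path</name>
      <value>/var/lib/hadoop-hdfs/dn_socket</value>
    </property>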

I can't speak as much to the scanner params, but you may look into these as
well.

Marc

On Thu, May 19, 2016 at 10:08 AM, Mario Pastorelli <mario.pastorelli@teralytics.ch> wrote:

> Hey people,
> I'm trying to tune the query performance a bit to see how fast it can go,
> and I thought it would be great to have comments from the community. The
> problem that I'm trying to solve with Accumulo is the following: we want to
> store the entities that have been in a certain location on a certain day.
> The location is a Long and the entity id is a Long. I want to be able to
> scan ~1M rows in a few seconds, possibly in less than one. Right now, I'm
> doing the following things:
>
>    1. I'm using a sharding byte at the start of the rowId to keep the
>    data in the same range distributed across the cluster
>    2. all the records are encoded; a single record is composed of (see
>    the sketch after this list):
>       1. rowId: 1 shard byte + 3 bytes for the day
>       2. column family: 8 bytes for the long corresponding to the hash of
>       the location
>       3. column qualifier: 8 bytes corresponding to the identifier of the
>       entity
>       4. value: 2 bytes for some additional information
>    3. I use a batch scanner because I don't need sorting and it's faster
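> 
> To make that concrete, the encoding looks roughly like this (just a
> sketch; the method and variable names are illustrative):
> 
>     import java.nio.ByteBuffer;
>     import org.apache.accumulo.core.data.Mutation;
>     import org.apache.accumulo.core.data.Value;
> 
>     // build one record: 22 bytes total across key and value
>     Mutation encode(byte shard, int day, long locationHash, long entityId,
>                     byte[] extraInfo) {
>         byte[] row = new byte[4];
>         row[0] = shard;               // 1 shard byte
>         row[1] = (byte) (day >>> 16); // 3 bytes for the day, big-endian
>         row[2] = (byte) (day >>> 8);
>         row[3] = (byte) day;
>         byte[] cf = ByteBuffer.allocate(8).putLong(locationHash).array();
>         byte[] cq = ByteBuffer.allocate(8).putLong(entityId).array();
>         Mutation m = new Mutation(row);
>         m.put(cf, cq, new Value(extraInfo)); // 2-byte value
>         return m;
>     }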
>
> As expected, it takes a few seconds to scan 1M rows, but now I'm wondering
> if I can improve on that. My ideas are the following (sketched in code
> after the list):
>
>    1. set table.compaction.major.ratio to 1, because I don't care about
>    ingestion performance and this should improve query performance
>    2. pre-split the table to match the number of servers and then use a
>    shard byte as the first byte of the rowId. This should improve both
>    writing and reading the data because, as far as I understand, both
>    should then work in parallel
>    3. enable the bloom filter on the table
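> 
> Concretely, I would apply ideas 1-3 with something like this (a sketch,
> assuming an existing Connector; the table name "locations" and numShards
> are placeholders):
> 
>     import java.util.SortedSet;
>     import java.util.TreeSet;
>     import org.apache.accumulo.core.client.Connector;
>     import org.apache.accumulo.core.client.admin.TableOperations;
>     import org.apache.hadoop.io.Text;
> 
>     void tuneTable(Connector connector, int numShards) throws Exception {
>         TableOperations ops = connector.tableOperations();
>         ops.setProperty("locations", "table.compaction.major.ratio", "1"); // idea 1
>         ops.setProperty("locations", "table.bloom.enabled", "true");       // idea 3
>         // idea 2: one split point per shard byte, so tablets spread evenly
>         SortedSet<Text> splits = new TreeSet<>();
>         for (int s = 1; s < numShards; s++) {
>             splits.add(new Text(new byte[] { (byte) s }));
>         }
>         ops.addSplits("locations", splits);
>     }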
>
> Do you think those ideas make sense? Furthermore, I have two questions:
>
>    1. considering that a single entry is only 22 bytes but I'm going to
>    scan ~1M records per query, do you think I should change the BatchScanner
>    buffers somehow? (my current scan loop is sketched after this list)
>    2. anything else to improve the scan speed? Again, I don't care about
>    the ingestion time
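> 
> For reference, my current scan looks roughly like this (a sketch; the
> thread count, table name, and ranges are placeholders):
> 
>     import java.util.Collection;
>     import java.util.Map;
>     import org.apache.accumulo.core.client.BatchScanner;
>     import org.apache.accumulo.core.client.Connector;
>     import org.apache.accumulo.core.data.Key;
>     import org.apache.accumulo.core.data.Range;
>     import org.apache.accumulo.core.data.Value;
>     import org.apache.accumulo.core.security.Authorizations;
> 
>     void scanDay(Connector connector, Collection<Range> ranges) throws Exception {
>         int queryThreads = 10; // placeholder: roughly one per tablet server
>         BatchScanner bs = connector.createBatchScanner("locations",
>                 Authorizations.EMPTY, queryThreads);
>         bs.setRanges(ranges); // one Range per (shard byte + day) prefix
>         for (Map.Entry<Key, Value> entry : bs) {
>             // decode the 8-byte location hash (CF) and 8-byte entity id (CQ)
>         }
>         bs.close();
>     }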
>
> Thanks for the help!
>
> --
> Mario Pastorelli | TERALYTICS
>
> *software engineer*
>
> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
> phone: +41794381682
> email: mario.pastorelli@teralytics.ch
> www.teralytics.net
>
