accumulo-user mailing list archives

From Andrew Hulbert <ahulb...@ccri.com>
Subject Re: Optimize Accumulo scan speed
Date Sun, 10 Apr 2016 17:00:57 GMT
I wonder if doing a full compaction on the table in the shell might help 
some as well, though I don't know that it will vastly increase performance. 
The other option is lowering the split size for tablets for more 
parallelism, but that probably isn't scalable.

Back to the original query plan, I wonder if the 300 seeks could be 
reduced somehow by forming tighter ranges. Are you able to get any 
timing on a scan of a range without the seeks?
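
One possible reading of the suggestion above is to coalesce the geohex ranges before handing them to the scanner, so that cells which overlap or sit next to each other in the linearised order cost one seek instead of several. The sketch below is only an illustration: it assumes cells can be represented as inclusive integer intervals, and `coalesce` and `max_gap` are hypothetical names, not part of Accumulo. With `max_gap=0` no extra rows are read; a larger gap trades extra scanned rows for fewer seeks. For real `Range` objects, Accumulo's `Range.mergeOverlapping` covers the overlapping case.

```python
from typing import List, Tuple

# Assumption: a range is an inclusive (start, end) pair of linearised cell ids.
CellRange = Tuple[int, int]


def coalesce(ranges: List[CellRange], max_gap: int = 0) -> List[CellRange]:
    """Merge ranges that overlap, touch, or lie within max_gap ids of each other."""
    merged: List[CellRange] = []
    for start, end in sorted(ranges):
        # Merge when the next range begins no later than max_gap ids past
        # the id that directly follows the previous range's end.
        if merged and start <= merged[-1][1] + 1 + max_gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    cells = [(5, 7), (1, 3), (4, 4), (10, 12)]
    print(coalesce(cells))             # adjacent cells collapse into one range
    print(coalesce(cells, max_gap=2))  # also bridges the small hole at 8..9
```

Timing the scan with the coalesced ranges against the original 300 would show how much of the 17 seconds is seek overhead rather than data transfer.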

On 04/10/2016 12:47 PM, Mario Pastorelli wrote:
> I'm using a BatchScanner because I don't care about the order.
>
> The sharding is indeed a good idea which I've already tested in the 
> past. The only problem that I've found with it is that there is no way 
> to be sure that the n ranges will be evenly distributed among the n 
> machines. Tablets are mapped to blocks and HDFS decides where to put 
> them so you could end up with two or more tablets of the same range 
> but different shards put on the same machine and disk.
>
> Anyway, performance was better than without sharding, so I will 
> re-enable it and do some tests with the number of shards.
>
> On Sun, Apr 10, 2016 at 5:25 PM, Andrew Hulbert <ahulbert@ccri.com> wrote:
>
>     Mario,
>
>     Are you using a Scanner or a BatchScanner?
>
>     One thing we did in the past with a geohash-based schema was to
>     prefix a shard ID in front of the geohash, which allows you to
>     involve all the tservers in the scan. You'd multiply your ranges
>     by the number of tservers you have, but if the client is not the
>     bottleneck then it may increase your throughput.
>
>     Andrew
>
>
>     On 04/10/2016 11:05 AM, Mario Pastorelli wrote:
>>     Hi,
>>
>>     I'm currently having some scan speed issues with Accumulo and I
>>     would like to understand why and how I can solve them. I have
>>     geographical data and I use as primary key the day and then the
>>     geohex, which is a linearisation of lat and lon. The reason for
>>     this key is that I always query the data for one day but for a
>>     set of geohexes which represent a zone, so with this schema I can
>>     use a single scan to read all the data for one day with few
>>     seeks. My problem is that the scan is painfully slow: for
>>     instance, to read 5617019 rows it takes around 17 seconds and the
>>     scan speed is 13MB/s, less than 750k scan entries/s and around
>>     300 seeks. I enabled the tracer and this is what I got:
>>
>>     17325+0 Dice@srv1 Dice.query
>>     11+1 Dice@srv1 scan
>>     11+1 Dice@srv1 scan:location
>>     5+13 Dice@srv1 scan
>>     5+13 Dice@srv1 scan:location
>>     4+19 Dice@srv1 scan
>>     4+19 Dice@srv1 scan:location
>>     5+23 Dice@srv1 scan
>>     4+24 Dice@srv1 scan:location
>>
>>     I'm not sure how to speed up the scanning. I have the following
>>     questions:
>>       - is this speed normal?
>>       - can I involve more servers in the scan? Right now only two
>>     servers have the ranges but with a cluster of 15 machines it would
>>     be nice to involve more of them. Is it possible?
>>
>>     Thanks,
>>     Mario
>>
>>     -- 
>>     Mario Pastorelli| TERALYTICS
>>
>>     *software engineer*
>>
>>     Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
>>     phone: +41794381682
>>     email: mario.pastorelli@teralytics.ch
>>     www.teralytics.net
>>
>>     Company registration number: CH-020.3.037.709-7 | Trade register
>>     Canton Zurich
>>     Board of directors: Georg Polzer, Luciano Franceschina, Mark
>>     Schmitz, Yann de Vries
>>
>>     This e-mail message contains confidential information which is
>>     for the sole attention and use of the intended recipient. Please
>>     notify us at once if you think that it may not be intended for
>>     you and delete it immediately.
>>
>
>
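
The shard-prefix scheme discussed in the quoted messages can be sketched roughly as follows. This is a minimal illustration, not the schema either poster actually used: the row layout `<shard>_<day>_<geohex>`, the CRC32-based shard choice, and the names `shard_of`, `row_key` and `fan_out` are all assumptions made for the example. It shows why the number of ranges handed to a BatchScanner is multiplied by the shard count.

```python
import zlib
from typing import List, Tuple

NUM_SHARDS = 15  # assumption: one shard per machine in the 15-node cluster


def shard_of(geohex: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a stable shard for a cell (CRC32, since hash() is salted per process)."""
    return zlib.crc32(geohex.encode("utf-8")) % num_shards


def row_key(day: str, geohex: str, num_shards: int = NUM_SHARDS) -> str:
    """Hypothetical row layout: zero-padded shard, then day, then geohex."""
    return f"{shard_of(geohex, num_shards):02d}_{day}_{geohex}"


def fan_out(day: str, ranges: List[Tuple[str, str]],
            num_shards: int = NUM_SHARDS) -> List[Tuple[str, str]]:
    """Repeat every inclusive (start, end) geohex range once per shard prefix."""
    return [
        (f"{shard:02d}_{day}_{start}", f"{shard:02d}_{day}_{end}")
        for shard in range(num_shards)
        for start, end in ranges
    ]


if __name__ == "__main__":
    zone = [("a0", "a9"), ("c0", "c3")]
    scan_ranges = fan_out("20160410", zone)
    print(len(scan_ranges))               # 2 ranges x 15 shards
    print(row_key("20160410", "a5"))      # lands in exactly one of those ranges
```

Whether the shards end up on different machines still depends on where the tablets are hosted, which is the concern raised in the reply above; pre-splitting the table at the shard prefixes is one way to at least give each shard its own tablet.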

