accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: Optimize Accumulo scan speed
Date Mon, 11 Apr 2016 03:38:28 GMT


Mario Pastorelli wrote:
> Hi,
>
> I'm currently having some scan speed issues with Accumulo and I would
> like to understand why and how I can solve them. I have geographical data
> and I use as primary key the day and then the geohex, which is a
> linearisation of lat and lon. The reason for this key is that I always
> query the data for one day but for a set of geohexes which represent a
> zone, so with this schema I can use a single scan to read all the
> data for one day with few seeks. My problem is that the scan is
> painfully slow: for instance, reading 5617019 rows takes around 17
> seconds, and the scan speed is 13MB/s, less than 750k scan entries/s, with
> around 300 seeks. I enabled the tracer and this is what I got:

13MB/s sounds like you're only actually querying one TabletServer. Dave 
and Andrew hit the nail on the head suggesting some sharding on the 
rowId. That will help get more servers involved in servicing your query.
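
As a rough illustration of what sharding the rowId could look like (a sketch 
only, not Mario's actual schema): prefix each row with a small shard number 
derived from the geohex, so one day's data is spread over several tablets. 
The shard count of 16, the "shard_day_geohex" layout and all class and 
method names below are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ShardedRowId {
    // Assumption: 16 shards. In practice this would be tuned to the cluster size.
    public static final int NUM_SHARDS = 16;

    // Deterministic shard for a geohex; floorMod keeps the result non-negative
    // even when hashCode() is negative.
    public static int shardOf(String geohex) {
        return Math.floorMod(geohex.hashCode(), NUM_SHARDS);
    }

    // Hypothetical row layout: <2-digit shard>_<day>_<geohex>.
    public static String rowId(String day, String geohex) {
        return String.format(Locale.ROOT, "%02d_%s_%s", shardOf(geohex), day, geohex);
    }

    // One row prefix per shard for a given day. A query for a day has to
    // cover all of these prefixes, since the geohexes are spread across shards.
    public static List<String> rowPrefixesForDay(String day) {
        List<String> prefixes = new ArrayList<>();
        for (int shard = 0; shard < NUM_SHARDS; shard++) {
            prefixes.add(String.format(Locale.ROOT, "%02d_%s_", shard, day));
        }
        return prefixes;
    }

    public static void main(String[] args) {
        System.out.println(rowId("20160411", "abc"));
        for (String prefix : rowPrefixesForDay("20160411")) {
            System.out.println(prefix);
        }
    }
}
```

The trade-off is that every geohex range in the query turns into one range 
per shard. Those ranges could be handed to a BatchScanner rather than a 
single Scanner so they are fetched in parallel, and the table would likely 
need split points at the shard prefixes for the shards to end up on 
different servers.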

You can also try turning on TRACE logging via log4j on 
org.apache.accumulo.core.client.impl. That should give you some insight 
about what the client is actually doing WRT RPCs.
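
For example, assuming the client application is configured through a log4j 
1.x properties file on its classpath (an assumption about the setup, adjust 
to however logging is actually configured), a line like this would enable 
it:

```
# Assumes log4j 1.x properties syntax; enables TRACE for the Accumulo client internals
log4j.logger.org.apache.accumulo.core.client.impl=TRACE
```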

> 17325+0 Dice@srv1 Dice.query
> 11+1 Dice@srv1 scan 11+1 Dice@srv1 scan:location
> 5+13 Dice@srv1 scan 5+13 Dice@srv1 scan:location
> 4+19 Dice@srv1 scan 4+19 Dice@srv1 scan:location
> 5+23 Dice@srv1 scan 4+24 Dice@srv1 scan:location
> I'm not sure how to speed up the scanning. I have the following questions:
>    - is this speed normal?
>    - can I involve more servers in the scan? Right now only two servers
> have the ranges but with a cluster of 15 machines it would be nice to
> involve more of them. Is it possible?
>
> Thanks,
> Mario

