accumulo-user mailing list archives

From Mario Pastorelli <mario.pastore...@teralytics.ch>
Subject Re: Optimize Accumulo scan speed
Date Mon, 11 Apr 2016 09:08:19 GMT
1. Tablets are spread out evenly: 34 tablets per server, to be precise, and
more or less the same number of entries per server.
2. The number of threads is set to the minimum of the number of ranges
that I have and the number of servers that I have. Not sure if this makes
sense.
3. We didn't set table.scan.max.memory, so I think Accumulo is using the
default. We also have powerful machines, so we should probably configure
this. Any hint about how to set it?
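For what it's worth, table.scan.max.memory controls how much data a tserver
batches up before shipping a result set back to the scan client, so with
powerful machines and large sequential scans it is often raised above the
default. A sketch of setting it from the Accumulo shell (the table name and
the 4M value below are just placeholders, not values tested on this cluster):

```shell
# In the Accumulo shell: raise the per-scan result buffer for one table.
# Table name and size are illustrative; tune against your own workload.
config -t mytable -s table.scan.max.memory=4M
# Show the effective value to confirm the override took:
config -t mytable -f table.scan.max.memory
```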

I've tried today a linear scan of my table from one random entry and I got
very bad performance. To read 10000000 entries it takes 35018 ms, with the
scan at around 10MB/s and, surprisingly, a lot of seeks (~500). Considering
this is a linear scan, I'm not sure why there are so many seeks. I'm using
the following code:

def testScanSpeed(numEntries: Long): (LocalDate, Long, Long, Long) = {
  require(numEntries >= 0, s"Number of entries to be scanned must be >= 0")
  val numDays = Days.daysBetween(firstValidDate, endValidDate).getDays()
  val randomDay = firstValidDate plusDays Random.nextInt(numDays)
  val randomHexId = Random.nextLong()
  val startKey = firstKeyForDayHexId(randomDay, randomHexId)
  // Range from startKey (inclusive) to the end of the table
  val range = new Range(startKey, null.asInstanceOf[Key], true, false, false, true)
  val scanner = this.accumulo.connector.createScanner(tableName, this.accumuloAuthorizations)
  scanner.setRange(range)
  val iterator = scanner.iterator()
  var numRead = 0L
  DistributedTrace.enable(this.tracerHostname, "Dice")
  val traceScope = Trace.startSpan("Dice.testScanSpeed", Sampler.ALWAYS)
  while (numRead < numEntries && iterator.hasNext) {
    val _ = iterator.next()
    numRead += 1L
  }
  val elapsedMillis = traceScope.getSpan.getAccumulatedMillis
  traceScope.close()
  DistributedTrace.disable()
  scanner.close()
  (randomDay, randomHexId, numRead, elapsedMillis)
}
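As a sanity check on those figures (assuming 35018 is the elapsed
milliseconds returned by the method above), a quick back-of-envelope
computation:

```scala
// Back-of-envelope throughput from the reported measurements.
// Assumes 35018 is elapsed milliseconds and the ~10MB/s figure is accurate.
val entries = 10000000L                         // entries read in the test
val elapsedMs = 35018L                          // reported elapsed time
val entriesPerSec = entries * 1000 / elapsedMs  // ≈ 285k entries/s
val bytesPerEntry = 10.0e6 / entriesPerSec      // ≈ 35 bytes per entry
println(f"$entriesPerSec%d entries/s, ~$bytesPerEntry%.0f bytes/entry")
```

At roughly 35 bytes per entry, the 10MB/s would be dominated by per-entry
overhead rather than raw byte volume, which would be consistent with the
seeks, not disk bandwidth, being the bottleneck.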

On Sun, Apr 10, 2016 at 11:47 PM, <dlmarion@comcast.net> wrote:

> Some other thoughts in addition to the sharding:
>
>
>
> 1. Are your tablets spread out evenly across your tablet servers?
>
> 2. How many threads are you using in your batch scanner?
>
> 3. What is the table.scan.max.memory setting?
>
>
>
> *From:* Andrew Hulbert [mailto:ahulbert@ccri.com]
> *Sent:* Sunday, April 10, 2016 1:01 PM
> *To:* user@accumulo.apache.org
> *Subject:* Re: Optimize Accumulo scan speed
>
>
>
> I wonder if doing a full compaction on the table in the shell might help
> some as well... though I don't know that it will vastly increase
> performance. The other option is lowering the split size for tablets for
> more parallelism, but that probably isn't scalable.
>
> Back to the original query plan, I wonder if the 300 seeks could be
> reduced somehow by forming tighter ranges... are you able to get any
> timing on a scan of a range without the seeks?
>
> On 04/10/2016 12:47 PM, Mario Pastorelli wrote:
>
> I'm using a BatchScanner because I don't care about the order.
>
> The sharding is indeed a good idea, which I've already tested in the past.
> The only problem that I've found with it is that there is no way to be
> sure that the n ranges will be evenly distributed among the n machines.
> Tablets are mapped to blocks and HDFS decides where to put them, so you
> could end up with two or more tablets of the same range but different
> shards placed on the same machine and disk.
>
> Anyway, performance was better than without sharding, so I will re-enable
> it and do some tests with the number of shards.
>
>
>
> On Sun, Apr 10, 2016 at 5:25 PM, Andrew Hulbert <ahulbert@ccri.com> wrote:
>
> Mario,
>
> Are you using a Scanner or a BatchScanner?
>
> One thing we did in the past with a geohash-based schema was to prefix a
> shard ID in front of the geohash that allows you to involve all the
> tservers in the scan. You'd multiply your ranges by the number of tservers
> you have but if the client is not the bottleneck then it may increase your
> throughput.
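
A minimal sketch of that shard-prefix idea (the key layout, shard count, and
helper names here are hypothetical, not the schema from this thread): derive
a small shard id deterministically from the geohex, prepend it to the row
key, and at query time fan out one range per shard:

```scala
// Hypothetical sharded row-key layout: <shard>_<day>_<geohex-in-hex>.
// A deterministic hash of the geohex picks the shard, so entries spread
// across tservers and one day's query becomes numShards parallel prefixes.
object ShardedKeys {
  val numShards = 15 // e.g. one shard per tserver

  def shardFor(geohex: Long): Int =
    Math.floorMod(geohex, numShards.toLong).toInt // always in [0, numShards)

  def rowKey(day: String, geohex: Long): String =
    f"${shardFor(geohex)}%02d_${day}_$geohex%016x"

  // At query time, scan every shard for the same day.
  def dayPrefixes(day: String): Seq[String] =
    (0 until numShards).map(s => f"$s%02d_$day")
}
```

Each prefix would then become one Range handed to a BatchScanner, so every
tserver hosting a shard can work on the same day's query in parallel.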
>
> Andrew
>
>
>
> On 04/10/2016 11:05 AM, Mario Pastorelli wrote:
>
> Hi,
>
> I'm currently having some scan speed issues with Accumulo and I would
> like to understand why and how I can solve them. I have geographical data
> and I use as primary key the day and then the geohex, which is a
> linearisation of lat and lon. The reason for this key is that I always
> query the data for one day but for a set of geohexes which represent a
> zone, so with this schema I can use a single scan to read all the data
> for one day with few seeks. My problem is that the scan is painfully
> slow: for instance, to read 5617019 rows it takes around 17 seconds, with
> a scan speed of 13MB/s, less than 750k scan entries/s and around 300
> seeks. I enabled the tracer and this is what I've got:
>
> 17325+0 Dice@srv1 Dice.query
> 11+1 Dice@srv1 scan
> 11+1 Dice@srv1 scan:location
> 5+13 Dice@srv1 scan
> 5+13 Dice@srv1 scan:location
> 4+19 Dice@srv1 scan
> 4+19 Dice@srv1 scan:location
> 5+23 Dice@srv1 scan
> 4+24 Dice@srv1 scan:location
>
> I'm not sure how to speed up the scanning. I have the following questions:
>
>   - is this speed normal?
>
>   - can I involve more servers in the scan? Right now only two servers
> have the ranges, but with a cluster of 15 machines it would be nice to
> involve more of them. Is it possible?
>
> Thanks,
>
> Mario
>
>
>
> --
>
> Mario Pastorelli | TERALYTICS
>
> *software engineer*
>
> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
> phone: +41794381682
> email: mario.pastorelli@teralytics.ch
> www.teralytics.net
>
> Company registration number: CH-020.3.037.709-7 | Trade register Canton
> Zurich
> Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz, Yann
> de Vries
>
> This e-mail message contains confidential information which is for the
> sole attention and use of the intended recipient. Please notify us at once
> if you think that it may not be intended for you and delete it immediately.
>
>
>
>
>
>



