accumulo-user mailing list archives

From Drew Farris <drew.far...@gmail.com>
Subject Re: Optimize Accumulo scan speed
Date Mon, 11 Apr 2016 17:26:55 GMT
Mario,

If I'm reading that code correctly, it appears that you're using a regular
scanner instead of a batch scanner:

    val scanner = this.accumulo.connector.createScanner(tableName,
        this.accumuloAuthorizations)

You likely want to use createBatchScanner(..) instead.
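For reference, a minimal sketch of the switch (variable names are taken from
your snippet; ranges and the thread count are assumptions you'd fill in from
your day/geohex keys):

    import scala.collection.JavaConverters._ // for .asJava on a Scala collection
    val numThreads = 10 // hypothetical: roughly one thread per tserver holding data
    val scanner = this.accumulo.connector.createBatchScanner(tableName,
        this.accumuloAuthorizations, numThreads)
    scanner.setRanges(ranges.asJava) // setRanges takes a java.util.Collection[Range]
    try {
      val it = scanner.iterator()
      while (it.hasNext) it.next()
    } finally {
      scanner.close() // releases the scanner's thread pool
    }

Note that a BatchScanner returns entries in no particular order, which sounds
fine for your use case.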

Drew

On Mon, Apr 11, 2016, 5:08 AM Mario Pastorelli <
mario.pastorelli@teralytics.ch> wrote:

> 1. Tablets are spread out evenly: 34 tablets per server, to be precise, and
> more or less the same number of entries per server.
> 2. The number of threads is set to the minimum of the number of ranges that
> I have and the number of servers that I have. Not sure if this makes sense.
> 3. We didn't set table.scan.max.memory, so I think Accumulo is using the
> default. We also have powerful machines, so probably we should configure
> this. Any hint about how to set it? (Sketch below.)
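>
> I assume it can be set per table through the client API, along these lines
> (the 4M value is just a placeholder, not a recommendation):
>
>     // hypothetical: raise the per-scan-session memory for this table
>     this.accumulo.connector.tableOperations()
>       .setProperty(tableName, "table.scan.max.memory", "4M")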
>
> I tried a linear scan of my table today, starting from one random entry, and
> I got very bad performance. Reading 10000000 entries takes 35018 ms, with a
> scan rate of around 10MB/s and, surprisingly, a lot of seeks (~500).
> Considering this is a linear scan, I'm not sure why there are so many seeks.
> I'm using the following code:
>
> def testScanSpeed(numEntries: Long): (LocalDate, Long, Long, Long) = {
>   require(numEntries >= 0, "Number of entries to be scanned must be >= 0")
>   val numDays = Days.daysBetween(firstValidDate, endValidDate).getDays()
>   val randomDay = firstValidDate plusDays Random.nextInt(numDays)
>   val randomHexId = Random.nextLong()
>   val startKey = firstKeyForDayHexId(randomDay, randomHexId)
>   val range = new Range(startKey, null.asInstanceOf[Key], true, false, false, true)
>   val scanner = this.accumulo.connector.createScanner(tableName, this.accumuloAuthorizations)
>   scanner.setRange(range)
>   val iterator = scanner.iterator()
>   var numRead = 0L
>   DistributedTrace.enable(this.tracerHostname, "Dice")
>   val traceScope = Trace.startSpan("Dice.testScanSpeed", Sampler.ALWAYS)
>   while (numRead < numEntries && iterator.hasNext) {
>     iterator.next()
>     numRead += 1L
>   }
>   val elapsedMillis = traceScope.getSpan.getAccumulatedMillis
>   traceScope.close()
>   DistributedTrace.disable()
>   scanner.close()
>   (randomDay, randomHexId, numRead, elapsedMillis)
> }
>
> On Sun, Apr 10, 2016 at 11:47 PM, <dlmarion@comcast.net> wrote:
>
>> Some other thoughts in addition to the sharding:
>>
>>
>>
>> 1. Are your tablets spread out evenly across your tablet servers?
>>
>> 2. How many threads are you using in your batch scanner?
>>
>> 3. What is the table.scan.max.memory setting?
>>
>>
>>
>> *From:* Andrew Hulbert [mailto:ahulbert@ccri.com]
>> *Sent:* Sunday, April 10, 2016 1:01 PM
>> *To:* user@accumulo.apache.org
>> *Subject:* Re: Optimize Accumulo scan speed
>>
>>
>>
>> I wonder if doing a full compaction on the table in the shell might help
>> some as well... though I don't know that it will vastly increase performance.
>> The other option is lowering the split size for tablets for more parallelism,
>> but that probably isn't scalable. (Sketch of both below.)
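>>
>> For reference, both can also be done from the client API (a sketch; the
>> connector and table name are placeholders):
>>
>>     // hypothetical: full major compaction over the whole table, waiting for it
>>     connector.tableOperations().compact("mytable", null, null, true, true)
>>     // hypothetical: lower the split threshold so tablets split sooner
>>     connector.tableOperations().setProperty("mytable", "table.split.threshold", "256M")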
>>
>> Back to the original query plan, I wonder if the 300 seeks could be
>> reduced somehow by forming tighter ranges... Are you able to get any timing
>> on a scan of a range without the seeks?
>>
>> On 04/10/2016 12:47 PM, Mario Pastorelli wrote:
>>
>> I'm using a BatchScanner because I don't care about the order.
>>
>> The sharding is indeed a good idea, which I've already tested in the past.
>> The only problem I've found with it is that there is no way to be sure
>> that the n ranges will be evenly distributed among the n machines. Tablets
>> are mapped to blocks and HDFS decides where to put them, so you could end
>> up with two or more tablets of the same range but different shards placed
>> on the same machine and disk.
>>
>> Anyway, performance was better than without sharding, so I will re-enable
>> it and do some tests with the number of shards.
>>
>>
>>
>> On Sun, Apr 10, 2016 at 5:25 PM, Andrew Hulbert <ahulbert@ccri.com>
>> wrote:
>>
>> Mario,
>>
>> Are you using a Scanner or a BatchScanner?
>>
>> One thing we did in the past with a geohash-based schema was to prefix a
>> shard ID in front of the geohash, which allows you to involve all the
>> tservers in the scan. You'd multiply your ranges by the number of tservers
>> you have, but if the client is not the bottleneck then it may increase your
>> throughput. (Sketch below.)
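>>
>> A rough sketch of that fan-out, assuming a shard_day_geohex row layout
>> (numShards, day and geohexesForZone are placeholders for whatever you
>> already compute):
>>
>>     // hypothetical: one range per (shard, geohex) pair so every tserver takes part
>>     val numShards = 15
>>     val ranges = for {
>>       shard  <- 0 until numShards
>>       geohex <- geohexesForZone
>>     } yield Range.prefix(f"$shard%02d_${day}_$geohex")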
>>
>> Andrew
>>
>>
>>
>> On 04/10/2016 11:05 AM, Mario Pastorelli wrote:
>>
>> Hi,
>>
>> I'm currently having some scan speed issues with Accumulo and I would
>> like to understand why, and how I can solve them. I have geographical data,
>> and I use as primary key the day and then the geohex, which is a
>> linearisation of lat and lon. The reason for this key is that I always
>> query the data for one day, but for a set of geohexes which represent a
>> zone, so with this schema I can use a single scan to read all the data for
>> one day with few seeks. My problem is that the scan is painfully slow: for
>> instance, reading 5617019 rows takes around 17 seconds, with a scan speed
>> of 13MB/s, less than 750k scan entries/s, and around 300 seeks. I enabled
>> the tracer and this is what I got:
>>
>> 17325+0 Dice@srv1 Dice.query
>>    11+1 Dice@srv1 scan
>>    11+1 Dice@srv1 scan:location
>>    5+13 Dice@srv1 scan
>>    5+13 Dice@srv1 scan:location
>>    4+19 Dice@srv1 scan
>>    4+19 Dice@srv1 scan:location
>>    5+23 Dice@srv1 scan
>>    4+24 Dice@srv1 scan:location
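>>
>> For concreteness, the row key looks roughly like this (a simplified sketch
>> of the layout, not the exact code):
>>
>>     // simplified: day first, then the geohex, so one day's zone
>>     // is covered by a few contiguous ranges
>>     def rowKey(day: LocalDate, geohex: Long): String = {
>>       val dayStr = day.toString("yyyyMMdd") // Joda-Time pattern, assumed
>>       f"${dayStr}_$geohex%016x"
>>     }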
>>
>> I'm not sure how to speed up the scanning. I have the following questions:
>>
>>   - is this speed normal?
>>
>>   - can I involve more servers in the scan? Right now only two servers
>> have the ranges, but with a cluster of 15 machines it would be nice to
>> involve more of them. Is that possible?
>>
>> Thanks,
>>
>> Mario
