From: Dylan Hutchison
Date: Wed, 14 Sep 2016 06:52:42 -0700
Subject: Re: Accumulo Seek performance
To: Accumulo User List <user@accumulo.apache.org>

Do we have a (hopefully reproducible) conclusion from this thread, regarding Scanners and BatchScanners?


On Sep 13, 2016 11:17 PM, "Josh Elser" <josh.elser@gmail.com> wrote:
Yeah, this seems to have been OS X causing me grief.

Spun up a 3-tserver cluster (on OpenStack, even) and reran the same experiment. I could not reproduce the issues, even without substantial config tweaking.

Josh Elser wrote:
I'm playing around with this a little more today and something is
definitely weird on my local machine. I'm seeing insane spikes in
performance using Scanners too.

Coupled with Keith's inability to reproduce this, I am starting to think these are not numbers worth putting weight behind. Something I
haven't been able to figure out is quite screwy on my machine.

Josh Elser wrote:
Sven, et al:

So, it would appear that I have been able to reproduce this one (better
late than never, I guess...). tl;dr Serially using Scanners to do point
lookups instead of a BatchScanner is ~20x faster. This sounds like a
pretty serious performance issue to me.

Here's a general outline for what I did.

* Accumulo 1.8.0
* Created a table with 1M rows, each row with 10 columns using YCSB
(workloada)
* Split the table into 9 tablets
* Computed the set of all rows in the table

For a number of iterations:
* Shuffle this set of rows
* Choose the first N rows
* Construct an equivalent set of Ranges from the set of Rows, choosing a random column (0-9)
* Partition the N rows into X collections
* Submit X tasks to query one partition of the N rows (to a thread pool
with X fixed threads)
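
For anyone following along without the benchmark code, a minimal sketch of one iteration might look like the following. This is not the benchmark code itself: the allRows list, the "f" family and "field0".."field9" qualifiers, and the n/x knobs are placeholders for illustration, and error handling is omitted.

    import org.apache.accumulo.core.data.Range;
    import org.apache.hadoop.io.Text;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    // Hedged sketch: build one iteration's work from a precomputed row list.
    static List<List<Range>> buildPartitions(List<Text> allRows, int n, int x) {
      Collections.shuffle(allRows);                // shuffle the full row set
      List<Text> chosen = allRows.subList(0, n);   // take the first N rows

      Random rand = new Random();
      List<Range> ranges = new ArrayList<>(chosen.size());
      for (Text row : chosen) {
        // one random column (0-9) per row, as in the outline above
        Text qualifier = new Text("field" + rand.nextInt(10));
        ranges.add(Range.exact(row, new Text("f"), qualifier));
      }

      // partition the N ranges into X roughly equal collections
      List<List<Range>> partitions = new ArrayList<>();
      int chunk = (int) Math.ceil(ranges.size() / (double) x);
      for (int i = 0; i < ranges.size(); i += chunk) {
        partitions.add(new ArrayList<>(ranges.subList(i, Math.min(i + chunk, ranges.size()))));
      }
      return partitions;
    }

Each partition then becomes one task submitted to a fixed pool built with Executors.newFixedThreadPool(X).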

I have two implementations of these tasks: one where all ranges in a
partition are executed via one BatchScanner, and a second where each range is executed serially using its own Scanner. The numbers speak for themselves.

** BatchScanners **
2016-09-10 17:51:38,811 [joshelser.YcsbBatchScanner] INFO : Shuffled all rows
2016-09-10 17:51:38,843 [joshelser.YcsbBatchScanner] INFO : All ranges
calculated: 3000 ranges found
2016-09-10 17:51:38,846 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 40178 ms
2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 42296 ms
2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:53:47,414 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 46094 ms
2016-09-10 17:53:47,415 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:54:35,118 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 47704 ms
2016-09-10 17:54:35,119 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:55:24,339 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 49221 ms

** Scanners **
2016-09-10 17:57:23,867 [joshelser.YcsbBatchScanner] INFO : Shuffled all rows
2016-09-10 17:57:23,898 [joshelser.YcsbBatchScanner] INFO : All ranges
calculated: 3000 ranges found
2016-09-10 17:57:23,903 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 2833 ms
2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 2536 ms
2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 2150 ms
2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 2061 ms
2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:57:35,628 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 2140 ms

Query code is available
https://github.com/joshelser/accumulo-range-binning
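
For readers who don't want to open the repo, the two task variants described above look roughly like this. This is a hand-written sketch under assumed names (conn, tableName, partition), not the repo's code; the BatchScanner thread count and error handling in the actual benchmark may differ.

    import org.apache.accumulo.core.client.BatchScanner;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import java.util.List;
    import java.util.Map;

    // Variant 1: every Range in the partition through a single BatchScanner.
    static void batchScannerTask(Connector conn, String tableName, List<Range> partition)
        throws TableNotFoundException {
      BatchScanner bs = conn.createBatchScanner(tableName, Authorizations.EMPTY, 1);
      bs.setRanges(partition);
      for (Map.Entry<Key,Value> entry : bs) {
        // drain results; the benchmark only measures elapsed time
      }
      bs.close();
    }

    // Variant 2: each Range in serial through its own Scanner.
    static void scannerTask(Connector conn, String tableName, List<Range> partition)
        throws TableNotFoundException {
      for (Range r : partition) {
        Scanner s = conn.createScanner(tableName, Authorizations.EMPTY);
        s.setRange(r);
        for (Map.Entry<Key,Value> entry : s) {
          // drain results
        }
        s.close();
      }
    }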

Sven Hodapp wrote:
Hi Keith,

I've tried it with 1, 2, and 10 threads. Unfortunately there were no
significant differences.
Maybe it's a problem with the table structure? For example, one row ID
(e.g. a sentence) may have several thousand column families. Can this
affect the seek performance?

My initial example has about 3000 row IDs to seek, which will return
about 500k entries. If I filter for specific column families (e.g. a
document without annotations) it will return about 5k entries, but the
seek time is only halved.
Are there too many column families to seek quickly?

Thanks!

Regards,
Sven
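
For reference, the column-family filtering Sven describes is a server-side restriction on the scanner. A minimal example, with made-up table, row, and family names, might look like:

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.io.Text;
    import java.util.Map;

    // Scan one row (e.g. one sentence) but only return a single column family.
    static long scanOneFamily(Connector conn, String tableName)
        throws TableNotFoundException {
      Scanner s = conn.createScanner(tableName, Authorizations.EMPTY);
      s.setRange(Range.exact(new Text("sentence-0001")));   // illustrative row ID
      s.fetchColumnFamily(new Text("document"));            // only this family comes back
      long count = 0;
      for (Map.Entry<Key,Value> entry : s) {
        count++;
      }
      s.close();
      return count;
    }

Filtering this way reduces what is sent back to the client, but unless the families are separated into locality groups the tablet server still has to seek past the other families within the row, which may be one reason the seek time only roughly halves rather than shrinking in proportion to the entry count.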
