From: Keith Turner
Date: Mon, 12 Sep 2016 14:05:19 -0400
Subject: Re: Accumulo Seek performance
To: user@accumulo.apache.org

Note I was running a single tserver, datanode, and zookeeper on my workstation.

On Mon, Sep 12, 2016 at 2:02 PM, Keith Turner wrote:
> Josh helped me get up and running w/ YCSB and I am seeing very
> different results. I am going to make a pull req to Josh's GH repo
> to add a Readme w/ what I learned from Josh in IRC.
>
> The link below is the Accumulo config I used for running a local 1.8.0 instance.
>
> https://gist.github.com/keith-turner/4678a0aac2a2a0e240ea5d73285743ab
>
> I created splits user1~ user2~ user3~ user4~ user5~ user6~ user7~
> user8~ user9~ and then compacted the table.
>
> Below is the performance I saw with a single batch scanner (configured
> with 1 partition). The batch scanner has 10 threads.
>
> 2016-09-12 12:36:41,079 [client.ClientConfiguration] WARN : Found no client.conf in default paths. Using default client configuration values.
> 2016-09-12 12:36:41,428 [joshelser.YcsbBatchScanner] INFO : Connected to Accumulo
> 2016-09-12 12:36:41,429 [joshelser.YcsbBatchScanner] INFO : Computing ranges
> 2016-09-12 12:36:48,059 [joshelser.YcsbBatchScanner] INFO : Calculated all rows: Found 1000000 rows
> 2016-09-12 12:36:48,096 [joshelser.YcsbBatchScanner] INFO : Shuffled all rows
> 2016-09-12 12:36:48,116 [joshelser.YcsbBatchScanner] INFO : All ranges calculated: 3000 ranges found
> 2016-09-12 12:36:48,118 [joshelser.YcsbBatchScanner] INFO : Executing 1 range partitions using a pool of 1 threads
> 2016-09-12 12:36:49,372 [joshelser.YcsbBatchScanner] INFO : Queries executed in 1252 ms
> 2016-09-12 12:36:49,372 [joshelser.YcsbBatchScanner] INFO : Executing 1 range partitions using a pool of 1 threads
> 2016-09-12 12:36:50,561 [joshelser.YcsbBatchScanner] INFO : Queries executed in 1188 ms
> 2016-09-12 12:36:50,561 [joshelser.YcsbBatchScanner] INFO : Executing 1 range partitions using a pool of 1 threads
> 2016-09-12 12:36:51,741 [joshelser.YcsbBatchScanner] INFO : Queries executed in 1179 ms
> 2016-09-12 12:36:51,741 [joshelser.YcsbBatchScanner] INFO : Executing 1 range partitions using a pool of 1 threads
> 2016-09-12 12:36:52,974 [joshelser.YcsbBatchScanner] INFO : Queries executed in 1233 ms
> 2016-09-12 12:36:52,974 [joshelser.YcsbBatchScanner] INFO : Executing 1 range partitions using a pool of 1 threads
> 2016-09-12 12:36:54,146 [joshelser.YcsbBatchScanner] INFO : Queries executed in 1171 ms
>
> Below is the performance I saw with 6 batch scanners. Each batch
> scanner has 10 threads.
>
> 2016-09-12 13:58:21,061 [client.ClientConfiguration] WARN : Found no client.conf in default paths. Using default client configuration values.
> 2016-09-12 13:58:21,380 [joshelser.YcsbBatchScanner] INFO : Connected to Accumulo
> 2016-09-12 13:58:21,381 [joshelser.YcsbBatchScanner] INFO : Computing ranges
> 2016-09-12 13:58:28,571 [joshelser.YcsbBatchScanner] INFO : Calculated all rows: Found 1000000 rows
> 2016-09-12 13:58:28,606 [joshelser.YcsbBatchScanner] INFO : Shuffled all rows
> 2016-09-12 13:58:28,632 [joshelser.YcsbBatchScanner] INFO : All ranges calculated: 3000 ranges found
> 2016-09-12 13:58:28,634 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
> 2016-09-12 13:58:30,273 [joshelser.YcsbBatchScanner] INFO : Queries executed in 1637 ms
> 2016-09-12 13:58:30,273 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
> 2016-09-12 13:58:31,883 [joshelser.YcsbBatchScanner] INFO : Queries executed in 1609 ms
> 2016-09-12 13:58:31,883 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
> 2016-09-12 13:58:33,422 [joshelser.YcsbBatchScanner] INFO : Queries executed in 1539 ms
> 2016-09-12 13:58:33,422 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
> 2016-09-12 13:58:34,994 [joshelser.YcsbBatchScanner] INFO : Queries executed in 1571 ms
> 2016-09-12 13:58:34,994 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
> 2016-09-12 13:58:36,512 [joshelser.YcsbBatchScanner] INFO : Queries executed in 1517 ms
>
> Below is the performance I saw with 6 threads each using a scanner.
>
> 2016-09-12 14:01:14,972 [client.ClientConfiguration] WARN : Found no client.conf in default paths. Using default client configuration values.
> 2016-09-12 14:01:15,287 [joshelser.YcsbBatchScanner] INFO : Connected to Accumulo
> 2016-09-12 14:01:15,288 [joshelser.YcsbBatchScanner] INFO : Computing ranges
> 2016-09-12 14:01:22,309 [joshelser.YcsbBatchScanner] INFO : Calculated all rows: Found 1000000 rows
> 2016-09-12 14:01:22,352 [joshelser.YcsbBatchScanner] INFO : Shuffled all rows
> 2016-09-12 14:01:22,373 [joshelser.YcsbBatchScanner] INFO : All ranges calculated: 3000 ranges found
> 2016-09-12 14:01:22,376 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
> 2016-09-12 14:01:25,696 [joshelser.YcsbBatchScanner] INFO : Queries executed in 3318 ms
> 2016-09-12 14:01:25,696 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
> 2016-09-12 14:01:29,001 [joshelser.YcsbBatchScanner] INFO : Queries executed in 3305 ms
> 2016-09-12 14:01:29,001 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
> 2016-09-12 14:01:31,824 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2822 ms
> 2016-09-12 14:01:31,824 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
> 2016-09-12 14:01:34,207 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2383 ms
> 2016-09-12 14:01:34,207 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
> 2016-09-12 14:01:36,548 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2340 ms
>
> On Sat, Sep 10, 2016 at 6:01 PM, Josh Elser wrote:
>> Sven, et al:
>>
>> So, it would appear that I have been able to reproduce this one (better late
>> than never, I guess...). tl;dr Serially using Scanners to do point lookups
>> instead of a BatchScanner is ~20x faster. This sounds like a pretty serious
>> performance issue to me.
>>
>> Here's a general outline for what I did.
>>
>> * Accumulo 1.8.0
>> * Created a table with 1M rows, each row with 10 columns using YCSB
>>   (workloada)
>> * Split the table into 9 tablets
>> * Computed the set of all rows in the table
>>
>> For a number of iterations:
>> * Shuffle this set of rows
>> * Choose the first N rows
>> * Construct an equivalent set of Ranges from the set of Rows, choosing a
>>   random column (0-9)
>> * Partition the N rows into X collections
>> * Submit X tasks to query one partition of the N rows (to a thread pool with
>>   X fixed threads)
>>
>> I have two implementations of these tasks. One, where all ranges in a
>> partition are executed via one BatchScanner. A second where each range is
>> executed in serial using a Scanner. The numbers speak for themselves.
>>
>> ** BatchScanners **
>> 2016-09-10 17:51:38,811 [joshelser.YcsbBatchScanner] INFO : Shuffled all rows
>> 2016-09-10 17:51:38,843 [joshelser.YcsbBatchScanner] INFO : All ranges calculated: 3000 ranges found
>> 2016-09-10 17:51:38,846 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>> 2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Queries executed in 40178 ms
>> 2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>> 2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Queries executed in 42296 ms
>> 2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>> 2016-09-10 17:53:47,414 [joshelser.YcsbBatchScanner] INFO : Queries executed in 46094 ms
>> 2016-09-10 17:53:47,415 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>> 2016-09-10 17:54:35,118 [joshelser.YcsbBatchScanner] INFO : Queries executed in 47704 ms
>> 2016-09-10 17:54:35,119 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>> 2016-09-10 17:55:24,339 [joshelser.YcsbBatchScanner] INFO : Queries executed in 49221 ms
>>
>> ** Scanners **
>> 2016-09-10 17:57:23,867 [joshelser.YcsbBatchScanner] INFO : Shuffled all rows
>> 2016-09-10 17:57:23,898 [joshelser.YcsbBatchScanner] INFO : All ranges calculated: 3000 ranges found
>> 2016-09-10 17:57:23,903 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>> 2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2833 ms
>> 2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>> 2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2536 ms
>> 2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>> 2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2150 ms
>> 2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>> 2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2061 ms
>> 2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>> 2016-09-10 17:57:35,628 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2140 ms
>>
>> Query code is available at https://github.com/joshelser/accumulo-range-binning
>>
>>
>> Sven Hodapp wrote:
>>>
>>> Hi Keith,
>>>
>>> I've tried it with 1, 2 or 10 threads. Unfortunately there were no
>>> amazing differences.
>>> Maybe it's a problem with the table structure? For example it may happen
>>> that one row id (e.g. a sentence) has several thousand column families. Can
>>> this affect the seek performance?
>>>
>>> So for my initial example it has about 3000 row ids to seek, which will
>>> return about 500k entries. If I filter for specific column families (e.g. a
>>> document without annotations) it will return about 5k entries, but the seek
>>> time will only be halved.
>>> Are there too many column families to seek quickly?
>>>
>>> Thanks!
>>>
>>> Regards,
>>> Sven
>>>
>>
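
For readers following the thread, the iteration loop in Josh's outline (shuffle the rows, take the first N, partition them into X collections, submit X tasks to a fixed pool of X threads, time the batch) can be sketched roughly as below. This is only an illustration and not the code from the accumulo-range-binning repository: the class name RangePartitionSketch, the helpers partition and runOnce, the round-robin partitioning, the synthetic row ids, and the no-op query placeholder are all assumptions made here. A real run would replace the placeholder with Scanner or BatchScanner lookups against the table.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical sketch of the benchmark loop; not the actual YcsbBatchScanner code.
public class RangePartitionSketch {

    // Split items into numPartitions lists, round-robin (an assumed strategy).
    public static <T> List<List<T>> partition(List<T> items, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (int i = 0; i < items.size(); i++) {
            parts.get(i % numPartitions).add(items.get(i));
        }
        return parts;
    }

    // Submit one task per partition to a fixed pool with one thread per
    // partition, wait for all of them, and return the elapsed milliseconds.
    public static <T> long runOnce(List<List<T>> parts, Consumer<List<T>> query)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parts.size());
        try {
            long start = System.nanoTime();
            List<Future<?>> futures = new ArrayList<>();
            for (List<T> part : parts) {
                futures.add(pool.submit(() -> query.accept(part)));
            }
            for (Future<?> f : futures) {
                f.get(); // rethrows any failure from the task
            }
            return (System.nanoTime() - start) / 1_000_000;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Synthetic row ids stand in for the rows read from the table.
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            rows.add("user" + i);
        }
        Random rnd = new Random(42);
        int n = 3000; // rows chosen per iteration (N)
        int x = 6;    // partitions and pool threads (X)
        for (int iter = 0; iter < 5; iter++) {
            Collections.shuffle(rows, rnd);
            List<String> chosen = new ArrayList<>(rows.subList(0, n));
            List<List<String>> parts = partition(chosen, x);
            // Placeholder query: a real task would look up each row here.
            long ms = runOnce(parts, part -> { });
            System.out.println("Queries executed in " + ms + " ms");
        }
    }
}
```

With X = 1 this corresponds to the "1 range partitions using a pool of 1 threads" runs in Keith's logs, and with X = 6 to the six-partition runs. Only the body of the query task would differ between the Scanner and BatchScanner variants.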