Subject: Re: field collapsing performance in sharded environment
From: Erick Erickson <erickerickson@gmail.com>
To: solr-user@lucene.apache.org
Date: Thu, 14 Nov 2013 09:56:55 -0500

bq: Of the 10k docs, most have a unique near duplicate hash value, so
there are about 10k unique values for the field that I'm grouping on.

I suspect (but don't know the grouping code well) that this is the issue.
You're getting the top N groups, right? But in the general case, you can't
ensure that the top N from shard1 has any relation to the top N from
shard2, so I _suspect_ that the code returns all of the groups. Say group
5 has 3 docs on shard1 but 3,000 docs on shard2: to get the true top N,
you need to collate the values from all of the groups; you can't just
return the top 10 groups from each shard and still get correct counts.
Since your total group cardinality is about 10k (roughly 1k per shard),
you're pushing 10 packets each containing roughly 1k entries back to the
originating shard, which has to combine/sort them all to get the true
top N. At least, that's my theory.
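
If you want to confirm where the time is going, I _think_ adding
shards.info=true to the grouped request will help: it returns the elapsed
time and hit count for each shard sub-request, so you can compare the
per-shard times against the overall QTime. Roughly like this (host and
field names below are made up, substitute your own):

    http://host:8983/solr/core1/select?q=*:*
        &shards=s1,s2,...,s10
        &group=true&group.field=near_dup_hash&group.main=true
        &rows=10
        &shards.info=true

If the individual shard times stay small but the overall QTime balloons
only when the group.* parameters are present, that points at the
group-head merge on the originating node rather than the per-shard
grouping itself.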

Your situation is special in that you say your groups never appear on
more than one shard, so if I'm right you'd probably have to write
something that short-circuits this behavior and returns only the top N
(one way to approximate that from the client side is sketched in the P.S.
below the quoted mail). But that raises the question of why you're doing
this at all: what purpose is served by grouping on a field where most
groups have only one member?

Best,
Erick


On Wed, Nov 13, 2013 at 2:46 PM, David Anthony Troiano
<dtroiano@basistech.com> wrote:

> Hello,
>
> I'm hitting a performance issue when using field collapsing in a
> distributed Solr setup, and I'm wondering if others have seen it and if
> anyone has an idea for a workaround.
>
> I'm using field collapsing to deduplicate documents that have the same
> near-duplicate hash value, and deduplicating at query time (as opposed
> to filtering at index time) is a requirement. I have a sharded setup
> with 10 cores (not SolrCloud), each with ~1,000 documents. Of the 10k
> docs, most have a unique near-duplicate hash value, so there are about
> 10k unique values for the field that I'm grouping on. The grouping
> parameters that I'm using are:
>
> group=true
> group.field=
> group.main=true
>
> I'm running distributed queries (&shards=s1,s2,...,s10) where the only
> difference is the presence or absence of these three grouping
> parameters, and I'm consistently seeing a marked difference in
> performance (as a representative data point, 200ms latency without
> grouping and 1600ms with grouping). Interestingly, if I put all 10k docs
> on the same core and query that core independently with and without
> grouping, I don't see much of a latency difference, so the performance
> degradation seems to exist only in the sharded setup.
>
> Is there a known performance issue with field collapsing in a sharded
> setup (perhaps one that only manifests when the grouping field has many
> unique values), or have other people observed this? Any ideas for a
> workaround? Note that docs in my sharded setup can only have the same
> signature if they're in the same shard, so perhaps that can be used to
> boost performance, though I don't see an exposed way to do so.
>
> A follow-on question is whether we're likely to see the same issue
> if/when we move to SolrCloud.
>
> Thanks,
> Dave
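
P.S. If it really is true that a signature value never appears on more
than one shard, one thing you could try without touching Solr internals
is to take the &shards= round-trip out of the grouped case entirely: send
the same grouped request to each core directly, e.g. (placeholder hosts,
and <sig_field> standing in for whatever your group.field actually is)

    http://host1:8983/solr/core1/select?q=...&group=true&group.field=<sig_field>&group.main=true&rows=10
    http://host2:8983/solr/core2/select?q=...&group=true&group.field=<sig_field>&group.main=true&rows=10
    ...one request per core...

and merge the ten small result lists by score on the client. Because no
group spans a shard, the per-shard collapsing is already correct, so the
merge is just a sort of 10 * rows docs rather than ~1k group heads per
shard. I haven't tried this, and it assumes cross-shard scores are
comparable (the same assumption ordinary distributed search makes), so
treat it as a sketch rather than a recommendation.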