Message-ID: <5450FB32.8040800@safaribooksonline.com>
Date: Wed, 29 Oct 2014 10:35:30 -0400
From: Michael Sokolov <msokolov@safaribooksonline.com>
To: java-user@lucene.apache.org
Subject: Re: Query with many clauses

I did some analysis with access-control lists and found that our customers have significant overlap in the documents they have access to, so we would be able to realize very nice compression in the number of terms in access-control queries by indexing overlapping subsets. However, this is a fair amount of effort, since it requires analyzing all the access lists periodically and re-indexing some set of documents when that changes.
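(The overlap analysis Mike describes can be estimated with a quick back-of-the-envelope calculation. The sketch below is purely illustrative, not from the thread: it assumes each customer's ACL is a set of document ids, and estimates how many query terms two customers would save if their shared documents were indexed under a single synthetic "group" term. `termsSaved` is a hypothetical helper name.)

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class AclOverlap {
    // Estimate the term savings if the documents shared by two customers'
    // ACLs were indexed under one synthetic "group" term instead of being
    // enumerated as per-document clauses in each customer's query.
    static int termsSaved(Set<String> aclA, Set<String> aclB) {
        Set<String> shared = new HashSet<>(aclA);
        shared.retainAll(aclB);          // documents both customers can see
        // Each customer replaces |shared| per-document clauses with a single
        // group clause, so each saves (|shared| - 1) terms.
        return shared.size() <= 1 ? 0 : 2 * (shared.size() - 1);
    }

    public static void main(String[] args) {
        Set<String> a = new HashSet<>(Arrays.asList("d1", "d2", "d3", "d4"));
        Set<String> b = new HashSet<>(Arrays.asList("d2", "d3", "d4", "d5"));
        System.out.println(termsSaved(a, b)); // 3 shared docs -> prints 4
    }
}
```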
We're able to achieve good-enough performance by simply caching a filter we generate when a session starts. Even though the initial query may be somewhat slow, we only run it once, and the user is largely unaffected. Maybe you can play some trick like that?

-Mike

On 10/29/2014 08:20 AM, Pawel Rog wrote:
> Hi,
> I already tried transforming queries to filters (TermQuery -> TermFilter),
> but didn't see much speedup. I wrapped this filter in a
> ConstantScoreQuery, and in another test I used FilteredQuery with
> MatchAllDocsQuery and a BooleanFilter. Both cases seem to perform about
> the same as a simple BooleanQuery.
> But of course I'll also try TermsFilter. Maybe it will speed up the
> filters.
>
> Michael Sokolov: I haven't prepared any statistics about the number of
> BooleanClauses used, or whether there are repeating sets of terms. I
> think I have to collect some stats to better understand what can be
> improved.
>
> --
> Paweł Róg
>
>
> On Wed, Oct 29, 2014 at 12:30 PM, Michael Sokolov <
> msokolov@safaribooksonline.com> wrote:
>
>> I'm curious to know more about your use case, because I have an idea
>> for something that addresses this, but haven't found the opportunity to
>> develop it yet - maybe somebody else wants to :). The basic idea is to
>> reduce the number of terms that need to be looked up by collapsing
>> commonly occurring collections of terms into synthetic "tiles". If your
>> queries have a lot of overlap, this could greatly reduce the number of
>> terms in a query rewritten to use tiles. It's somewhat complex,
>> requires indexing support or a filter cache, and there's no working
>> implementation as yet, so this is probably not really going to be
>> helpful for you in the short term, but if you can share some
>> information I'd love to know:
>>
>> what kind of things are you searching?
>> how many terms do your larger queries have?
>> do the query terms overlap among your queries?
>>
>> -Mike Sokolov
>>
>>
>> On 10/28/14 9:40 PM, Pawel Rog wrote:
>>
>>> Hi,
>>> I have to run queries with a lot of boolean "should" clauses. Queries
>>> like these were of course slow, so I decided to change the query to a
>>> filter wrapped in a ConstantScoreQuery, but that didn't help either.
>>>
>>> A profiler shows that most of the time is spent in seekExact in
>>> BlockTreeTermsReader$FieldReader$SegmentTermsEnum.
>>>
>>> Going deeper in the trace, I see that inside seekExact most of the
>>> time is spent in loadBlock and, deeper still, in
>>> ByteBufferIndexInput.clone.
>>>
>>> Do you have any ideas how I can make this faster, or is it not
>>> possible and I just have to live with it?
>>>
>>> --
>>> Paweł Róg
>>>
>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
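(Postscript sketch, not from the thread: the session-scoped filter cache Mike describes - build the expensive access-control filter once when a session starts, then reuse it on every query - follows a simple memoization pattern. In Lucene 4.x the cached value would typically be a Filter wrapped in CachingWrapperFilter; in this Lucene-agnostic sketch a Set of allowed document ids stands in for the real filter, and `SessionFilterCache` is a hypothetical name.)

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class SessionFilterCache {
    // One cached "filter" (here: a set of allowed doc ids) per session.
    private final Map<String, Set<String>> cache = new ConcurrentHashMap<>();
    // The expensive step: resolving a session's ACL into a filter.
    private final Function<String, Set<String>> buildFilter;

    public SessionFilterCache(Function<String, Set<String>> buildFilter) {
        this.buildFilter = buildFilter;
    }

    // The first call for a session pays the full build cost;
    // every later call for the same session is a cache hit.
    public Set<String> filterFor(String sessionId) {
        return cache.computeIfAbsent(sessionId, buildFilter);
    }

    public static void main(String[] args) {
        SessionFilterCache cache = new SessionFilterCache(
                session -> Set.of("doc1", "doc42")); // stand-in for a slow ACL query
        System.out.println(cache.filterFor("alice"));
    }
}
```

The same shape applies when the cached value is a real Lucene filter: the point is that the slow query runs once per session, not once per search.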