From: Michael McCandless
Date: Wed, 23 Oct 2013 11:05:32 -0400
Subject: Re: Lucene in-memory index
To: Lucene Users

On Tue, Oct 22, 2013 at 9:43 AM, Igor Shalyminov wrote:
> Thanks for the link, I'll definitely dig into SpanQuery internals very soon.

You could also just make a custom query. If you start from the
ProxBooleanTermQuery on that issue, but change it so that it rejects
hits that didn't have terms in the right positions, then you'll likely
have a much faster way to do your query.

>>> For "A,sg" and "A,pl" I use unordered SpanNearQueries with the slop=-1.
>>
>> I didn't even realize you could pass negative slop to span queries.
>> What does that do? Or did you mean slop=1?
>
> I indeed use an unordered SpanNearQuery with the slop = -1 (I saw it on some forum, maybe here: http://www.gossamer-threads.com/lists/lucene/java-user/89377?do=post_view_flat#89377)

Wow, OK. I have no idea what slop=-1 does...

> So far it works for me:)
>
>>
>>> I wrap them into an ordered SpanNearQuery with the slop=0.
>>>
>>> I see getPayload() in the profiler top.
>>> I think I can emulate payload checking with cleverly assigned position increments (and then maximum position in a document might jump up to ~10^9 - I hope it won't blow the whole index up).
>>>
>>> If I remove payload matching and keep only position checking, will it speed up everything, or the positions and payloads are the same?
>>
>> I think it would help to avoid payloads, but I'm not sure by how much.
>> E.g., I see that NearSpansOrdered creates a new Set for every hit
>> just to hold payloads, even if payloads are not going to be used.
>> Really the span scorers should check Terms.hasPayloads up front ...
>>
>>> My main goal is getting the precise results for a query, so proximity boosting won't help, unfortunately.
>>
>> OK.
>>
>> I wonder if you can somehow identify the spans you care about at
>> indexing time, e.g. A,sg followed by N,sg and e.g. add a span into the
>> index at that point; this would make searching much faster (it becomes
>> a TermQuery). For exact matching (slop=0) you can also index
>> shingles.
>
> Thanks for the clue, I think it can be a good optimization heuristic.
> I actually tried a similar approach to optimize search of attributes at the same position.
> Here's how it was supposed to work for a feature set "S,sg,nom,fem":
>
> * the regular approach: split it into grammar atomics: "S", "sg", "nom", "fem". With payloads and positions assigned the right way, this would allow us to search for an arbitrary combination of these attributes _but_ with multiple postings merging.
> * the experimental approach: sort the atomics lexicographically and index all the subsets: "S", "fem", "nom", "sg", "S,fem", "S,nom", ..., "S,fem,nom,sg". With the preprocessing of the user query the same way (split - sort - join) it would allow us to process the same queries exactly within one posting.
>
> This technique is actually used in our current production index based on Yandex.Server engine.
> But Yandex.Server somehow makes the index size reasonable (within the order of magnitude of original text size), while Lucene index blows up totally (>10 times original text size) and no search performance improvements appear.

That's really odd. I would expect index to become much larger, but
search performance ought to be much faster since you run simple
TermQuery.

Mike McCandless

http://blog.mikemccandless.com
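
For reference, the "split - sort - join" scheme from the quoted message can be sketched in plain Java roughly as follows. This is only an illustration under stated assumptions: the class name FeatureSubsets and the methods indexTerms and queryTerm are hypothetical, no Lucene API is involved, and it is not the code used in either index discussed in the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene or Yandex.Server.
public class FeatureSubsets {

    // Split a feature set such as "S,sg,nom,fem" into atomics and sort them
    // lexicographically (String natural order, so "S" sorts before "fem").
    // Assumes the atomics within one feature set are distinct.
    public static String[] sortedAtomics(String featureSet) {
        String[] atomics = featureSet.split(",");
        Arrays.sort(atomics);
        return atomics;
    }

    // Index side: emit one term per non-empty subset of the sorted atomics.
    // Each bit of 'mask' selects one atomic, so n atomics give 2^n - 1 terms.
    public static List<String> indexTerms(String featureSet) {
        String[] atomics = sortedAtomics(featureSet);
        List<String> terms = new ArrayList<>();
        for (int mask = 1; mask < (1 << atomics.length); mask++) {
            StringBuilder term = new StringBuilder();
            for (int i = 0; i < atomics.length; i++) {
                if ((mask & (1 << i)) != 0) {
                    if (term.length() > 0) {
                        term.append(',');
                    }
                    term.append(atomics[i]);
                }
            }
            terms.add(term.toString());
        }
        return terms;
    }

    // Query side: normalize the user's attributes the same way
    // (split - sort - join) so they map onto exactly one indexed term.
    public static String queryTerm(String userQuery) {
        return String.join(",", sortedAtomics(userQuery));
    }

    public static void main(String[] args) {
        List<String> terms = indexTerms("S,sg,nom,fem");
        System.out.println(terms.size() + " terms: " + terms);
        System.out.println("query term: " + queryTerm("sg,S,fem"));
    }
}
```

With four atomics this produces 15 terms per token position (2^n - 1 for n atomics), which is at least consistent with the index growing considerably; a query on any attribute combination then reduces to a lookup of a single term.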