Message-ID: <20090821003009.C6LIW.31472.imail@eastrmwml36>
Date: Fri, 21 Aug 2009 0:30:09 -0400
From:
To: java-user@lucene.apache.org
Subject: Re: Possible to invoke same Lucene query on a String?
Cc: Paul Cowan
In-Reply-To: <4A8E2024.1030103@aconex.com>

---- Paul Cowan wrote:
> ohaya@cox.net wrote:
> > - I'd have to create a (very small) index, for each sub-document, where I do the Document.add() with just the (for example) two terms, then
> > - Run a query against the 1-entry index, which
> > - Would either give me a "yes" or "no" (for that sub-document)
> >
> > As I said, I'm concerned about overhead. Some of the documents are quite large, containing >20K sub-documents. That means that, for such a document, I'd have to create >20K indexes.
>
> No, I'm talking about a separate document in the same index.
>
> There are a few approaches here:
>
> 1) Index each sub-document separately. So if you have fields 'doc#',
> 'docname', 'subdoc#', and 'subdocterms', you might do:
>
>     // Pseudocode: real Field constructors also take Field.Store/Field.Index options
>     for (Doc parent : Docs) {
>       for (SubDoc child : parent.subDocs()) {
>         Document luceneDoc = new Document();
>         luceneDoc.add(new Field("doc#", parent.number()));
>         luceneDoc.add(new Field("docname", parent.name()));
>         luceneDoc.add(new Field("subdoc#", child.number()));
>         luceneDoc.add(new Field("subdocterms", child.data()));
>         // then add luceneDoc to the index, e.g. with IndexWriter.addDocument()
>       }
>     }
>
> This means that in your index after indexing 2 docs with 2 subdocs each,
> you'll have
>
> (Lucene #)  doc#  docname  subdoc#  subdocterms
> ----------------------------------------------------
> 0           100   Foo      101      subdoc1 terms here
> 1           100   Foo      102      subdoc2 terms
> 2           200   Bar      201      subdoc1 terms from doc2
> 3           200   Bar      202      some more subdoc text
>
> So the search you're doing is actually on the subdoc level.
> This can get complicated, especially as subdocs from the same parent
> doc may come back out of order, etc, depending on scoring/sorting.
>
> Also, if there is a lot of data at the parent level, you're obviously
> duplicating it. This can get nasty.
>
> 2) Maintain a (logically) separate subdoc index. You could have
> something like:
>
> doc#  docname  bigblobofdocdata
> ---------------------------------
> 100   Foo      lots of data here...
> 200   Bar      and lots more here..
>
> in one index, and
>
> doc#  subdoc#  subdocterms
> ---------------------------------
> 100   101      subdoc1 terms here
> 100   102      subdoc2 terms
> 200   201      subdoc1 terms from doc2
> 200   202      some more subdoc text
>
> Then you can FIRST search on the doc index to do any matches on
> 'docname' etc, then use the IDs you find to filter the subdoc index --
> so if the user searches for 'docname=foo' and 'subdocterms=text', you
> first do the docname search to get the docname-matching doc (100), then
> do a search on the second index for 'subdocterms', but also filter where
> doc#=100.
>
> Note they don't HAVE to be separate indexes -- you can actually keep
> these in the same physical index, with some sort of discriminator (all
> docs in an index don't have to have the same fields).
>
> 3) Do some really hardcore tricks with spanqueries. This is what I'm
> working on at the moment, so it's near and dear to my heart. It's not
> for the faint-hearted, though, and if you're new to Lucene may scare you
> off, sorry! Basically Lucene has the concept of 'positions' for terms --
> metadata about where in the document the term can be found. This lets
> you do 'near' queries, etc.
>
> We're taking advantage of that to do some many-to-one stuff like your
> problem.
> Using the first example, with term positions indicated in [],
> we position terms from different subdocs with a large gap between them,
> like so:
>
> (Lucene #)  doc#  docname  subdoc#   subdocterms
> ----------------------------------------------------
> 0           100   Foo      101[0]    subdoc1[0] terms[1] here[2]
>                            102[100]  subdoc2[100] terms[101]
>
> 1           200   Bar      201[0]    subdoc1[0] terms[1] from[2]
>                            202[100]  doc2[3] some[100] more[101]
>                                      subdoc[102] text[103]
>
> So in each doc, subdoc #1's terms start at 0, #2's at 100, #3s at 200,
> etc. Then when we search we can say 'the terms you're looking for must
> be in the same 100-position block' to find only subdocs that match all
> subdoc-related subqueries. This is pretty hairy but is working well for
> us -- massively reduces our indexing and search times compared to the
> duplicated document way I mentioned above.
>
> Cheers,
>
> Paul

Paul,

Oh boy, you've given me a LOT to chew on :)!!

At first read, I like your #1 approach, maybe because it's easiest for me to understand. I have to think about it, but my first thought is that we might not need/want the sub-doc index to persist after they're used (maybe!), so create the sub-doc index "on-the-fly" for each Document, maybe using that example I linked as the template, do the query, then move on to the next Document...

I'll have to think about it. Like I said, lots of ideas in your message :)...

Having said that, I keep thinking wouldn't it be much easier if, as I originally posted, there was a way to invoke a "Lucene query" on just a String object :(?? Of course, if, after some more thought, it makes more sense to persist the sub-doc index(es), then I guess not...

Again, thanks. Now, I'll have to re-read what you wrote, a couple of times.

Jim
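
P.S. For anyone following the thread, the position-gap idea in Paul's approach 3 can be illustrated without Lucene at all. The sketch below is a plain-Java toy. It is not Lucene's API and not Paul's actual implementation: the class SubDocPositions and its methods index() and matchingBlocks() are hypothetical names made up for this illustration, and whitespace splitting stands in for a real analyzer. It only shows the arithmetic: sub-document i gets positions starting at i * 100, and two terms belong to the same sub-document exactly when their positions fall in the same 100-position block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy class, for illustration only (not part of Lucene).
final class SubDocPositions {
    // Gap between the first positions of consecutive sub-documents (100 in Paul's example).
    static final int BLOCK_SIZE = 100;

    // Builds term -> positions for one parent document.
    // Sub-document i is assigned positions i * BLOCK_SIZE, i * BLOCK_SIZE + 1, ...
    static Map<String, List<Integer>> index(List<String> subDocs) {
        Map<String, List<Integer>> postings = new HashMap<>();
        for (int i = 0; i < subDocs.size(); i++) {
            String[] terms = subDocs.get(i).trim().split("\\s+");
            if (terms.length > BLOCK_SIZE) {
                // A longer sub-document would spill into the next block.
                throw new IllegalArgumentException("sub-document " + i + " exceeds block size");
            }
            for (int j = 0; j < terms.length; j++) {
                if (terms[j].isEmpty()) {
                    continue; // empty sub-document text yields one empty token
                }
                postings.computeIfAbsent(terms[j], k -> new ArrayList<>())
                        .add(i * BLOCK_SIZE + j);
            }
        }
        return postings;
    }

    // Returns the block numbers (sub-document ordinals) that contain ALL query terms.
    static Set<Integer> matchingBlocks(Map<String, List<Integer>> postings, String... queryTerms) {
        Set<Integer> result = null;
        for (String term : queryTerms) {
            Set<Integer> blocks = new TreeSet<>();
            for (int pos : postings.getOrDefault(term, Collections.<Integer>emptyList())) {
                blocks.add(pos / BLOCK_SIZE); // integer division maps a position to its block
            }
            if (result == null) {
                result = blocks;
            } else {
                result.retainAll(blocks); // keep only blocks seen for every term so far
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        // The two sub-documents of doc# 200 from Paul's table.
        Map<String, List<Integer>> postings =
                index(Arrays.asList("subdoc1 terms from doc2", "some more subdoc text"));
        System.out.println(matchingBlocks(postings, "more", "text"));  // prints [1]
        System.out.println(matchingBlocks(postings, "terms", "text")); // prints []
    }
}
```

With the sub-documents of doc# 200, "more" and "text" both land in block 1 (positions 101 and 103), so that sub-document matches. "terms" (position 1) and "text" (position 103) sit in different blocks, so the parent document as a whole contains both terms but no single sub-document does. In a real index the same-block test would presumably be expressed through span queries, as Paul describes, rather than by dividing positions by hand.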