From: "Erick Erickson"
To: java-user@lucene.apache.org
Subject: Counting hits in a document
Date: Thu, 18 Jan 2007 16:06:57 -0500
Message-ID: <359a92830701181306k2d97c9d2we5e85ce08f008c06@mail.gmail.com>

Hi again. I've been struggling for the last couple of days and getting nowhere, so it's time to swallow my pride and say "Help"....

OK, let's say I have a document indexed and I do NOT have access to the raw text. I need to find the offsets of all the hits for a query on a single document. Advice was offered a while ago to use getSpans from a SpanQuery, but for the life of me I don't see how to make this work. As I remember, Erik was talking about rewriting the original query as a set of spans.

The trouble I'm having is that I sure don't see how to rewrite the standard query as a span query, then feed that back into my index for a particular document (that I have a unique ID for). It seems that getSpans looks through my entire index, which is *probably* prohibitive.

I can make each part of the query into a SpanTermQuery. I can assemble these together into a bunch of nested span queries. At the end of this, I have a single span query that I can call getSpans on. But what now? I don't see how the spans relate to the document I need to focus on. From what I see of the Spans interface, it's intended to look at the entire index rather than be confined to a subset of the documents (in this case, exactly one. Guaranteed).

I've thought about putting the document ID in a MUST clause of a BooleanQuery and adding my span query to that, but it doesn't look like getSpans does me any good there.

I looked at the SrndQuery family and don't see anything there that lets me get the offsets of my matches.
I don't have the text, so I can't highlight all the hits and count. The code I've been writing feels like the wrong solution to the wrong problem at the wrong time.

Given that I know the document ID on the way in, is my best bet to roll my own? That is, enumerate the relevant terms in my document, measure the distance between the terms, and aggregate the results myself? I'd rather not do that, of course, but can if necessary. I *want* someone to say "just call "....

Any help greatly appreciated...

Thanks
Erick
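
For concreteness, here is a minimal sketch of the "roll my own" option described above, written in plain Java so it runs without Lucene. It assumes the token positions of two query terms in the one document have already been pulled out of the index (for example via TermPositions, though that step is not shown) and are sorted ascending. The class name NearMatchCounter, the method findOrderedNear, and the slop rule used here are all hypothetical; this is a simplified ordered-proximity check, not a reimplementation of SpanNearQuery, and it works on token positions rather than character offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: pairs up positions of two terms inside ONE document.
final class NearMatchCounter {

    /**
     * For each position of the first term, finds the nearest following
     * position of the second term and records the pair if at most `slop`
     * tokens lie between them. Both arrays must be sorted ascending.
     * Each returned element is {firstTermPos, secondTermPos}.
     */
    static List<int[]> findOrderedNear(int[] first, int[] second, int slop) {
        List<int[]> matches = new ArrayList<>();
        int j = 0;
        for (int posA : first) {
            // Advance to the first occurrence of the second term after posA.
            // j never moves backwards because `first` is sorted.
            while (j < second.length && second[j] <= posA) {
                j++;
            }
            // Tokens strictly between the two terms: second[j] - posA - 1.
            if (j < second.length && second[j] - posA - 1 <= slop) {
                matches.add(new int[] {posA, second[j]});
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up positions for two terms in a single document.
        int[] first = {2, 10, 30};
        int[] second = {4, 11, 50};
        for (int[] m : findOrderedNear(first, second, 1)) {
            System.out.println("match at positions " + m[0] + " and " + m[1]);
        }
    }
}
```

The hit count for the document is then the size of the returned list. Nested queries would need the same pairing applied to the results of inner matches, which is the part that makes this approach tedious compared with asking the index for spans directly.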