lucene-java-user mailing list archives

From "Erick Erickson" <>
Subject Re: Counting hits in a document
Date Thu, 18 Jan 2007 23:29:01 GMT

It was late this afternoon and I was square-eyed, so I didn't add the
detail. The app we're working on first returns a summary list of all the
books that match a query, no hit information. Next, the user clicks on a
returned title and we show the hits by chapter. That is, a list of chapters
and the count of the hits for each. The index is nearing 15G at present, so
I *assumed* that I really didn't want to re-query the entire index when I
know the particular document I care about already. But what do I know?
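A minimal sketch of the per-chapter tally described above, assuming the hit positions for the one selected book are already in hand and that each chapter's starting token position is known. Both inputs, and the names ChapterHitCounter and countByChapter, are hypothetical; nothing here is Lucene API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

final class ChapterHitCounter {
    /**
     * chapterStarts maps the first token position of each chapter to its title
     * (assumed input; chapter titles are assumed to be unique).
     * hitPositions holds the token positions of the hits in one document.
     */
    static Map<String, Integer> countByChapter(TreeMap<Integer, String> chapterStarts,
                                               int[] hitPositions) {
        // Seed every chapter with zero so chapters without hits still appear.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String title : chapterStarts.values()) {
            counts.put(title, 0);
        }
        for (int pos : hitPositions) {
            // floorEntry finds the chapter whose start is the closest one at or before pos.
            Map.Entry<Integer, String> chapter = chapterStarts.floorEntry(pos);
            if (chapter != null) {
                counts.merge(chapter.getValue(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> starts = new TreeMap<>();
        starts.put(0, "Chapter 1");
        starts.put(100, "Chapter 2");
        starts.put(250, "Chapter 3");
        System.out.println(countByChapter(starts, new int[] {5, 99, 100, 300}));
    }
}
```

Hits that fall before the first chapter start are ignored in this sketch rather than assigned to a chapter.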


Very most excellent. I'll give it a look in the morning. I hope that the
class doesn't need the raw text since I don't have it any more, but your
comment "Give it a query it will give you the spans" makes me hopeful.
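One way to confine span matches to a single document is to skip the iterator ahead to that document and stop as soon as it moves past it. The sketch below models that idea with a hypothetical SpanCursor interface and an in-memory ArrayCursor; it is a stand-in for illustration, not Lucene's Spans API, although the method names mirror what I believe that interface offers (next, skipTo, doc, start, end). It also assumes the document number is the index's internal one, not an application-level unique ID, and that start and end are token positions rather than character offsets.

```java
import java.util.ArrayList;
import java.util.List;

final class SingleDocSpans {
    /** Hypothetical stand-in for a span iterator; matches are ordered by document, then start. */
    interface SpanCursor {
        boolean next();                 // advance to the next match, false when exhausted
        boolean skipTo(int targetDoc);  // advance to the first later match with doc >= targetDoc
        int doc();
        int start();
        int end();
    }

    /** In-memory cursor over {doc, start, end} triples sorted by doc, for illustration only. */
    static final class ArrayCursor implements SpanCursor {
        private final int[][] spans;
        private int i = -1;

        ArrayCursor(int[][] spans) { this.spans = spans; }

        public boolean next() { i++; return i < spans.length; }

        public boolean skipTo(int targetDoc) {
            // Always moves forward at least one match, then until doc >= targetDoc.
            do { i++; } while (i < spans.length && spans[i][0] < targetDoc);
            return i < spans.length;
        }

        public int doc() { return spans[i][0]; }
        public int start() { return spans[i][1]; }
        public int end() { return spans[i][2]; }
    }

    /** Collects {start, end} pairs for docId only, without walking the matches before it one by one. */
    static List<int[]> spansForDoc(SpanCursor cursor, int docId) {
        List<int[]> out = new ArrayList<>();
        boolean more = cursor.skipTo(docId);
        while (more && cursor.doc() == docId) {
            out.add(new int[] {cursor.start(), cursor.end()});
            more = cursor.next();
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] all = {{1, 0, 2}, {3, 4, 6}, {3, 10, 11}, {7, 1, 2}};
        for (int[] span : spansForDoc(new ArrayCursor(all), 3)) {
            System.out.println(span[0] + "-" + span[1]);
        }
    }
}
```

The size of the returned list is the hit count for that document; whether skipping is cheap enough on a 15G index is something to measure rather than assume.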

The real issue is that it looks like I'm reverting to my old "C" days. The
code I was writing the last couple of days started to look like a program
from...well...a long time ago. So I *know* it must be wrong <G>...... It's a
real pain in the neck to *think* in Java terms when much of my training was
before this new-fangled way of looking at programming problems happened. I
suppose I could go into management, but that would be giving in to the dark

Thanks all

On 1/18/07, Mark Miller <> wrote:
> Just threw together a highlighter that can handle spans (combining a
> rewrite with dumpSpans from LIA) and used this:
> Nice spans extractor from Mark (not me <G>). Give it a query it will
> give you the spans.
> - Mark
> Erick Erickson wrote:
> > Hi again.
> >
> > I've been struggling for the last couple of days and getting nowhere, so
> > it's time to swallow my pride and say "Help"....
> >
> > OK, let's say I have a document indexed and I do NOT have access to the
> > raw text. I need to find the offset of all the hits for a query on a
> > single document. Advice was offered a while ago to use getSpans from a
> > spanquery, but for the life of me I don't see how to make this work. As
> > I remember, Erik was talking about rewriting the original query as a set
> > of spans.
> >
> > The trouble I'm having is that I sure don't see how to rewrite the
> > standard query as a span query, then feed that back into my index for a
> > particular document (that I have a unique ID for). It seems that the
> > getSpans looks through my entire index, which is *probably* prohibitive.
> >
> > I can make each part of the query into a SpanTermQuery. I can assemble
> > these together into a bunch of nested span queries. At the end of this,
> > I have a single Span query that I can call getSpans on. But what now? I
> > don't see how the spans relate to the document I need to focus on. From
> > what I see of the Spans interface, it's intended to look at the entire
> > index rather than be confined to a subset of the documents (in this
> > case, exactly one. Guaranteed).
> >
> > I've thought about putting the documentID in a MUST clause of a
> > BooleanQuery, and adding my span query to that, but it doesn't look like
> > getSpans does me any good there.
> >
> > I looked at the SrndQuery family and don't see anything there that lets
> > me get the offsets of my matches.
> >
> > I don't have the text, so I can't highlight all the hits and count.
> >
> > The code I've been writing feels like the wrong solution to the wrong
> > problem at the wrong time. Given that I know the document ID on the way
> > in, is my best bet to roll my own? That is, enumerate the relevant terms
> > in my document and measure the distance between the terms and aggregate
> > the results myself? I'd rather not do that, of course, but can if
> > necessary.
> >
> > I *want* someone to say "just call <fill in magic method here>"....
> >
> > Any help greatly appreciated...
> >
> > Thanks
> > Erick
> >
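For the roll-your-own option raised in the quoted message (enumerate the term positions in one document and measure the distance between them), here is a minimal sketch. It assumes the token positions of two terms within the single document have already been retrieved somehow and are sorted ascending; the names ProximityCounter and countNear are hypothetical, and this handles only a two-term, unordered "within N positions" match, not arbitrary nested span queries.

```java
final class ProximityCounter {
    /**
     * Counts pairs (a, b) with a from positionsA and b from positionsB such that
     * |a - b| <= maxDistance. Both arrays must be sorted in ascending order.
     */
    static int countNear(int[] positionsA, int[] positionsB, int maxDistance) {
        int count = 0;
        int lo = 0; // first index in positionsB that can still be within range
        for (int a : positionsA) {
            // Drop B positions that are too far to the left of a; they are also
            // too far left of every later a, because positionsA is ascending.
            while (lo < positionsB.length && positionsB[lo] < a - maxDistance) {
                lo++;
            }
            // Count B positions inside the window [a - maxDistance, a + maxDistance].
            for (int j = lo; j < positionsB.length && positionsB[j] <= a + maxDistance; j++) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] termA = {3, 20, 50};
        int[] termB = {5, 18, 22, 90};
        System.out.println(countNear(termA, termB, 3));
    }
}
```

Counting pairs is only one possible definition of a "hit"; a phrase or ordered match would need the comparison tightened to fit.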