lucene-dev mailing list archives

From Doug Cutting <DCutt...@grandcentral.com>
Subject RE: context and hit positions with Lucene
Date Fri, 05 Oct 2001 16:15:14 GMT
Please see also Maik Schreiber's message on this topic:

  http://www.geocrawler.org/archives/3/2624/2001/9/50/6553088/

The approach is to re-tokenize hit documents, scanning for query terms.  The
index does not store the byte-position of words in the original document.
Only the tokenizer has that information.  The index only stores the ordinal
position, e.g., that a term was the twelfth term in a document, while the
tokenizer can tell you, e.g., that a term occurs between bytes 291 and 301
in the text, which is what you need for highlighting.
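To illustrate the distinction, here is a minimal sketch that does not use
Lucene at all. The class PositionVsOffset, the Tok type and the tokenize
method are hypothetical names invented for this example, and the tokenizer
(lower-cased runs of letters and digits) is an assumption, not what any
Lucene Analyzer does. It reports character offsets within a String rather
than byte offsets in a file.

```java
import java.util.ArrayList;
import java.util.List;

class PositionVsOffset {
    /** Hypothetical token: term text, ordinal position, and character span. */
    static final class Tok {
        final String term;
        final int position; // ordinal: 0 for the first term, 1 for the second, ...
        final int start;    // offset of the first character (inclusive)
        final int end;      // offset just past the last character (exclusive)

        Tok(String term, int position, int start, int end) {
            this.term = term;
            this.position = position;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits text into lower-cased runs of letters and digits. */
    static List<Tok> tokenize(String text) {
        List<Tok> tokens = new ArrayList<>();
        int i = 0;
        int position = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++; // skip separators; they advance offsets but not positions
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            tokens.add(new Tok(text.substring(start, i).toLowerCase(), position++, start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Tok t : tokenize("Heavy metal, light metal.")) {
            System.out.println(t.term + " position=" + t.position
                + " chars=" + t.start + "-" + t.end);
        }
    }
}
```

For the sample text, "metal" occurs at positions 1 and 3, covering
characters 6-11 and 19-24. An index that keeps only the positions (1 and 3)
cannot tell a highlighter where those characters are.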

Perhaps we should add a utility method such as:

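  // Returns the tokens of `text` whose term text is in `queryTerms` (a Set
  // of Strings). Each Token carries its offsets in the original text.
  public static Set getHitTokens(Set queryTerms, Reader text, Analyzer a)
    throws IOException {
    TokenStream ts = a.tokenStream(text);
    Set hitTokens = new HashSet();
    // Re-tokenize the document, keeping only tokens that match a query term.
    for (Token token = ts.next(); token != null; token = ts.next()) {
      if (queryTerms.contains(token.termText())) {
        hitTokens.add(token);
      }
    }
    return hitTokens;
  }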

(I have not tested this code.)
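Once the offsets of the hit tokens are known, highlighting amounts to copying
the text and wrapping each hit span in markup. The following is a rough,
Lucene-free sketch of that step. HighlightSketch and highlight are
hypothetical names, the inline tokenizer (lower-cased runs of letters and
digits) stands in for a real Analyzer, and the <b> markup is only an
assumption about the desired output. It does not escape HTML in the source
text.

```java
import java.util.Set;

class HighlightSketch {
    /**
     * Re-tokenizes text and wraps every token found in queryTerms in <b>...</b>.
     * queryTerms is expected to contain lower-cased term text.
     */
    static String highlight(String text, Set<String> queryTerms) {
        StringBuilder out = new StringBuilder();
        int copied = 0; // everything before this offset is already in `out`
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            String term = text.substring(start, i).toLowerCase();
            if (queryTerms.contains(term)) {
                // Copy the untouched text before the hit, then the marked-up hit.
                out.append(text, copied, start);
                out.append("<b>").append(text, start, i).append("</b>");
                copied = i;
            }
        }
        out.append(text, copied, text.length()); // trailing text after the last hit
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(
            highlight("Heavy metal, light metal.", java.util.Collections.singleton("metal")));
    }
}
```

The original casing and punctuation survive because the output is built from
offsets into the source text rather than from the normalized term text.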

What class would we add this to?  If we add it to Query then it could take a
Query instead of a Set.  As Maik points out, there is currently no public
method that returns the set of terms in a query.  That should probably be
added in any case.

Doug

> -----Original Message-----
> From: Lee Mallabone [mailto:lee@grantadesign.com]
> Sent: Thursday, October 04, 2001 9:00 AM
> To: lucene-dev@jakarta.apache.org
> Subject: context and hit positions with Lucene
> 
> 
> Hi,
> 
> I've been lurking around the Lucene source code for about a week now...
> There are a couple of things I can't work out how to do properly, and I'd
> be grateful for any help with them.
> 
> I'm having a bit of trouble using hit positions in a test application, the
> results of which look like I may need to contribute some code to Lucene for
> things to work as I'd like.
> 
> At the moment, I'm doing something along the lines of the following, to
> retrieve hit positions:
> 
> // Open an index and retrieve the hit positions object
> IndexReader reader = IndexReader.open("index_file");
> TermPositions hitPoints = reader.termPositions(new Term("contents",
> "metal"));
> TermDocs docs = (TermDocs) hitPoints;
> 
> // While a document remains, loop
> while ( docs.next())
> {
>   out.print("Finding hit values for document <b>"+ docs.doc()+"</b>");
>   for (int j=0; j<docs.freq(); j++)
>   {
>     // Output the hit position
>     out.print(", "+hitPoints.nextPosition());
>   }
>   out.println("<br>");
> }
> reader.close();
> 
> I'm not able to do a great deal with that information at the moment. What
> I'd really like to be able to do is get the relevant info in my actual
> search results loop. So I'd call something like this:
> 
> while (search_results_remain) {
>   Document doc = hits.doc(i);
>   int[] documentHitPositions = doc.getHitPositions();
>   // display fragments with 3 hits in the context text
>   String someContextInfo = hits.getContextInfo(i, 3);
> }
> 
> My main difficulties with the existing way of doing things are:
> 1) The call to termPositions() doesn't integrate with QueryParser.parse(),
> and that appears to be the only correct way to use complex queries such as
> wildcards, booleans, etc.
> Is there any way, given a query, to get the list of 'Term' objects that were
> created for the query? This would help me to an extent as I'd be able to
> generate complete hit positions, rather than just for an arbitrary term.
> 2) Retrieving the hit positions doesn't integrate with the 'Hits' or
> Document objects, where it would be the most convenient, imho (as in my
> example, above). Is it feasible to integrate such functionality?
> 
> Showing some amount of context for each search result is something that my
> company considers to be really important for adopting any search engine.
> Could anyone point me in the right direction for what changes, if any, need
> to be made to facilitate such a thing? If so, I may well be allowed to
> contribute to Lucene on company time. From browsing the source and the
> documentation, it appears that various things are in place to facilitate
> implementing context information; I'm just not sure where exactly to
> start...
> 
> Regards,
> 
> Lee Mallabone
> Granta Design Ltd.
> 
> 
> 
