lucene-dev mailing list archives

From Doug Cutting
Subject RE: Token retrieval question
Date Wed, 10 Oct 2001 17:19:02 GMT
Right now, Lucene does not have good support for what you're doing.  Lucene
as it stands is designed to support basic search, not other statistical text
processing.  However there are two features that I would like to add to
Lucene that would help you.

1. Seekable TermDocs.

This would let you efficiently skip forward in a TermDocs to a particular
document number.  This would enable some search optimizations.  This
requires no API changes, as the TermDocs.skipTo() method already exists.
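
As a rough illustration of what seeking buys you, here is a minimal sketch.
It assumes the document numbers for one term are held in a sorted in-memory
array. SeekableDocs and doc() are hypothetical stand-ins, not Lucene's
TermDocs implementation, and the skipTo() semantics shown (move to the first
entry whose document number is >= the target) are an assumption made for the
example.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a seekable postings iterator; not Lucene code. */
class SeekableDocs {
    private final int[] docs; // document numbers, ascending, no duplicates
    private int pos = -1;     // index of the current entry, -1 before the first

    SeekableDocs(int[] sortedDocs) {
        this.docs = sortedDocs.clone();
    }

    /**
     * Moves to the first entry whose document number is >= target,
     * searching only from the current position onward.
     * Returns false if no such entry exists.
     */
    boolean skipTo(int target) {
        int from = Math.max(pos, 0);
        int i = Arrays.binarySearch(docs, from, docs.length, target);
        if (i < 0) {
            i = -i - 1; // insertion point: first entry greater than target
        }
        pos = i;
        return pos < docs.length;
    }

    /** Current document number; only valid after skipTo() returned true. */
    int doc() {
        return docs[pos];
    }
}
```

With one such iterator per query term, checking whether a given hit contains
a term becomes a single skipTo(hitDoc) followed by doc() == hitDoc, rather
than a linear walk through the postings.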

2. Stored Document Vectors

These would enable one to determine the set of terms in a document.  This
would be useful for, e.g. document clustering.

This would add two methods to IndexReader:
  public TermFreqVector getTermFreqVector(int docNumber);
  public Term getTerm(int termNumber);
The TermFreqVector class would be defined something like:
  public class TermFreqVector {
    public int[] getTermNumbers();
    public int[] getTermFrequencies();
  }
The term number array would be sorted.  The frequency of the term numbered
getTermNumbers()[i] is getTermFrequencies()[i].
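
Because the term numbers are sorted, a caller could look up the frequency of
a given term with a binary search. The sketch below is one possible concrete
shape, not a committed design: SimpleTermFreqVector and the freq() helper
are hypothetical names that are not part of the proposal above.

```java
import java.util.Arrays;

/** Hypothetical concrete form of the proposed TermFreqVector. */
class SimpleTermFreqVector {
    private final int[] termNumbers;     // sorted ascending
    private final int[] termFrequencies; // parallel to termNumbers

    SimpleTermFreqVector(int[] termNumbers, int[] termFrequencies) {
        if (termNumbers.length != termFrequencies.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        this.termNumbers = termNumbers.clone();
        this.termFrequencies = termFrequencies.clone();
    }

    int[] getTermNumbers() {
        return termNumbers;
    }

    int[] getTermFrequencies() {
        return termFrequencies;
    }

    /** Assumed helper: frequency of termNumber in this document, 0 if absent. */
    int freq(int termNumber) {
        int i = Arrays.binarySearch(termNumbers, termNumber);
        return i >= 0 ? termFrequencies[i] : 0;
    }
}
```

For the 20-50 query terms in the original question, this would mean one
vector fetch per hit plus one binary search per term.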

Another class that would be useful is something like:
  public class TermWeightVector {
    public int[] getTermNumbers();
    public float[] getTermWeights();

    public void add(TermWeightVector other);
    public float distance(TermWeightVector other);
  }
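
To make add() and distance() concrete, here is a minimal sketch over sorted
sparse vectors. The proposal above does not say which distance is meant;
Euclidean distance is assumed here purely for illustration, and
SimpleTermWeightVector is a hypothetical name.

```java
import java.util.Arrays;

/** Hypothetical sparse weight vector; Euclidean distance is an assumption. */
class SimpleTermWeightVector {
    private int[] termNumbers;   // sorted ascending
    private float[] termWeights; // parallel to termNumbers

    SimpleTermWeightVector(int[] termNumbers, float[] termWeights) {
        this.termNumbers = termNumbers.clone();
        this.termWeights = termWeights.clone();
    }

    int[] getTermNumbers() { return termNumbers; }
    float[] getTermWeights() { return termWeights; }

    /** Adds other into this vector by merging the two sorted term lists. */
    void add(SimpleTermWeightVector other) {
        int[] a = termNumbers, b = other.termNumbers;
        int[] nums = new int[a.length + b.length];
        float[] wts = new float[nums.length];
        int i = 0, j = 0, n = 0;
        while (i < a.length || j < b.length) {
            if (j == b.length || (i < a.length && a[i] < b[j])) {
                nums[n] = a[i]; wts[n] = termWeights[i++];        // only in this
            } else if (i == a.length || b[j] < a[i]) {
                nums[n] = b[j]; wts[n] = other.termWeights[j++];  // only in other
            } else {
                nums[n] = a[i];                                   // in both
                wts[n] = termWeights[i++] + other.termWeights[j++];
            }
            n++;
        }
        termNumbers = Arrays.copyOf(nums, n);
        termWeights = Arrays.copyOf(wts, n);
    }

    /** Euclidean distance; a term missing from one vector has weight 0. */
    float distance(SimpleTermWeightVector other) {
        int[] a = termNumbers, b = other.termNumbers;
        double sum = 0;
        int i = 0, j = 0;
        while (i < a.length || j < b.length) {
            double d;
            if (j == b.length || (i < a.length && a[i] < b[j])) {
                d = termWeights[i++];
            } else if (i == a.length || b[j] < a[i]) {
                d = other.termWeights[j++];
            } else {
                d = termWeights[i++] - other.termWeights[j++];
            }
            sum += d * d;
        }
        return (float) Math.sqrt(sum);
    }
}
```

A clustering routine could then use add() to accumulate a centroid and
distance() to assign documents to the nearest one.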

Both of these are long-term changes, so it may be a while before they are
completed.  That said, I would like to implement them, when I have time!


> -----Original Message-----
> From: Nestel, Frank
> Sent: Wednesday, October 10, 2001 12:23 AM
> To: ''
> Subject: Token retrieval question
> Hi,
> I've been reading the API and I couldn't figure out a
> nice and fast way to solve the following problem:
> I'd like to enumerate the tokens of a document (or
> document field). Do the internal data structures
> of Lucene allow this kind of traversal, which is (as
> I understand) of course orthogonal to the access Lucene
> is optimized for?
> More concretely, I have about 20-50 tokens/words and one
> document, and I'd like to ask the document whether (and how often)
> it contains those particular tokens. The idea was to augment
> search results with (sort of, I know) automatic query-dependent
> keywords.
> The only way I see right now is to create 20-50 TermEnums
> and walk through them until I end up in my document or
> nowhere, which is probably not feasible for a search result
> page with (say) 20 hits in a larger index.
> Is there a more elegant option that I missed?
> Thank you,
> Frank
> --
> Dr. Frank Sven Nestel
> Principal Software Engineer
> COI GmbH    Erlanger Straße 62, D-91074 Herzogenaurach
> Phone +49 (0) 9132 82 4611 
>           COI - Solutions for Documents