lucene-dev mailing list archives

From Dave Kor <dave...@yahoo.com>
Subject RE: Token retrieval question
Date Thu, 11 Oct 2001 01:09:14 GMT
You can count me in on this :) 

--- Doug Cutting <DCutting@grandcentral.com> wrote:
> Right now, Lucene does not have good support for what you're doing.
> Lucene as it stands is designed to support basic search, not other
> statistical text processing.  However there are two features that I
> would like to add to Lucene that would help you.
> 
> 1. Seekable TermDocs.
> 
> This would let you efficiently skip forward in a TermDocs to a
> particular document number.  This would enable some search
> optimizations.  This requires no API changes, as the
> TermDocs.skipTo() method already exists.
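
For illustration, here is a minimal in-memory sketch of the skipTo() contract described above: move to the first entry beyond the current one whose document number is at least the target. The class name ArrayPostings and its array-backed storage are assumptions made for this example; it is not Lucene's on-disk implementation, only a way to see why a seekable cursor can beat calling next() repeatedly.

```java
import java.util.Arrays;

// Hypothetical stand-in for a TermDocs-like cursor over one term's postings.
// Not Lucene code: the postings live in a plain sorted int array.
class ArrayPostings {
    private final int[] docs; // document numbers, strictly increasing
    private int pos = -1;     // index of the current entry; -1 before the first

    ArrayPostings(int[] sortedDocs) {
        this.docs = sortedDocs.clone();
    }

    /** Advances to the next entry; returns false once the list is exhausted. */
    boolean next() {
        if (pos < docs.length) {
            pos++;
        }
        return pos < docs.length;
    }

    /** Current document number; valid only after next() or skipTo() returned true. */
    int doc() {
        return docs[pos];
    }

    /**
     * Moves to the first entry beyond the current one whose document number
     * is >= target. Uses binary search instead of stepping entry by entry.
     */
    boolean skipTo(int target) {
        int from = pos + 1;
        if (from >= docs.length) {
            pos = docs.length;
            return false;
        }
        int i = Arrays.binarySearch(docs, from, docs.length, target);
        pos = (i >= 0) ? i : -i - 1; // a negative result encodes the insertion point
        return pos < docs.length;
    }
}
```

With one such cursor per query term, intersecting two terms can leapfrog: each cursor skips to the other's current document instead of scanning every posting.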
> 
> 2. Stored Document Vectors
> 
> These would enable one to determine the set of terms in a document.
> This would be useful for, e.g. document clustering.
> 
> This would add two methods to IndexReader:
>   public TermFreqVector getTermFreqVector(int docNumber);
>   public Term getTerm(int termNumber);
> The TermFreqVector class would be defined something like:
>   public class TermFreqVector {
>     public int[] getTermNumbers();
>     public int[] getTermFrequencies();
>   }
> The term number array would be sorted.  The frequency of the term
> numbered getTermNumbers()[i] is getTermFrequencies()[i].
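
As a rough sketch of how the parallel arrays above could be used, the following fills in the proposed class with bodies. The constructor and the freqOf() helper are assumptions added for this example and are not part of the proposal; freqOf() relies only on the stated property that the term number array is sorted, which is what makes a binary search possible.

```java
import java.util.Arrays;

// Sketch of the proposed TermFreqVector. The two getters come from the
// proposal; the constructor and freqOf() are hypothetical additions.
class TermFreqVector {
    private final int[] termNumbers;     // sorted ascending
    private final int[] termFrequencies; // termFrequencies[i] belongs to termNumbers[i]

    TermFreqVector(int[] termNumbers, int[] termFrequencies) {
        if (termNumbers.length != termFrequencies.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        this.termNumbers = termNumbers;
        this.termFrequencies = termFrequencies;
    }

    public int[] getTermNumbers() {
        return termNumbers;
    }

    public int[] getTermFrequencies() {
        return termFrequencies;
    }

    /** Hypothetical helper: frequency of the given term in this document, 0 if absent. */
    public int freqOf(int termNumber) {
        int i = Arrays.binarySearch(termNumbers, termNumber);
        return i >= 0 ? termFrequencies[i] : 0;
    }
}
```

Given such a vector for one hit, checking 20-50 candidate terms against it is 20-50 binary searches over that document's terms, rather than a walk through each term's postings across the whole index.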
> 
> Another class that would be useful is something like:
>   public class TermWeightVector {
>     public int[] getTermNumbers();
>     public float[] getTermWeights();
> 
>     public void add(TermWeightVector other);
>     public float distance(TermWeightVector other);
>   }
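
One possible reading of add() and distance(), sketched under loud assumptions: the proposal does not say which distance is meant, so Euclidean distance is assumed here, and add() is assumed to sum weights in place, term by term. The constructor is also hypothetical. Both methods walk the two sorted term number arrays in step, as in a merge.

```java
import java.util.Arrays;

// Sketch of the proposed TermWeightVector as a sparse vector.
// Assumptions: Euclidean distance, in-place add(), hypothetical constructor.
class TermWeightVector {
    private int[] termNumbers;   // sorted ascending, no duplicates
    private float[] termWeights; // termWeights[i] belongs to termNumbers[i]

    TermWeightVector(int[] termNumbers, float[] termWeights) {
        this.termNumbers = termNumbers.clone();
        this.termWeights = termWeights.clone();
    }

    public int[] getTermNumbers() { return termNumbers; }
    public float[] getTermWeights() { return termWeights; }

    /** Adds other's weights to this vector, merging the two sorted term lists. */
    public void add(TermWeightVector other) {
        int[] a = termNumbers, b = other.termNumbers;
        int[] nums = new int[a.length + b.length];
        float[] wts = new float[nums.length];
        int i = 0, j = 0, n = 0;
        while (i < a.length || j < b.length) {
            // takeA / takeB: does the smallest remaining term occur in this / other?
            boolean takeA = j >= b.length || (i < a.length && a[i] <= b[j]);
            boolean takeB = i >= a.length || (j < b.length && b[j] <= a[i]);
            nums[n] = takeA ? a[i] : b[j];
            wts[n] = (takeA ? termWeights[i] : 0f) + (takeB ? other.termWeights[j] : 0f);
            if (takeA) i++;
            if (takeB) j++;
            n++;
        }
        termNumbers = Arrays.copyOf(nums, n);
        termWeights = Arrays.copyOf(wts, n);
    }

    /** Euclidean distance; a term missing from one vector counts as weight 0. */
    public float distance(TermWeightVector other) {
        int[] a = termNumbers, b = other.termNumbers;
        double sum = 0.0;
        int i = 0, j = 0;
        while (i < a.length || j < b.length) {
            boolean takeA = j >= b.length || (i < a.length && a[i] <= b[j]);
            boolean takeB = i >= a.length || (j < b.length && b[j] <= a[i]);
            double d = (takeA ? termWeights[i] : 0f) - (takeB ? other.termWeights[j] : 0f);
            sum += d * d;
            if (takeA) i++;
            if (takeB) j++;
        }
        return (float) Math.sqrt(sum);
    }
}
```

For clustering, add() would let a centroid accumulate member documents, and distance() would compare a document against it; a cosine measure would be an equally plausible choice.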
> 
> Both of these are long-term changes, so it may be a while before they
> are completed.  That said, I would like to implement them, when I
> have time!
> 
> Doug
> 
> > -----Original Message-----
> > From: Nestel, Frank [mailto:frank.nestel@coi.de]
> > Sent: Wednesday, October 10, 2001 12:23 AM
> > To: 'lucene-dev@jakarta.apache.org'
> > Subject: Token retrieval question
> > 
> > 
> > 
> > Hi,
> > 
> > I've been reading the API and I couldn't figure out a
> > nice and fast way to solve the following problem:
> > 
> > I'd like to enumerate the tokens of a document (or
> > document field). Do the internal data structures
> > of Lucene allow this kind of traversal, which is (as
> > I understand) of course orthogonal to the access Lucene
> > is optimized for?
> > 
> > More concretely, I have something like 20-50 tokens/words and one
> > document, and I'd like to ask the document if (and how often)
> > it contains those particular tokens. The idea was to augment
> > search results with (kind of, I know) automatic query-dependent
> > keywords.
> > 
> > The only way I see right now is to create 20-50 TermEnums
> > and walk through them until I end up in my document or
> > nowhere? Which is probably not feasible for a search result
> > page with (say) 20 hits in a larger index.
> > 
> > Any (more elegant) chance I missed?
> > 
> > Thank you,
> > Frank
> > 
> > --
> > Dr. Frank Sven Nestel
> > Principal Software Engineer
> > 
> > COI GmbH    Erlanger Straße 62, D-91074 Herzogenaurach
> > Phone +49 (0) 9132 82 4611 
> > http://www.coi.de, mailto:Frank.Nestel@coi.de
> >           COI - Solutions for Documents
> > 

