lucene-dev mailing list archives

From "Nestel, Frank" <frank.nes...@coi.de>
Subject RE: Token retrieval question
Date Thu, 11 Oct 2001 13:36:51 GMT

Hey, great, at least my ideas are not entirely wrong.

It seems that either solution would suffice for me right
now. Solution 2 would be the more elegant route.
I cannot estimate how much work is involved. How much
time do you expect it to need? What can I contribute?

The fact is that my serious application is not at all
urgent, and the less serious one is only a private
research project, but it could use the feature as soon as it is
there (in case someone is interested, visit
	http://frank.spieleck.de/metasuch,
but not all at once :-) ). On the other hand, I'll be away from
computers for the next two weeks.

I guess I'll just stick to this list and see what happens.

And anyway it is great to have Lucene!

Thank you,

Frank

> -----Original Message-----
> From: Doug Cutting [mailto:DCutting@grandcentral.com]
> Sent: Wednesday, October 10, 2001 19:19
> To: 'lucene-dev@jakarta.apache.org'
> Subject: RE: Token retrieval question
> 
> Right now, Lucene does not have good support for what you're doing.
> Lucene as it stands is designed to support basic search, not other
> statistical text processing.  However, there are two features that I
> would like to add to Lucene that would help you.
> 
> 1. Seekable TermDocs.
> 
> This would let you efficiently skip forward in a TermDocs to a
> particular document number.  This would enable some search
> optimizations.  This requires no API changes, as the
> TermDocs.skipTo() method already exists.
> 
> 2. Stored Document Vectors
> 
> These would enable one to determine the set of terms in a document.
> This would be useful for, e.g., document clustering.
> 
> This would add two methods to IndexReader:
>   public TermFreqVector getTermFreqVector(int docNumber);
>   public Term getTerm(int termNumber);
> The TermFreqVector class would be defined something like:
>   public class TermFreqVector {
>     public int[] getTermNumbers();
>     public int[] getTermFrequencies();
>   }
> The term number array would be sorted.  The frequency of the term
> numbered getTermNumbers()[i] is getTermFrequencies()[i].
> 
> Another class that would be useful is something like:
>   public class TermWeightVector {
>     public int[] getTermNumbers();
>     public float[] getTermWeights();
> 
>     public void add(TermWeightVector other);
>     public float distance(TermWeightVector other);
>   }
> 
> Both of these are long-term changes, so it may be a while before
> they are completed.  That said, I would like to implement them, when
> I have time!
> 
> Doug
> 
> > -----Original Message-----
> > From: Nestel, Frank [mailto:frank.nestel@coi.de]
> > Sent: Wednesday, October 10, 2001 12:23 AM
> > To: 'lucene-dev@jakarta.apache.org'
> > Subject: Token retrieval question
> > 
> > 
> > 
> > Hi,
> > 
> > I've been reading the API and I couldn't figure out a
> > nice and fast way to solve the following problem:
> > 
> > I'd like to enumerate the tokens of a document (or
> > document field). Do Lucene's internal data structures
> > allow this kind of traversal, which is (as I understand
> > it) orthogonal to the access Lucene is optimized for?
> > 
> > More concretely, I have something like 20-50 tokens/words and
> > one document, and I'd like to ask the document whether (and how
> > often) it contains those particular tokens. The idea is to augment
> > search results with (sort of) automatic, query-dependent
> > keywords.
> > 
> > The only way I see right now is to create 20-50 TermEnums
> > and walk through them until I end up in my document or
> > nowhere, which is probably not feasible for a search result
> > page with (say) 20 hits in a larger index.
> > 
> > Is there a (more elegant) option that I missed?
> > 
> > Thank you,
> > Frank
> > 
> > --
> > Dr. Frank Sven Nestel
> > Principal Software Engineer
> > 
> > COI GmbH    Erlanger Straße 62, D-91074 Herzogenaurach
> > Phone +49 (0) 9132 82 4611 
> > http://www.coi.de, mailto:Frank.Nestel@coi.de
> >           COI - Solutions for Documents
> > 
> 
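
A minimal sketch of how the proposed stored document vectors could answer
the original question (does this document contain these 20-50 tokens, and
how often?). Because the term number array is described as sorted, each
lookup can be a binary search. The TermFreqVector below is a hypothetical
stand-in built from the signatures in the proposal, not an existing Lucene
class; the class name TermVectorLookup, the helper frequencyOf, and the
sample numbers are assumptions made for illustration. Mapping a query
token to its term number is assumed to happen elsewhere, since the
proposal only shows the reverse direction (getTerm(int termNumber)).

```java
import java.util.Arrays;

public class TermVectorLookup {

    /** Hypothetical stand-in for the proposed TermFreqVector; not a real Lucene class. */
    static final class TermFreqVector {
        private final int[] termNumbers;      // sorted ascending, as the proposal states
        private final int[] termFrequencies;  // termFrequencies[i] belongs to termNumbers[i]

        TermFreqVector(int[] termNumbers, int[] termFrequencies) {
            if (termNumbers.length != termFrequencies.length) {
                throw new IllegalArgumentException("arrays must have equal length");
            }
            this.termNumbers = termNumbers;
            this.termFrequencies = termFrequencies;
        }

        int[] getTermNumbers() { return termNumbers; }
        int[] getTermFrequencies() { return termFrequencies; }
    }

    /** Returns how often the term occurs in the document, or 0 if it is absent. */
    static int frequencyOf(TermFreqVector vector, int termNumber) {
        // Binary search is valid only because the term numbers are sorted.
        int i = Arrays.binarySearch(vector.getTermNumbers(), termNumber);
        return i >= 0 ? vector.getTermFrequencies()[i] : 0;
    }

    public static void main(String[] args) {
        // Made-up data: the document contains terms 3, 17 and 42.
        TermFreqVector doc =
            new TermFreqVector(new int[] {3, 17, 42}, new int[] {2, 1, 5});
        int[] queryTerms = {17, 20, 42};
        for (int term : queryTerms) {
            System.out.println("term " + term + " occurs "
                + frequencyOf(doc, term) + " time(s)");
        }
    }
}
```

With one vector per hit, checking k tokens against a document holding n
distinct terms costs roughly k * log(n) comparisons, and no term
enumeration has to be walked for each token.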
