lucene-java-user mailing list archives

From John Wang <john.w...@gmail.com>
Subject Re: indexing help
Date Wed, 07 Jul 2004 23:37:09 GMT
Hi Doug:
     Thanks for the response!

     The solution you proposed is still a variation on creating a
dummy document stream. Taking the same example, java (5), lucene (6),
VectorTokenStream would create a total of 11 Tokens where only 2 are
necessary.

    Given many documents with many terms and frequencies, it would
create many extra Token instances.
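
    To make the count concrete, here is a minimal sketch in plain Java. It
only models the expansion and uses no Lucene classes; the class and method
names (TokenCountSketch, expand) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the expansion; it does not use any Lucene classes.
class TokenCountSketch {

    // Repeats each term freqs[i] times, the way a frequency-driven
    // token stream would have to emit it.
    static List<String> expand(String[] terms, int[] freqs) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            for (int j = 0; j < freqs[i]; j++) {
                tokens.add(terms[i]);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] terms = {"java", "lucene"};
        int[] freqs = {5, 6};
        List<String> tokens = expand(terms, freqs);
        System.out.println("distinct terms: " + terms.length);   // 2
        System.out.println("tokens emitted: " + tokens.size());  // 11
    }
}
```

    The number of objects grows with the sum of the frequencies, not with
the number of distinct terms.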

   The reason I was looking into deriving from the Field class is that
I could then manipulate the FieldInfo directly by setting the frequency.
But the class is final...

   Any other suggestions?

Thanks

-John

On Wed, 07 Jul 2004 14:20:24 -0700, Doug Cutting <cutting@apache.org> wrote:
> John Wang wrote:
> >      While lucene tokenizes the words in the document, it counts the
> > frequency and figures out the position. We are trying to bypass this
> > stage: for each document, I have a set of words with a known frequency,
> > e.g. java (5), lucene (6) etc. (I don't care about the position, so it
> > can always be 0.)
> >
> >      What I can do now is to create a dummy document, e.g. "java java
> > java java java lucene lucene lucene lucene lucene lucene" and pass it to
> > lucene.
> >
> >      This seems hacky and cumbersome. Is there a better alternative? I
> > browsed around in the source code, but couldn't find anything.
> 
> Write an analyzer that returns terms with the appropriate distribution.
> 
> For example:
> 
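> import org.apache.lucene.analysis.Token;
> import org.apache.lucene.analysis.TokenStream;
> 
> public class VectorTokenStream extends TokenStream {
>   private final String[] terms;
>   private final int[] freqs;
>   private int term = -1;  // index of the current term; -1 before the first
>   private int freq = 0;   // occurrences of the current term still to emit
>   public VectorTokenStream(String[] terms, int[] freqs) {
>     this.terms = terms;
>     this.freqs = freqs;
>   }
>   // Emits each term freqs[i] times, then null at end of stream.
>   public Token next() {
>     while (freq == 0) {  // advance, skipping terms with zero frequency
>       term++;
>       if (term >= terms.length)
>         return null;
>       freq = freqs[term];
>     }
>     freq--;
>     return new Token(terms[term], 0, 0);
>   }
> }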
> 
> Document doc = new Document();
> doc.add(Field.Text("content", ""));
> indexWriter.addDocument(doc, new Analyzer() {
>   public TokenStream tokenStream(String field, Reader reader) {
>     return new VectorTokenStream(new String[] {"java","lucene"},
>                                  new int[] {5,6});
>   }
> });
> 
> >       Too bad the Field class is final, otherwise I could derive from
> > it and do something along those lines...
> 
> Extending Field would not help.  That's why it's final.
> 
> Doug
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
>
