lucene-java-user mailing list archives

From "Grant Ingersoll" <gsing...@syr.edu>
Subject Re: indexing help
Date Thu, 08 Jul 2004 15:10:07 GMT
Hey John,

Those are just options, didn't say they were good ones!  :-)

I guess the real question is: what is the background of what you are trying to do?  Presumably
you have some other program that is generating frequencies for you. Do you really need that
in its current form?  Can't the Lucene indexing engine act as a stand-in for this process,
since your end result _should_ be the same?  The Lucene Analyzer process is quite flexible;
I bet you could even find a way to hook your existing tools into it.

-Grant
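
For reference, the expansion discussed further down the thread (turning a term/frequency vector such as java (5), lucene (6) into a stream of repeated terms) can be sketched in plain Java with no Lucene dependency. This is only an illustration under stated assumptions: the class name TermFrequencyIterator and the helper expand are hypothetical and are not part of Lucene. In an analyzer, the same loop would return Token objects instead of strings.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical helper, not part of Lucene: emits terms[i] exactly freqs[i] times.
final class TermFrequencyIterator implements Iterator<String> {
    private final String[] terms;
    private final int[] freqs;
    private int term = 0;    // index of the term currently being emitted
    private int remaining;   // copies of terms[term] still to emit

    TermFrequencyIterator(String[] terms, int[] freqs) {
        if (terms.length != freqs.length) {
            throw new IllegalArgumentException("terms and freqs must have the same length");
        }
        this.terms = terms;
        this.freqs = freqs;
        this.remaining = terms.length > 0 ? freqs[0] : 0;
    }

    @Override
    public boolean hasNext() {
        // Skip terms whose count is exhausted (or was zero to begin with).
        while (term < terms.length && remaining <= 0) {
            term++;
            if (term < terms.length) {
                remaining = freqs[term];
            }
        }
        return term < terms.length;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        remaining--;
        return terms[term];
    }

    // Convenience: collect the whole expansion into a list.
    static List<String> expand(String[] terms, int[] freqs) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = new TermFrequencyIterator(terms, freqs);
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }
}
```

Calling TermFrequencyIterator.expand(new String[] {"java", "lucene"}, new int[] {5, 6}) produces 11 strings, one per occurrence, which is the per-occurrence cost John objects to in his reply to Doug.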

>>> john.wang@gmail.com 07/08/04 10:42AM >>>
Hi Grant:
     Thanks for the options. How likely is it that the Lucene file formats will change?

     Are there really no more options? :(...

Thanks

-John

On Thu, 08 Jul 2004 08:50:44 -0400, Grant Ingersoll <gsingers@syr.edu> wrote:
> Hi John,
> 
> The source code is available from CVS, make it non-final and do what you need to do.
> Of course, you may have a hard time finding help later if you aren't using something
> everyone else is and your solution doesn't work...  :-)
> 
> If I understand correctly what you are trying to do, you already know all of the answers
> for indexing, you just want Lucene to do the retrieval side of the coin, correct?  I suppose
> a crazy idea might be to write a program that took your info and output it in the Lucene file
> format, but that seems a bit like overkill.
> 
> -Grant
> 
> >>> john.wang@gmail.com 07/07/04 07:37PM >>>
> 
> 
> Hi Doug:
>     Thanks for the response!
> 
>     The solution you proposed is still a derivative of creating a
> dummy document stream. Taking the same example, java (5), lucene (6),
> VectorTokenStream would create a total of 11 Tokens whereas only 2 are
> necessary.
> 
>    Given many documents with many terms and frequencies, it would
> create many extra Token instances.
> 
>   The reason I was looking into deriving from the Field class is that I
> could then directly manipulate the FieldInfo by setting the frequency. But
> the class is final...
> 
>   Any other suggestions?
> 
> Thanks
> 
> -John
> 
> On Wed, 07 Jul 2004 14:20:24 -0700, Doug Cutting <cutting@apache.org> wrote:
> > John Wang wrote:
> > >      While lucene tokenizes the words in the document, it counts the
> > > frequency and figures out the position, we are trying to bypass this
> > > stage: For each document, I have a set of words with a known frequency,
> > > e.g. java (5), lucene (6) etc. (I don't care about the position, so it
> > > can always be 0.)
> > >
> > >      What I can do now is to create a dummy document, e.g. "java java
> > > java java java lucene lucene lucene lucene lucene lucene" and pass it
> > > to lucene.
> > >
> > >      This seems hacky and cumbersome. Is there a better alternative? I
> > > browsed around in the source code, but couldn't find anything.
> >
> > Write an analyzer that returns terms with the appropriate distribution.
> >
> > For example:
> >
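> > import org.apache.lucene.analysis.Token;
> > import org.apache.lucene.analysis.TokenStream;
> >
> > public class VectorTokenStream extends TokenStream {
> >   private String[] terms;
> >   private int[] freqs;
> >   private int term = -1;  // index of the current term; -1 before the first
> >   private int freq = 0;   // copies of the current term still to emit
> >   public VectorTokenStream(String[] terms, int[] freqs) {
> >     this.terms = terms;
> >     this.freqs = freqs;
> >   }
> >   public Token next() {
> >     while (freq == 0) {   // advance, skipping zero-frequency terms
> >       term++;
> >       if (term >= terms.length)
> >         return null;      // no more terms: end of stream
> >       freq = freqs[term];
> >     }
> >     freq--;
> >     return new Token(terms[term], 0, 0);
> >   }
> > }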
> >
> > Document doc = new Document();
> > doc.add(Field.Text("content", ""));
> > indexWriter.addDocument(doc, new Analyzer() {
> >   public TokenStream tokenStream(String field, Reader reader) {
> >     return new VectorTokenStream(new String[] {"java","lucene"},
> >                                  new int[] {5,6});
> >   }
> > });
> >
> > >       Too bad the Field class is final, otherwise I could derive from it
> > > and do something along those lines...
> >
> > Extending Field would not help.  That's why it's final.
> >
> > Doug
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org 
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org 