lucene-java-user mailing list archives

From "John Paul Sondag" <jsond...@uiuc.edu>
Subject Re: Does Index have a Tokenizer Built into it
Date Fri, 13 Jul 2007 23:18:56 GMT
Ard,

I do have access to the URLs of the documents, but I will be making short
snippets for many pages (say about 20 hits per page, each of which needs a
snippet), so I was worried it would be inefficient to open each hit,
tokenize it, and then build the snippet. Of course, that cost has to be
weighed against the cost of the increased index size. I have been looking
into storing term vectors with positions in the index. It seems that by
doing this I will have access to everything that the Tokenizer gives me,
correct? Will I need to store the term text in order to access the actual
terms instead of the stemmed words?
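
To make the question concrete: a term vector with positions holds the
terms as the analyzer emitted them, keyed by position, so a stemmed term
stays stemmed, and any position the analyzer left unfilled has no term at
all. Below is a minimal sketch in plain Java. It assumes a hypothetical
term-to-positions map standing in for what the index would hand back; the
class name, method name, and sample data are illustrative and are not
Lucene API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TermVectorSketch {

    // Rebuilds the token sequence in position order from term -> positions
    // data, which is the shape of information a positional term vector holds.
    // Positions that no term occupies come back as null.
    public static List<String> rebuild(Map<String, int[]> termPositions) {
        int max = -1;
        for (int[] positions : termPositions.values()) {
            for (int p : positions) {
                max = Math.max(max, p);
            }
        }
        List<String> tokens =
            new ArrayList<>(Collections.nCopies(max + 1, (String) null));
        for (Map.Entry<String, int[]> e : termPositions.entrySet()) {
            for (int p : e.getValue()) {
                tokens.set(p, e.getKey());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical analyzed form of "The foxes are running", assuming an
        // analyzer that stems and that leaves position gaps for stop words.
        Map<String, int[]> tv = new LinkedHashMap<>();
        tv.put("fox", new int[] {1});
        tv.put("run", new int[] {3});
        System.out.println(rebuild(tv)); // prints [null, fox, null, run]
    }
}
```

The rebuilt sequence contains "fox" and "run", not "foxes" and "running",
so the original wording would still have to come from stored field text
or from the source document.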

Thanks for all your help,

--JP

On 7/13/07, Ard Schrijvers <a.schrijvers@hippo.nl> wrote:
>
> Hello,
>
> > I'm wondering if after opening the index I can retrieve the Tokens
> > (not the terms) of a document, something akin to
> > IndexReader.Document(n).getTokenizer().
>
> It is obviously not possible to get the original tokens of the document
> back when you haven't stored the document, because:
>
> 1) the analyzer might have removed stop words in the first place
> 2) the terms in the Lucene index may be stemmed words, synonyms, etc.
> 3) how would you expect things like spaces, commas, dots etc to be
> restored?
>
> Also, I think what you want does not fit how an inverted index works.
> When you do not store the document, you always lose information about
> the document during indexing/analyzing.
>
> How many documents are you talking about? They must be either somewhere on
> FS or accessible over http...when you need the document, why not just
> provide a link to the original location?
>
> Regards Ard
>
> >
> > In summary:
> >
> > My current (too wasteful) implementation is this:
> >
> > // indexReader is an open IndexReader; n is the document number
> > new StandardTokenizer(new StringReader(
> >     indexReader.document(n).get("text")))
> >
> > I'm wondering if Lucene has a more efficient way to retrieve the
> > tokens of a document from an index. It seems to have information
> > about every "term" already, since you can retrieve a TermPositions
> > object.
> >
> > Thanks,
> >
> >
> > --JP
> >
>
