lucene-java-user mailing list archives

From "Ard Schrijvers" <a.schrijv...@hippo.nl>
Subject RE: Does Index have a Tokenizer Built into it
Date Fri, 13 Jul 2007 07:58:18 GMT
Hello,

> I'm wondering if after opening the index I can retrieve the Tokens
> (not the terms) of a document, something akin to
> IndexReader.Document(n).getTokenizer().

It is not possible to get the original tokens of a document back from the index if you
haven't stored the document, because:

1) the analyzer may have removed stop words during indexing
2) the terms in the Lucene index may be stemmed words, synonyms, etc.
3) how would you expect things like spaces, commas, and dots to be restored?

Also, I think what you want does not fit the way an inverted index works. When you do not
store the document, you always lose information about it during analysis and indexing.
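
To make the point concrete, here is a minimal plain-Java sketch of a toy analyzer and inverted index. It does not use Lucene at all: the class name, the stop-word list, and the crude "strip a trailing s" stemming are assumptions made up purely for illustration. It only shows that rebuilding text from terms and positions gives back something other than the original tokens.

```java
import java.util.*;

// Toy illustration only; this is NOT Lucene code. All names are hypothetical.
class LossyIndexDemo {
    // Hypothetical stop-word list, chosen just for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "and", "a", "of"));

    /** Analyzes the text and builds a tiny inverted index: term -> positions. */
    static SortedMap<String, List<Integer>> index(String text) {
        SortedMap<String, List<Integer>> postings = new TreeMap<>();
        int position = 0;
        // Splitting on non-alphanumerics throws away spaces, commas, dots.
        for (String raw : text.split("[^A-Za-z0-9]+")) {
            if (raw.isEmpty()) {
                continue; // leading delimiter produces an empty first element
            }
            String term = raw.toLowerCase(Locale.ROOT); // case is lost
            int here = position++; // stop words still consume a position
            if (STOP_WORDS.contains(term)) {
                continue; // stop words never reach the index
            }
            if (term.length() > 3 && term.endsWith("s")) {
                term = term.substring(0, term.length() - 1); // crude stemming
            }
            postings.computeIfAbsent(term, k -> new ArrayList<>()).add(here);
        }
        return postings;
    }

    /** Best-effort reconstruction: lay the indexed terms out by position. */
    static String reconstruct(Map<String, List<Integer>> postings) {
        SortedMap<Integer, String> byPosition = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            for (int p : entry.getValue()) {
                byPosition.put(p, entry.getKey());
            }
        }
        return String.join(" ", byPosition.values());
    }

    public static void main(String[] args) {
        String original = "The cats, and the dogs.";
        System.out.println(reconstruct(index(original))); // prints: cat dog
    }
}
```

The positions (1 and 4) still show that something was dropped in between, but not what it was, and the punctuation, case, and original word forms are gone.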

How many documents are you talking about? They must be either somewhere on the file system
or accessible over HTTP. When you need the document, why not just provide a link to the
original location?

Regards Ard

> 
> In summary:
> 
> My current (too wasteful) implementation is this:
> 
> new StandardTokenizer(new StringReader(
>     indexReader.document(n).get("text")))
> 
> I'm wondering if Lucene has a more efficient way to retrieve the tokens
> of a document from an index, because it seems to have information about
> every "term" already, since you can retrieve a TermPositions object.
> 
> Thanks,
> 
> 
> --JP
> 


