lucene-java-user mailing list archives

From Karl Wettin <karl.wet...@gmail.com>
Subject Re: Extract the text that was indexed
Date Wed, 31 Dec 2008 00:30:05 GMT
On 30 Dec 2008, at 17:13, Lebiram wrote:

Hi Lebiram,

contrib/miscellaneous contains a couple of tools that might be of help.

> Just wanted to reconstruct a new index based on an existing  
> index(but turning off norms) that's all.

If you want to create an identical index but without norms use  
FieldNormModifier in contrib/miscellaneous.

> However, as it is nearly impossible to extract the terms  of  
> unstored fields, we might think of other ways.

Not impossible, just time consuming. The easiest way is to reconstruct
the token stream of each field using the term frequency vector. If you
haven't stored it, there is a class called TermVectorAccessor in
contrib/miscellaneous that allows you to visit the term vector even
though it is not stored, i.e. it will construct it by enumerating the
inverted index.
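
To illustrate the reconstruction step itself, here is a rough sketch in
plain Java. It does not call any Lucene API: the class name
TokenStreamSketch, the method reconstruct and the term -> positions map
are all hypothetical stand-ins for whatever the term vector (stored, or
visited through TermVectorAccessor) gives you for one field of one
document. It assumes positions are available.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene.
public class TokenStreamSketch {

    /**
     * Rebuilds the token order of one field from term -> positions data,
     * i.e. the kind of information a term vector with positions holds.
     */
    static List<String> reconstruct(Map<String, int[]> termPositions) {
        // Group terms by position; TreeMap keeps positions in ascending order.
        // Several terms may share a position (e.g. injected synonyms).
        TreeMap<Integer, List<String>> byPosition = new TreeMap<>();
        for (Map.Entry<String, int[]> entry : termPositions.entrySet()) {
            for (int position : entry.getValue()) {
                byPosition.computeIfAbsent(position, k -> new ArrayList<>())
                          .add(entry.getKey());
            }
        }
        // Flatten into a single ordered token list.
        List<String> tokens = new ArrayList<>();
        for (List<String> termsAtPosition : byPosition.values()) {
            tokens.addAll(termsAtPosition);
        }
        return tokens;
    }
}
```

Note that what comes out is the analyzed terms in position order, not
the original text: anything the analyzer removed or changed (stop words,
case, stemming) is not recoverable this way.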

Remember that if you reconstruct a token stream via the term vector, no
payloads will be available. If you use payloads, it would be a simple
thing to patch TermVectorAccessor in order to set the payloads in the
tokens. Feel free to post such a patch in Jira, it would be a nice
addition to that code.


     karl

