lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Roxana Angheluta <>
Subject Re: index phrases
Date Wed, 22 Jun 2005 07:43:11 GMT

>> Dear all,
>> I am using Lucene for indexing documents.
>> I would like to include phrases (of a certain maximum length given  
>> as a parameter) in the index. I know this is non-standard for e.g.  
>> searching, where a PhraseQuery can be built which makes use of the  
>> terms positions. However, I am not interested in searching, but  
>> rather in using the indexing terms for some statistics.
>> What would be an efficient way to do this? Is it possible to build  
>> phrases in a filter after tokenization?
> Roxana- could you give us a concrete example of what you're wanting  
> to do?
> A TokenFilter could certainly be used to aggregate multiple terms  
> into a single term that represents a phrase.  This would happen  
> during the analysis process, which occurs along with tokenization.
Hi Erik, thanks for the answer.
I would like to index the following document:

This is a sample document.

something like:
"this is"
"is a"
"a sample"
"this a"
"is sample"
"a document"
"sample document"
"this is a"
"is a sample"
"a sample document"

In this example the maximum length of an n-gram is 3 and the length of 
the moving window accross text is also 3.
In fact I would like a full analyzer to do the job, i.e. define a 
strategy to filter out/clean spurious n-grams: e.g. remove n-grams made 
out only/partially of stopwords, eliminate just stopwords from the n-gram.

Sebastian has kindly provided his code, which does the job.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message