lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sebastian Marius Kirsch <skir...@sebastian-kirsch.org>
Subject Re: index phrases
Date Tue, 21 Jun 2005 16:46:11 GMT
On Tue, Jun 21, 2005 at 04:01:41PM +0200, Roxana Angheluta wrote:
> I would like to include phrases (of a certain maximum length given as a 
> parameter) in the index. I know this is non-standard for e.g. searching, 
> where a PhraseQuery can be built which makes use of the terms positions. 
> However, I am not interested in searching, but rather in using the 
> indexing terms for some statistics.

I built a filter for exactly this purpose (statistics over word
combinations); I will send it to you directly. It uses some containers
from org.apache.commons.collections, so I guess it's not suitable yet
for inclusion in the lucene distribution.

The filter handles position increments > 1 by inserting fillers, so a
phrase "look at my car" will become "look _ _ car" when the words "at"
and "my" are filtered out from the stream, and "car" is assigned a
position increment of 3. But since StopFilter does not currently
produce position increments, you will hardly ever see them. The filter
does not handle position increments == 0. (But will produce them for
ngrams that start at the same position.)

I think there is also some stuff for ngram indexing in nutch.

Regards, Sebastian

-- 
Sebastian Kirsch <skirsch@sebastian-kirsch.org> [http://www.sebastian-kirsch.org/]

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message