lucene-dev mailing list archives

From Dawid Weiss <dawid.we...@cs.put.poznan.pl>
Subject Synonym filter with support for phrases?
Date Wed, 22 Apr 2009 08:49:51 GMT

Hello everyone,

I'm looking for feedback and thoughts on the following problem (it's more of a 
development problem than a user-centered one, so I hope the dev list is appropriate):

- a token stream is given,

- a set of "synonyms" is given, where each entry maps a token sequence to be matched 
to one or more token sequences to be added as synonyms.

An example to make things clearer (apologies for lame synonyms). Given a set of 
synonyms like this:

{"new", "york"} -> {
	{"big", "apple"}},

{"restaurant"}  -> {
	{"diner"},
	{"food", "place"},
	{"full", "belly"}}

a token stream (I try to indicate positional information here):

0 | 1   | 2          | 3  | 4   | 5
a | new | restaurant | in | new | york

would be enriched to the following index (note the overlapping tokens at the same positions):

0 | 1   | 2          | 3     | 4   | 5
a | new | restaurant | in    | new | york
  |     | diner      |       | big | apple
  |     | food       | place |     |
  |     | full       | belly |     |
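
The enrichment above can be sketched in a few lines. This is only an illustration in plain Python, not Lucene's API and not the token filter mentioned in this message: the function name expand_synonyms and the dict-of-tuples representation of the synonym set are assumptions made for the example. It matches each synonym key against the stream at every position and stacks the replacement terms onto consecutive positions starting at the match.

```python
from collections import defaultdict


def expand_synonyms(tokens, synonyms):
    """Return {position: [terms]} holding the original tokens plus synonyms.

    tokens   -- list of terms; the list index is the token position
    synonyms -- dict mapping a tuple of terms to a list of replacement tuples
    (Both representations are assumptions made for this sketch.)
    """
    positions = defaultdict(list)
    for pos, term in enumerate(tokens):
        positions[pos].append(term)

    for start in range(len(tokens)):
        for phrase, alternatives in synonyms.items():
            # Does the key phrase occur in the stream starting at `start`?
            if tuple(tokens[start:start + len(phrase)]) != phrase:
                continue
            # Stack every replacement phrase onto consecutive positions.
            for alt in alternatives:
                for offset, term in enumerate(alt):
                    positions[start + offset].append(term)
    return dict(positions)


synonyms = {
    ("new", "york"): [("big", "apple")],
    ("restaurant",): [("diner",), ("food", "place"), ("full", "belly")],
}
tokens = ["a", "new", "restaurant", "in", "new", "york"]
index = expand_synonyms(tokens, synonyms)
for pos in sorted(index):
    print(pos, index[pos])
```

A replacement longer than the matched phrase simply spills onto the following positions, which is how "place" and "belly" end up sharing position 3 with "in".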

The point is for phrase queries to work for synonyms and for the original text 
(of course multi-word synonyms longer than the original phrase would overlap 
with the text, but this shouldn't be much of a worry).
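
To make the phrase-query point concrete, here is a small check over a position-to-terms map that mirrors the enriched index shown earlier. It is a simplified model written in plain Python, not how Lucene evaluates phrase queries; the function name phrase_matches and the dict layout are assumptions for the example. A phrase matches when its terms can be found at consecutive positions.

```python
def phrase_matches(index, phrase):
    """True if the terms of `phrase` occur at consecutive positions in `index`."""
    for start in index:
        if all(term in index.get(start + i, ())
               for i, term in enumerate(phrase)):
            return True
    return False


# Position -> terms, copied by hand from the enriched example in this message.
index = {
    0: ["a"],
    1: ["new"],
    2: ["restaurant", "diner", "food", "full"],
    3: ["in", "place", "belly"],
    4: ["new", "big"],
    5: ["york", "apple"],
}

print(phrase_matches(index, ["new", "york"]))    # original text
print(phrase_matches(index, ["big", "apple"]))   # multi-word synonym
print(phrase_matches(index, ["a", "new", "diner"]))  # text mixed with a synonym
print(phrase_matches(index, ["place", "new"]))   # side effect of the overlap
```

The last query also matches in this model, because "place" overlaps "in" at position 3 and is therefore adjacent to "new" at position 4. That is the overlap described above.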

Lucene's current trunk has a synonym filter, but its implementation 
is not really suitable for achieving the above. I wrote a token filter that 
implements this functionality, but then I thought that synonyms are 
something people deal with frequently, so my questions are:

a) are there any thoughts on how the above could be implemented using existing 
Lucene infrastructure (perhaps I missed something obvious),

b) if (a) is not applicable, would such a token filter constitute a useful 
addition to Lucene?

Dawid


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

