lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Yonik Seeley" <ysee...@gmail.com>
Subject Re: dash-words
Date Tue, 25 Jul 2006 15:42:35 GMT
On 7/25/06, Martin Braun <mbraun@uni-hd.de> wrote:
> Hi Yonik,
>
> >> I can't figure out what the parameters does. ;)
> >
> > Yes, it will fail without slop... I don't think there is a practical
> > way around that.
>
> I am trying to analyze your WordDelimiterFilter.
>
> If I have x-men, after analyzing (with catenateAll) I get this:
>
>
>  Analzying "The x-men story"
>         de.unihd.ub.ftsearch.WordIndexAnalyzer:
>                 [the] [x] [men] [xmen] [story]
>
>
> 1: [the:0->3:word]
> 2: [x:4->5:word]
> 3: [men:6->9:word] [xmen:4->9:word]
> 4: [story:10->15:word]
>
> 1: [the]
> 2: [x]
> 3: [men] [xmen]
> 4: [story]
>
>
> So a Phrase search to "The xmen story" will fail. With a slop of 1 the
> doc will be found.
>
> But when generating the query I won't know when to use a slop. So adding
> slops isn't a nice solution.

If you can't tolerate slop, this is a problem.

The only 100% solution that I could think of to this problem is to
re-index the entire stream (with a very large position gap inbetween)
for each variant.

"the x men story"
"the xmen story"

Problems:
1) combinatorial explosion very quickly (not practical at all)
2) messes up idfs pretty badly

Phrase slop is the easiest workaround, esp when you wanted slop anyway.

> Would it be a solution, to take the concatenated synonyms to both
> Positions? Or are there any drawbacks with this?

I considered that too... but it increases false matches, and it still
doesn't fix many phrase queries.

> 1: [the]
> 2: [x] [xmen]
> 3: [men] [xmen]
> 4: [story]

While "the xmen" and "xmen story" will now both match, "the xmen
story" will still fail to match.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message