lucene-java-user mailing list archives

From Yonik Seeley <ysee...@gmail.com>
Subject Re: intra-word delimiters
Date Tue, 16 Aug 2005 02:47:00 GMT
That was the plan, but step (4) really seems problematic.

- term expansion this way can lead to a lot of false matches
- phrase queries with many bordering words break
- setting term positions such that phrase queries work on all combos
of subwords is non-trivial.

It seems like a better approach might be a new query type that can
handle things like this.

As an example, consider a-b-c-d (4 subwords)... one way of indexing
the tokens would be:

Pos0: a
Pos1: b,  ab,  a
Pos2: c,  bc,  abc,  cd
Pos3: d,  abcd

There are only 10 unique tokens, n(n+1)/2 for n = 4, but I needed to
index 11 in order to satisfy all possible phrase queries of catenated
subwords.
Notice how many other things will now match though (ac, aab,
aababcabcd, etc).
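
To make the false matches concrete, here is a minimal sketch in plain Python (not Lucene code). It copies the position table above as written and assumes a phrase query simply requires its terms at consecutive positions. The names POSITIONS, contiguous_catenations and phrase_matches are made up for illustration.

```python
# Position table copied from the message above (a-b-c-d, 4 subwords).
POSITIONS = [
    {"a"},
    {"b", "ab", "a"},
    {"c", "bc", "abc", "cd"},
    {"d", "abcd"},
]


def contiguous_catenations(subwords):
    """All catenations of 1..n adjacent subwords: n(n+1)/2 of them."""
    n = len(subwords)
    return ["".join(subwords[i:j]) for i in range(n) for j in range(i + 1, n + 1)]


def phrase_matches(positions, phrase_terms):
    """True if the terms occur at consecutive positions (assumed phrase semantics)."""
    k = len(phrase_terms)
    return any(
        all(phrase_terms[i] in positions[start + i] for i in range(k))
        for start in range(len(positions) - k + 1)
    )


# Intended queries such as "ab cd" or "a bc d" match ...
print(phrase_matches(POSITIONS, ["ab", "cd"]))
print(phrase_matches(POSITIONS, ["a", "bc", "d"]))
# ... but so do phrases that never occurred in the original text.
print(phrase_matches(POSITIONS, ["a", "ab"]))  # the "aab" case
print(phrase_matches(POSITIONS, ["a", "c"]))   # the "ac" case
```

Under these assumptions all four calls print True, which is the problem: the index can no longer tell a real catenation from an accidental one.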

In addition, any algorithm I come up with to generate those term
positions uses even more terms than the hand-generated one above.

By using index expansion in this manner, we have lost info about the
original ordering.  A new type of fuzzy phrase query seems like it
might be able to do a better job in many circumstances.
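
One way such a query could work, purely as an illustration: index only the original subwords in order, and at query time accept a run of adjacent subwords whose catenation equals the catenation of the query terms. This is a hypothetical sketch in plain Python, not an existing Lucene query type, and catenation_phrase_match is a made-up name.

```python
def catenation_phrase_match(indexed_subwords, query_subwords):
    """Hypothetical check: does some run of adjacent indexed subwords
    catenate to the same string as the catenated query subwords?"""
    target = "".join(query_subwords)
    n = len(indexed_subwords)
    for start in range(n):
        joined = ""
        for end in range(start, n):
            joined += indexed_subwords[end]
            if joined == target:
                return True
            if len(joined) >= len(target):
                break  # this run is already too long or has diverged
    return False


doc = ["a", "b", "c", "d"]  # subwords of a-b-c-d, original order kept
print(catenation_phrase_match(doc, ["ab", "cd"]))  # real catenation
print(catenation_phrase_match(doc, ["a", "ab"]))   # "aab" never occurred
```

Because the original ordering is kept, the first call prints True and the second prints False, without indexing any extra terms.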

-Yonik


On 8/15/05, Marvin Humphrey <marvin@rectangular.com> wrote:
> 1) Lowercase.
> 2) Convert non-alphanumeric characters to spaces.
> 3) Introduce a space at every boundary between a letter and a number.
> 4) concatenate all 1, 2, 3 .. n term combinations and index them.
> 5) Don't stem.
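
For reference, steps 1-4 of the quoted plan can be sketched in plain Python as follows. This is not Lucene analyzer code. It assumes ASCII input and reads "all 1, 2, 3 .. n term combinations" as runs of adjacent subwords. The function names are made up for illustration.

```python
import re


def split_subwords(text):
    """Steps 1-3: lowercase, strip punctuation, split letter/digit boundaries."""
    text = text.lower()                      # step 1
    text = re.sub(r"[^a-z0-9]+", " ", text)  # step 2 (ASCII-only assumption)
    # Step 3: insert a space wherever a letter meets a digit.
    text = re.sub(r"(?<=[a-z])(?=[0-9])|(?<=[0-9])(?=[a-z])", " ", text)
    return text.split()


def catenations(subwords):
    """Step 4: catenate every run of 1..n adjacent subwords."""
    n = len(subwords)
    return ["".join(subwords[i:j]) for i in range(n) for j in range(i + 1, n + 1)]


print(split_subwords("Wi-Fi802.11b"))
print(catenations(split_subwords("a-b-c-d")))
```

Step 5 (no stemming) needs no code. The second call produces the 10 terms discussed above for a-b-c-d, but it says nothing about which positions to assign them, which is where the difficulty lies.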
