lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Rajesh Munavalli" <raje...@dessci.com>
Subject RE: intra-word delimiters
Date Tue, 16 Aug 2005 04:08:52 GMT
How about this....

(1) Follow the first three steps mentioned by Marvin
(2) For step 4 instead of trying to come up with all concatenation, create concatenations
of 3 words

For example an entity of length n, W1 W2 W3....Wn will have the index positions as follows

Index Position   Value
0                W1W2W3
                 W1W2
                 W1
1                W2W3W4
                 W2W3
                 W4

and so on...

(3) At the time of querying follow the steps 1 and 2 to create the query terms with respective
BOOST levels as follows....
QUERY: Power Shot SD 500
PROCESSED QUERY: power shot sd 500

Query Terms ORed:
powershotsd^3, powershot^2, power
shotsd500^3, shotsd^2, shot
sd500^2, sd
500

QUERY: CanonPowerShotSD500
PROCESSED QUERY: canonpowershotsd 500
Query Terms ORed:
canonpowershotsd500^2, canonpowershotsd

QUERY: Canon-Powershot-SD 500
PROCESSED QUERY: canon powershot sd 500
Query Terms ORed:
canonpowershotsd^3, canonpowershot^2, canon
powershotsd500^3, powershotsd^2, powershot
sd500^2, sd
500

Hopefully 3 grams should be good enough to capture enough information to retrieve useful documents.
You can adjust the boost levels and assess the performance.

Hope it helps...

Rajesh Munavalli


-----Original Message-----
From: Yonik Seeley [mailto:yseeley@gmail.com]
Sent: Mon 8/15/2005 7:47 PM
To: java-user@lucene.apache.org
Subject: Re: intra-word delimiters
 
That was the plan, but step (4) really seems problematic.

- term expansion this way can lead to a lot of false matches
- phrase queries with many bordering words break
- settingt term positions such that phrase queries work on all combos
of subwords is non-trivial.

It seems like a better approach might be a new query type that can
handle things like this.

As an example, consider a-b-c-d (4 subwords)... one way of indexing
the tokens would be:

Pos0: a
Pos1: b,  ab,  a
Pos2: c,  bc,  abc,  cd
Pos3: d,  abcd

There are only 10 uniqe tokens n(n/2+1/2), but I needed to index 11 in
order to satisfy all possible phrase queries of catenated subwords. 
Notice how many other things will now match though (ac, aab,
aababcabcd, etc).

In addition, any algorithm I come up with to generate those term
position uses even more terms than the hand-generated one above.

By using index expansion in this manner, we have lost info about the
original ordering.  A new type of fuzzy phrase query seems like it
might be able to do a better job in many circumstances.

-Yonik


On 8/15/05, Marvin Humphrey <marvin@rectangular.com> wrote:
> 1) Lowercase.
> 2) Convert non-alphanumeric characters to spaces.
> 3) Introduce a space at every boundary between a letter and a number.
> 4) concatenate all 1, 2, 3 .. n term combinations and index them.
> 5) Don't stem.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org




Mime
View raw message