lucene-dev mailing list archives

From "Michael Semb Wever (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1380) Patch for ShingleFilter.coterminalPositionIncrement
Date Wed, 10 Sep 2008 16:18:45 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629846#action_12629846 ]

Michael Semb Wever commented on LUCENE-1380:
--------------------------------------------

I suspected as much regarding the option name, but "coterminal" is a word I haven't used
since high school.

> I'm -1 on the patch in its current form. If rewritten to modify the position increment
> only for those shingles that begin at the same word, I'd be +1 (assuming it works and is
> tested appropriately).

As I said in the thread, your suggestion does not work.
Setting each shingle to have positionIncrement=1, so as to avoid the MultiPhraseQuery
in favour of a plain PhraseQuery, makes sense but does not work. And not phrasing the query
doesn't invoke the ShingleFilter properly.

> The ShingleFilter appears to only work, at least for me, on phrases.
> I would think this correct as each shingle is in fact a sub-phrase to the larger original
> phrase.

If this is the case, i.e. the ShingleFilter works on a phrase as a whole entity, and the
shingles from each term in the phrase are related because they all come from the one phrase,
then does it not make sense to allow them all to be placed at the same position?

For example in the current implementation, in the phrase "abcd efgh ijkl" it is the first
term "abcd" that is responsible for generating the shingles "abcd efgh ijkl" and "abcd efgh".
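
To make the example concrete, here is a minimal sketch in plain Python (not Lucene code) that groups shingles by the word they start from. The function name `shingles_by_start` and its parameters are hypothetical, and the default maximum shingle size of 3 is an assumption chosen to match the three-word phrase.

```python
def shingles_by_start(words, max_size=3, output_unigrams=True):
    """Group shingles by the word that begins them.

    Illustrative model only; it does not call or reproduce ShingleFilter.
    """
    min_size = 1 if output_unigrams else 2
    groups = []
    for i, word in enumerate(words):
        # Every shingle of n words that starts at position i and fits in the phrase.
        group = [
            " ".join(words[i:i + n])
            for n in range(min_size, max_size + 1)
            if i + n <= len(words)
        ]
        groups.append((word, group))
    return groups


if __name__ == "__main__":
    for start, group in shingles_by_start("abcd efgh ijkl".split()):
        print(start, "->", group)
```

In this model "abcd" owns both "abcd efgh" and "abcd efgh ijkl". Attributing those shingles to a later word instead would be an equally valid grouping of the same strings.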

What says that these shingles couldn't be generated from the "efgh" term (or "ijkl" for the
former shingle) in an alternative implementation?
Why presume that it's in the user's interest to force this separation based on where
this implementation chooses to put its shingles?

If this isn't lost-in-the-bush logic, do you have a suggestion for a more appropriate option
name for the current solution?

> Patch for ShingleFilter.coterminalPositionIncrement
> ---------------------------------------------------
>
>                 Key: LUCENE-1380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Michael Semb Wever
>             Fix For: 2.4
>
>         Attachments: LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same position.
> Default is to place each shingle at the same position as the unigram (or first shingle
> if outputUnigrams=false). That is, each coterminal token has positionIncrement=1 and every
> other token a positionIncrement=0.
> This leads to a MultiPhraseQuery where at least one word/shingle must be matched from
> each word/token. This is not always desired.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing list thread.
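
The two placement schemes in the description above can be modelled roughly as follows. This is a plain-Python sketch of one reading of the issue text, not the logic in LUCENE-1380.patch. The names `assign_increments`, `absolute_positions` and the `all_same_position` flag are hypothetical, and the sketch assumes the first token emitted for each starting word is the one that receives positionIncrement=1 by default.

```python
def assign_increments(groups, all_same_position=False):
    """Attach a position increment to each token.

    groups: one list of tokens per starting word, in emission order.
    Default model: the first token of each group gets 1, the rest get 0.
    all_same_position model: only the very first token gets 1.
    """
    tokens = []
    for group in groups:
        for k, text in enumerate(group):
            if all_same_position:
                inc = 0 if tokens else 1
            else:
                inc = 1 if k == 0 else 0
            tokens.append((text, inc))
    return tokens


def absolute_positions(tokens):
    """Turn increments into 0-based positions by keeping a running sum."""
    pos = -1
    out = []
    for text, inc in tokens:
        pos += inc
        out.append((text, pos))
    return out


GROUPS = [
    ["abcd", "abcd efgh", "abcd efgh ijkl"],
    ["efgh", "efgh ijkl"],
    ["ijkl"],
]

if __name__ == "__main__":
    print(absolute_positions(assign_increments(GROUPS)))
    print(absolute_positions(assign_increments(GROUPS, all_same_position=True)))
```

Under the default model the tokens occupy three distinct positions, so a phrase query has to match one alternative at each of them. Under the second model every token shares a single position.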

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.