lucene-dev mailing list archives

From "Karl Wettin (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller then minimum gram size.
Date Tue, 02 Jun 2009 21:53:07 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715712#action_12715712 ]

Karl Wettin commented on LUCENE-1491:
-------------------------------------

Although you have a valid point, I'd like to argue this a bit.

My arguments will probably be considered silly by some. Perhaps it's just me who uses ngrams for
something completely different from what everybody else does, but here we go: adding the feature
as suggested by this patch is, in my opinion, fixing the symptoms of a bad use of character
ngrams.

BOL, EOL, whitespace and punctuation are all valid parts of character ngrams that can increase
precision/recall quite a bit. EdgeNGrams could sort of be considered such data too. So what
I'm saying here is that I consider your example a bad use of character ngrams, and that the whole
sentence should have been grammed up. So in the case of 4-grams the output would end up as:
"to b", "o be", " be ", "be o", and so on. Perhaps even "$to ", "to b", "o be", and so on.
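
To make the 4-gram example concrete, here is a minimal plain-Java sketch of gramming up a whole sentence. It is not a Lucene TokenFilter and not part of the patch; the class name SentenceNGrams, the method charNGrams and the use of "$" as the marker at both ends of the line are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Lucene API: character n-grams over a whole
// sentence, keeping whitespace and optional boundary markers in the grams.
final class SentenceNGrams {

    // Collapses whitespace runs to a single space, optionally wraps the text
    // in "$" markers (an assumed BOL/EOL symbol), then returns every window
    // of n characters, left to right.
    public static List<String> charNGrams(String text, int n, boolean markBoundaries) {
        String normalized = text.trim().replaceAll("\\s+", " ");
        if (markBoundaries) {
            normalized = "$" + normalized + "$";
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= normalized.length(); i++) {
            grams.add(normalized.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Starts with "to b", "o be", " be ", "be o"
        System.out.println(charNGrams("to be or not to be", 4, false));
        // Starts with "$to ", "to b", "o be"
        System.out.println(charNGrams("to be or not to be", 4, true));
    }
}
```

Note that input shorter than the gram size simply yields no grams here, and the short words "to" and "be" still contribute through the grams that span their neighbours.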

Supporting what I suggest will of course mean quite a bit more work: a whole new filter
that also does input text normalization, such as removing double spaces and whatnot. That
will probably not be implemented anytime soon. But adding the features in the patch to the
filter actually means that this use is endorsed by the community, and I'm not sure that's a
good idea. I thus think it would be better to have some sort of secondary filter that does the
exact same thing as the patch.

Perhaps I should leave this issue alone and do some more work on LUCENE-1306.

> EdgeNGramTokenFilter stops on tokens smaller then minimum gram size.
> --------------------------------------------------------------------
>
>                 Key: LUCENE-1491
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1491
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.4, 2.4.1, 2.9, 3.0
>            Reporter: Todd Feak
>            Assignee: Otis Gospodnetic
>             Fix For: 2.9
>
>         Attachments: LUCENE-1491.patch
>
>
> If a token is encountered in the stream that is shorter in length than the min gram size,
> the filter will stop processing the token stream.
> Working up a unit test now, but may be a few days before I can provide it. Wanted to
> get it in the system.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

