lucene-dev mailing list archives

From "Steven Rowe (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2400) ShingleFilter: don't output all-filler shingles/unigrams; also, convert from TermAttribute to CharTermAttribute
Date Sun, 18 Apr 2010 22:03:49 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858353#action_12858353 ]

Steven Rowe commented on LUCENE-2400:
-------------------------------------

Uwe told me on #lucene-dev that the specialized CharTermAttribute methods are not invoked
unless they are also declared on the interface.  I had not added them there, so the numbers
in my previous post are meaningless.
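
A minimal sketch of why an overload declared only on the implementation class is skipped
when the call goes through the interface type: Java picks the overload at compile time from
the static type of the reference.  All names below (TermSink, TermSinkImpl) are hypothetical
and chosen for illustration; this is not the CharTermAttribute API.

```java
public class OverloadSketch {
    // Hypothetical interface: declares only the generic method.
    interface TermSink {
        String append(CharSequence s);
    }

    // Hypothetical implementation: adds a String specialization the interface lacks.
    static class TermSinkImpl implements TermSink {
        @Override
        public String append(CharSequence s) { return "generic"; }

        // Visible only when the static type of the reference is TermSinkImpl.
        public String append(String s) { return "specialized"; }
    }

    static String viaInterface() {
        TermSink sink = new TermSinkImpl();
        // Static type is TermSink, so the compiler binds to append(CharSequence).
        return sink.append("abc");
    }

    static String viaImpl() {
        TermSinkImpl sink = new TermSinkImpl();
        // Static type is TermSinkImpl, so append(String) is the most specific match.
        return sink.append("abc");
    }

    public static void main(String[] args) {
        System.out.println(viaInterface()); // prints "generic"
        System.out.println(viaImpl());      // prints "specialized"
    }
}
```

Declaring the specialization on the interface as well makes it reachable from code that
only holds an interface-typed reference.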

So, I applied LUCENE-2401 to add the correct form of the specializations, then re-ran the
shingle alg, and it looks like there is no longer a penalty for using the shorthand form Uwe
suggested.  Here are the numbers:

JAVA:
java version "1.6.0_13"
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)

OS:
cygwin
WinVista Service Pack 2
Service Pack 26060022202561

||Max Shingle Size||Unigrams?||Unpatched||Patched||StandardAnalyzer||Improvement||
|2|no|3.21s|3.31s|2.12s|-8.3%|
|2|yes|3.40s|3.54s|2.12s|-9.8%|
|4|no|4.17s|4.57s|2.12s|-16.2%|
|4|yes|4.33s|4.75s|2.12s|-15.9%|
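
The message does not say how the Improvement column is computed.  The sketch below is an
assumption inferred from the table: subtract the StandardAnalyzer time from both timings,
then express the difference relative to the patched net time.  With the rounded times shown
above, this lands within about 0.1 percentage point of each listed value; the small
remainder would be expected if the column was computed from unrounded timings.

```java
public class ImprovementSketch {
    // Assumed formula, inferred from the table and not stated in the message:
    // net time = total time - StandardAnalyzer time,
    // improvement = (unpatched net - patched net) / patched net, as a percentage.
    static double improvementPercent(double unpatched, double patched, double standardAnalyzer) {
        double unpatchedNet = unpatched - standardAnalyzer;
        double patchedNet = patched - standardAnalyzer;
        return 100.0 * (unpatchedNet - patchedNet) / patchedNet;
    }

    public static void main(String[] args) {
        // {unpatched, patched} in seconds, copied from the table rows above.
        double[][] rows = { {3.21, 3.31}, {3.40, 3.54}, {4.17, 4.57}, {4.33, 4.75} };
        for (double[] row : rows) {
            System.out.printf("%.1f%%%n", improvementPercent(row[0], row[1], 2.12));
        }
    }
}
```

A negative value under this reading means the patched ShingleFilter spent more net time
than the unpatched one.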


> ShingleFilter: don't output all-filler shingles/unigrams; also, convert from TermAttribute
> to CharTermAttribute
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2400
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2400
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 3.0.1
>            Reporter: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2400.patch, LUCENE-2400.patch, LUCENE-2400.patch
>
>
> When the input token stream to ShingleFilter has position increments greater than one,
> filler tokens are inserted for each position for which there is no token in the input token
> stream.  As a result, unigrams (if configured) and shingles can be filler-only.  Filler-only
> output tokens make no sense - these should be removed.
> Also, because TermAttribute has been deprecated in favor of CharTermAttribute, the patch
> will also convert TermAttribute usages to CharTermAttribute in ShingleFilter.
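
The filler behavior described in the issue can be illustrated with a small standalone
sketch.  It does not use the Lucene API: the class, the method signature, the "_" filler
string, the single-space separator and the output order are all assumptions made for
illustration.  It fills every skipped position with a filler, builds unigrams and shingles
over the padded sequence, and drops any output made up of fillers only.

```java
import java.util.ArrayList;
import java.util.List;

public class FillerShingleSketch {
    static final String FILLER = "_"; // placeholder chosen for this sketch

    // terms[i] arrives with position increment posIncs[i]; an increment of n > 1
    // means n - 1 positions before the term had no token.
    static List<String> shingles(String[] terms, int[] posIncs, int maxSize, boolean unigrams) {
        // Pad the sequence so every position holds either a term or a filler.
        List<String> slots = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            for (int gap = 1; gap < posIncs[i]; gap++) {
                slots.add(FILLER);
            }
            slots.add(terms[i]);
        }

        List<String> out = new ArrayList<>();
        int minSize = unigrams ? 1 : 2;
        for (int start = 0; start < slots.size(); start++) {
            for (int size = minSize; size <= maxSize && start + size <= slots.size(); size++) {
                List<String> window = slots.subList(start, start + size);
                boolean allFiller = true;
                for (String s : window) {
                    if (!s.equals(FILLER)) { allFiller = false; break; }
                }
                // Skip outputs that contain no real term.
                if (!allFiller) {
                    out.add(String.join(" ", window));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "sentence" has a position increment of 3, so two positions before it are empty.
        System.out.println(shingles(
            new String[] {"please", "divide", "sentence"}, new int[] {1, 1, 3}, 2, true));
        // prints [please, please divide, divide, divide _, _ sentence, sentence]
    }
}
```

Without the all-filler check, the same input would also yield the unigram "_" twice and
the bigram "_ _", which are the outputs the issue proposes to remove.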

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

