lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-2404) Improve speed of ThaiWordFilter by CharacterIterator, factor out LowerCasing and also fix some bugs (empty tokens stop iteration)
Date Mon, 19 Apr 2010 18:01:51 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2404:
----------------------------------

    Attachment: LUCENE-2404.patch

New patch, which preserves backwards compatibility via matchVersion. It automatically adds a
LowerCaseFilter in the ctor of ThaiWordFilter, so the behaviour does not change, except for a second bug:
the previous version of the filter did not correctly lowercase a token that contains "ThaiEnglishThai"
text. As the LowerCaseFilter is now plugged in before, such a token is lowercased correctly, so it's a
backwards break.

Since Version 3.1, the LowerCaseFilter is no longer added automatically; instead ThaiAnalyzer plugs
it in before the ThaiWordFilter (I reversed the order compared to the previous patch so that the
order is the same in the deprecated and the current case).

> Improve speed of ThaiWordFilter by CharacterIterator, factor out LowerCasing and also
> fix some bugs (empty tokens stop iteration)
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2404
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2404
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>            Reporter: Uwe Schindler
>            Assignee: Robert Muir
>             Fix For: 3.1
>
>         Attachments: LUCENE-2404.patch, LUCENE-2404.patch
>
>
> The ThaiWordFilter creates new Strings out of the term buffer before passing them to the BreakIterator.
> But BreakIterator can take a CharacterIterator and process it directly without buffer copying.
> As Java itself does not provide a CharacterIterator implementation over a char[] in java.text, we
> can use the javax.swing.text.Segment class, which operates on a char[] and is even reusable!
> This class is very strange, but it works, is in JDK 1.4+, and is not deprecated.
> The filter also had a bug: it stopped iterating tokens when an empty token occurred.
> Also, the lowercasing of non-Thai words was removed and put into the Analyzer by adding LowerCaseFilter.
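
For illustration, here is a minimal sketch of the Segment idea described above, using only JDK classes. It is not the code from the attached patch: the class name SegmentBreakDemo and the method words are hypothetical, and it uses a Locale.ROOT word BreakIterator on plain English text instead of Thai, so it does not depend on Thai break data being available in the JDK.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.swing.text.Segment;

// Hypothetical demo class, not part of Lucene or of the patch.
class SegmentBreakDemo {
    // Both objects are created once and reused for every buffer.
    private final BreakIterator breaker = BreakIterator.getWordInstance(Locale.ROOT);
    private final Segment segment = new Segment();

    /** Returns the words found in buffer[0..length) without creating a String of the whole buffer. */
    List<String> words(char[] buffer, int length) {
        // Segment exposes public fields, so it can be re-pointed at a new char[].
        segment.array = buffer;
        segment.offset = 0;
        segment.count = length;
        // Segment implements CharacterIterator, so BreakIterator reads the array directly.
        breaker.setText(segment);

        List<String> out = new ArrayList<>();
        int start = breaker.first();
        for (int end = breaker.next(); end != BreakIterator.DONE; start = end, end = breaker.next()) {
            // Skip whitespace and punctuation runs; keep spans that start with a letter or digit.
            if (Character.isLetterOrDigit(buffer[start])) {
                out.add(new String(buffer, start, end - start));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SegmentBreakDemo demo = new SegmentBreakDemo();
        char[] first = "Hello brave world".toCharArray();
        System.out.println(demo.words(first, first.length));

        // Reuse the same Segment and BreakIterator with a larger, partially filled buffer.
        char[] second = new char[32];
        "one two".getChars(0, 7, second, 0);
        System.out.println(demo.words(second, 7));
    }
}
```

The only Strings created are the per-word results; the input buffer itself is never copied. A real TokenFilter would presumably write each span into its term attribute rather than build Strings, but that part is outside this sketch.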

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

