lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2404) Improve speed of ThaiWordFilter by CharacterIterator, factor out LowerCasing and also fix some bugs (empty tokens stop iteration)
Date Mon, 19 Apr 2010 18:17:52 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858618#action_12858618 ]

Robert Muir commented on LUCENE-2404:
-------------------------------------

This is great. It already more than doubles the speed of this filter on English text...

But this filter has always been cheating with the UnicodeBlock check on charAt(0), as a single
token could be mixed too, e.g. EnglishThaiEnglish.
It also cheats because it doesn't check that the break boundaries delimit words, and not just
spaces or punctuation.

I suppose these two things are not much of a problem if you assume StandardTokenizer,
but they may be a problem for other Tokenizers... It is tricky to figure out how to make it
correct and still as fast as the 'cheating' version.

> Improve speed of ThaiWordFilter by CharacterIterator, factor out LowerCasing and also fix some bugs (empty tokens stop iteration)
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2404
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2404
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>            Reporter: Uwe Schindler
>            Assignee: Robert Muir
>             Fix For: 3.1
>
>         Attachments: LUCENE-2404.patch, LUCENE-2404.patch
>
>
> The ThaiWordFilter creates new Strings out of the term buffer before passing them to the
> BreakIterator, but BreakIterator can take a CharacterIterator and process it directly without
> buffer copying.
> As Java itself does not provide a char[]-based CharacterIterator implementation in java.text,
> we can use the javax.swing.text.Segment class, which operates on a char[] and is even reusable!
> This class is very strange, but it works, is in JDK 1.4+, and is not deprecated.
> The filter also had a bug: it stopped iterating tokens when an empty token occurred.
> Also, the lowercasing for non-Thai words was removed and put into the Analyzer by adding
> LowerCaseFilter.
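
As an illustration of the approach described in the issue, the following is a minimal sketch
of pointing a reusable javax.swing.text.Segment at a char[] and handing it to a BreakIterator,
so that no String is created per token. It is not the attached patch; the class name, method
name, and parameters are hypothetical, and the offset is assumed to be 0 for simplicity.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.Segment;

// Hypothetical helper, not part of Lucene or the LUCENE-2404 patch.
final class SegmentBreakSketch {

    /**
     * Returns the {start, end} pairs that the breaker finds in buf[0, len),
     * reusing the given Segment instead of copying the chars to a String.
     */
    static List<int[]> boundaries(BreakIterator breaker, Segment reusable,
                                  char[] buf, int len) {
        // Segment exposes its state as public fields, so it can be re-pointed
        // at a new buffer without allocating a new object.
        reusable.array = buf;
        reusable.offset = 0;
        reusable.count = len;
        breaker.setText(reusable); // Segment implements CharacterIterator

        List<int[]> spans = new ArrayList<>();
        int start = breaker.first();
        for (int end = breaker.next(); end != BreakIterator.DONE;
             start = end, end = breaker.next()) {
            spans.add(new int[] {start, end});
        }
        return spans;
    }
}
```

A caller would create the breaker and the Segment once, e.g. with
BreakIterator.getWordInstance(new Locale("th")) and new Segment(), and reuse both for every
token. Note that a word-instance BreakIterator also reports whitespace and punctuation runs as
spans, so a caller still has to decide which spans are real words.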

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

