lucene-dev mailing list archives

From "Otis Gospodnetic (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-1003) [PATCH] RussianAnalyzer's tokenizer skips numbers from input text,
Date Sun, 17 Feb 2008 08:22:34 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated LUCENE-1003:
-------------------------------------

    Lucene Fields: [New, Patch Available]  (was: [New])
         Assignee: Otis Gospodnetic

TUSUR OpenTeam: would it be possible to get a unit test, too?


> [PATCH] RussianAnalyzer's tokenizer skips numbers from input text,
> ------------------------------------------------------------------
>
>                 Key: LUCENE-1003
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1003
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.2
>            Reporter: TUSUR OpenTeam
>            Assignee: Otis Gospodnetic
>         Attachments: RussianCharsets.java.patch
>
>
> RussianAnalyzer's tokenizer skips numbers in the input text, so the resulting token stream
> is missing them. The problem can be solved by adding the digits to RussianCharsets.UnicodeRussian.
> See the test case below for details.
> {code:title=TestRussianAnalyzer.java|borderStyle=solid}
> import java.io.IOException;
> import java.io.Reader;
> import java.io.StringReader;
>
> import junit.framework.TestCase;
>
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.ru.RussianAnalyzer;
> import org.apache.lucene.analysis.ru.RussianCharsets;
>
> public class TestRussianAnalyzer extends TestCase {
>   // JUnit 3 creates a fresh TestCase instance per test method,
>   // so each test reads from its own unconsumed reader.
>   Reader reader = new StringReader("text 1000");
>   // test FAILS: with the default charset no token is produced for "1000"
>   public void testStemmer() {
>     testAnalyzer(new RussianAnalyzer());
>   }
>   // test PASSES: the charset extended with digits keeps "1000"
>   public void testFixedRussianAnalyzer() {
>     testAnalyzer(new RussianAnalyzer(getRussianCharSet()));
>   }
>   // Expects "text" as the first token and requires that a second token exists.
>   private void testAnalyzer(RussianAnalyzer analyzer) {
>     try {
>       TokenStream stream = analyzer.tokenStream("text", reader);
>       assertEquals("text", stream.next().termText());
>       assertNotNull(stream.next());
>     } catch (IOException e) {
>       fail(e.getMessage());
>     }
>   }
>   // Returns a copy of RussianCharsets.UnicodeRussian with '0'..'9' appended.
>   private char[] getRussianCharSet() {
>     int length = RussianCharsets.UnicodeRussian.length;
>     final char[] russianChars = new char[length + 10];
>     System.arraycopy(RussianCharsets.UnicodeRussian, 0, russianChars, 0, length);
>     for (char digit = '0'; digit <= '9'; digit++) {
>       russianChars[length++] = digit;
>     }
>     return russianChars;
>   }
> }
> {code} 
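
To illustrate why extending the charset changes the token stream, here is a minimal sketch that uses no Lucene classes at all. It assumes a character belongs to a token when it is either a letter or a member of the configured charset. That is a reading of the report above, not something this message confirms about RussianAnalyzer's tokenizer. The class `CharsetTokenizerSketch` and its methods are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a charset-driven tokenizer; this is NOT Lucene code.
class CharsetTokenizerSketch {

    // Splits text into maximal runs of characters accepted by isTokenChar.
    static List<String> tokenize(String text, char[] extraChars) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (isTokenChar(c, extraChars)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A non-token character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Assumption: letters always count; anything else must be listed in extraChars.
    static boolean isTokenChar(char c, char[] extraChars) {
        if (Character.isLetter(c)) {
            return true;
        }
        for (char extra : extraChars) {
            if (c == extra) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Without digits in the charset, "1000" is treated as a separator run.
        System.out.println(tokenize("text 1000", new char[0]));
        // With digits added, "1000" survives as its own token.
        System.out.println(tokenize("text 1000", "0123456789".toCharArray()));
    }
}
```

Under these assumptions the first call yields only `text`, while the second yields `text` and `1000`, which mirrors the failing and passing tests in the report.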

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

