lucene-dev mailing list archives

From "Doron Cohen (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-759) Add n-gram tokenizers to contrib/analyzers
Date Thu, 01 Mar 2007 22:49:51 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12477115 ]

Doron Cohen commented on LUCENE-759:
------------------------------------

I have two comments/questions on the n-gram tokenizers:

(1) It seems that only the first 1024 characters of the input are handled and the rest is ignored
(and I think that, as a result, the input stream would be left dangling open).

If you add this test case:

    /**
     * Test that no ngrams are lost, even for really long inputs
     * @throws Exception
     */
    public void testLongerInput() throws Exception {
      int expectedNumTokens = 1024;
      int ngramLength = 2;
      // prepare long string
      StringBuffer sb = new StringBuffer();
      while (sb.length()<expectedNumTokens+ngramLength-1) 
        sb.append('a');
      
      StringReader longStringReader = new StringReader(sb.toString());
      NGramTokenizer tokenizer = new NGramTokenizer(longStringReader, ngramLength, ngramLength);
      int numTokens = 0;
      Token token;
      while ((token = tokenizer.next())!=null) {
        numTokens++;
        assertEquals("aa",token.termText());
      }
      assertEquals("wrong number of tokens",expectedNumTokens,numTokens);
    }

With expectedNumTokens = 1023 it would pass, but any larger number would fail. 
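
A minimal sketch of one possible fix (a sketch only, with illustrative names; it assumes the tokenizer's existing char[1024] buffer field): loop until the Reader is exhausted instead of doing a single read, and close the stream when done:

        // Sketch only: accumulate the entire input rather than the first 1024 chars.
        StringBuffer full = new StringBuffer();
        char[] chars = new char[1024];
        int n;
        while ((n = input.read(chars)) != -1) {
          full.append(chars, 0, n);
        }
        input.close();                   // don't leave the stream dangling open
        String inStr = full.toString();  // build grams over the complete input

(Alternatively, next() could refill the buffer lazily, which would avoid holding the whole input in memory.)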

(2) It seems safer to read the characters like this:
            int n = input.read(chars);
            inStr = new String(chars, 0, n);
(This way we are not counting on String.trim(), which does work here, but worries me.)
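
For robustness, checking the read result would also guard against an empty input, since Reader.read(char[]) returns -1 at end of stream (a sketch, assuming the same variable names):

            int n = input.read(chars);
            if (n == -1) {
              inStr = "";                      // empty input: nothing to tokenize
            } else {
              inStr = new String(chars, 0, n); // use exactly the count read
            }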



> Add n-gram tokenizers to contrib/analyzers
> ------------------------------------------
>
>                 Key: LUCENE-759
>                 URL: https://issues.apache.org/jira/browse/LUCENE-759
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Otis Gospodnetic
>         Assigned To: Otis Gospodnetic
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: LUCENE-759.patch, LUCENE-759.patch, LUCENE-759.patch
>
>
> It would be nice to have some n-gram-capable tokenizers in contrib/analyzers.  Patch coming shortly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

