lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2407) make CharTokenizer.MAX_WORD_LEN parametrizable
Date Wed, 21 Apr 2010 14:39:49 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859373#action_12859373 ]

Uwe Schindler commented on LUCENE-2407:
---------------------------------------

This is also a problem for some Asian languages. If ThaiAnalyzer used CharTokenizer, very
long passages could get lost, as ThaiWordFilter would not get the complete string (Thai is
not tokenized by the tokenizer, but later in the filter).

This also applies to StandardTokenizer; maybe we should set a good default when analyzing
Thai text (ThaiAnalyzer should init StandardTokenizer with a large/infinite value).
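
Not from the original comment, but a minimal sketch of that idea against the 3.0 APIs
(the class name LongTokenThaiAnalyzer is made up; it assumes the contrib
ThaiWordFilter(TokenStream) constructor and StandardTokenizer.setMaxTokenLength()):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.th.ThaiWordFilter;
import org.apache.lucene.util.Version;

// Hypothetical analyzer (not the real ThaiAnalyzer): raise the tokenizer's
// max token length so long unsegmented Thai runs reach ThaiWordFilter intact.
public class LongTokenThaiAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    StandardTokenizer tokenizer = new StandardTokenizer(Version.LUCENE_30, reader);
    // the default limit is 255 chars; longer tokens never reach the filter
    tokenizer.setMaxTokenLength(Integer.MAX_VALUE);
    // ThaiWordFilter does the actual Thai word segmentation afterwards
    return new ThaiWordFilter(tokenizer);
  }
}

As far as I can tell, StandardTokenizer skips over-long tokens rather than truncating
them, which is how an unsegmented Thai run past the limit gets lost before the filter
ever sees it.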

> make CharTokenizer.MAX_WORD_LEN parametrizable
> ----------------------------------------------
>
>                 Key: LUCENE-2407
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2407
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 3.0.1
>            Reporter: javi
>            Priority: Minor
>             Fix For: 3.1
>
>
> as discussed here http://n3.nabble.com/are-long-words-split-into-up-to-256-long-tokens-tp739914p739914.html
> it would be nice to be able to parametrize that value.
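
Not part of the issue text, but a hypothetical sketch of what that could look like: a
CharTokenizer-style tokenizer whose limit is a constructor parameter instead of the
hard-coded MAX_WORD_LEN = 255 (the class name and the maxWordLen parameter are invented
for illustration, and offset handling is omitted for brevity):

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Hypothetical whitespace tokenizer with a configurable max word length.
public final class ConfigurableLengthWhitespaceTokenizer extends Tokenizer {

  private final int maxWordLen;
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);

  public ConfigurableLengthWhitespaceTokenizer(Reader input, int maxWordLen) {
    super(input);
    this.maxWordLen = maxWordLen;
  }

  // token characters are simply non-whitespace here
  private boolean isTokenChar(char c) {
    return !Character.isWhitespace(c);
  }

  @Override
  public boolean incrementToken() throws IOException {
    clearAttributes();
    int ch;
    // skip any leading non-token characters
    while ((ch = input.read()) != -1 && !isTokenChar((char) ch)) {
      // discard
    }
    if (ch == -1) {
      return false; // end of input, no further tokens
    }
    // collect token characters up to the configured limit
    StringBuilder buffer = new StringBuilder();
    do {
      buffer.append((char) ch);
      if (buffer.length() >= maxWordLen) {
        break; // emit this chunk; the rest becomes the next token
      }
    } while ((ch = input.read()) != -1 && isTokenChar((char) ch));
    termAtt.setTermBuffer(buffer.toString());
    return true;
  }
}

The existing CharTokenizer subclasses (WhitespaceTokenizer, LetterTokenizer, etc.) could
then pass such a limit through their constructors instead of relying on the private
constant.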

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

