lucene-dev mailing list archives

From "Lance Norskog (JIRA)" <>
Subject [jira] [Commented] (SOLR-3653) Custom bigramming filter for to handle Smart Chinese edge cases
Date Mon, 24 Sep 2012 02:49:07 GMT


Lance Norskog commented on SOLR-3653:

Another note: one of the trigrams is the number 15. There are several conventions for writing
integers, including regional quirks, and the Smart Chinese toolkit has no 'number canonicalizer'.
This could be a problem with formal documents: historical and government documents, treaties
and the like.
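
To make the canonicalization concern concrete, here is a rough sketch of what such a step might look like. It is not part of the Smart Chinese toolkit or of the patch; the function name, the character tables, and the choice of conventions (everyday numerals, financial numerals, and plain positional digits) are all assumptions picked for illustration, and it only covers small integers.

```python
# Hypothetical number canonicalizer; illustrative only, not a smartcn API.
DIGITS = {
    "零": 0, "〇": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
    "五": 5, "六": 6, "七": 7, "八": 8, "九": 9,
    # financial ("banker's") forms, assumed here as one regional/formal convention
    "壹": 1, "贰": 2, "叁": 3, "肆": 4, "伍": 5,
    "陆": 6, "柒": 7, "捌": 8, "玖": 9,
}
UNITS = {"十": 10, "拾": 10, "百": 100, "佰": 100, "千": 1000, "仟": 1000}


def canonicalize_number(text):
    """Return the integer value of a small Chinese numeral, or None."""
    if not text or any(ch not in DIGITS and ch not in UNITS for ch in text):
        return None  # not a numeral this sketch understands
    if all(ch in DIGITS for ch in text):
        # Positional style: each character is one decimal digit.
        return int("".join(str(DIGITS[ch]) for ch in text))
    total = 0
    pending = 0  # digit waiting for its unit character
    for ch in text:
        if ch in DIGITS:
            pending = DIGITS[ch]
        else:
            # A bare unit such as a leading "十" counts as 1 x unit.
            total += (pending or 1) * UNITS[ch]
            pending = 0
    return total + pending
```

Under these assumptions, 十五, 一十五, 拾伍 and 一五 would all map to the same integer, so an index could store one canonical form for them.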

> Custom bigramming filter for to handle Smart Chinese edge cases
> ---------------------------------------------------------------
>                 Key: SOLR-3653
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Lance Norskog
>         Attachments: SmartChineseType.pdf, SOLR-3653.patch, translations_450.five2thirteen.txt,
>                      translations_first_500.quad.txt, translations_first_500.trigrams.txt
>
> The "Smart" Simplified Chinese toolkit in lucene/analysis/smartcn does not work in some
> edge cases. It fails to split certain words which were not part of the dictionary or training
> This patch supplies a bigramming class to handle these occasional mistakes. The algorithm
> creates bigrams out of all "words" longer than two ideograms.
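
The bigramming rule described in the issue can be sketched roughly as follows. This is not the code from SOLR-3653.patch (which is a Lucene analysis class); the function name is hypothetical, and the use of overlapping bigrams is an assumption, since the description only says that "words" longer than two ideograms are turned into bigrams.

```python
# Hypothetical sketch of the rule: tokens of up to two characters pass
# through unchanged, longer tokens are replaced by their bigrams.
def bigram_long_tokens(tokens, max_len=2):
    out = []
    for tok in tokens:
        if len(tok) <= max_len:
            out.append(tok)  # short enough, keep the segmenter's output
        else:
            # Assumption: overlapping bigrams (sliding window of width 2).
            out.extend(tok[i:i + 2] for i in range(len(tok) - 1))
    return out
```

For example, if the segmenter emitted the four-ideogram token 中华人民 as a single "word", this sketch would replace it with 中华, 华人 and 人民, while a two-ideogram token such as 我们 would be left alone.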
