lucene-dev mailing list archives

From "Lance Norskog (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-3653) Support Smart Simplified Chinese in Solr - include clean-up bigramming filter
Date Fri, 20 Jul 2012 08:07:34 GMT

    [ https://issues.apache.org/jira/browse/SOLR-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418997#comment-13418997 ]

Lance Norskog commented on SOLR-3653:
-------------------------------------

The SmartChineseWordTokenFilter implements a statistical algorithm (a Hidden Markov Model, to
be exact) which was trained on a corpus of training text. Its purpose is to split text into "words",
which are singles, bigrams and occasionally trigrams of Simplified Chinese ideograms (letters).
It does a very good job, but since it is statistically based it is not perfect. When it fails,
it emits "words" that are 4 or more ideograms long. These are really phrases, and they contain
real words which should be searchable.

The attached PDF of the Analysis page shows the problem. Chinese legal text proved a pathological
case and created a 7-ideogram word. In order to make parts of this text searchable, the 7-letter
phrase has to be broken into n-grams. Unigrams give more recall while bigrams give more precision.


This patch includes a new SmartChineseBigramFilter that takes any words not split by the
WordTokenFilter and creates bigrams from them. The bigrams only span the unsplit phrase. They do
not overlap between two adjoining unsplit phrases. The attached PDF also shows this effect at the
boundary between the first and second unsplit phrases.
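
The bigramming rule described above can be sketched roughly as follows. This is a minimal
illustration in plain Java, not the code from the patch and not a Lucene TokenFilter: the class
name BigramSketch, the method bigramLongTokens and the constant MIN_PHRASE_LENGTH are
hypothetical, and the threshold of 4 is an assumption taken from the "4 or more ideograms"
description. Latin letters stand in for ideograms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: names and threshold are assumptions, not taken from the patch.
final class BigramSketch {
    // Assumption: a token this long (in code points) is an unsplit phrase.
    static final int MIN_PHRASE_LENGTH = 4;

    // Passes short tokens through unchanged and replaces each long token
    // with the overlapping bigrams found inside that token only.
    static List<String> bigramLongTokens(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            // Work on code points so supplementary-plane characters stay intact.
            int[] cps = token.codePoints().toArray();
            if (cps.length < MIN_PHRASE_LENGTH) {
                out.add(token);
                continue;
            }
            // Bigrams never cross into the next token, so adjoining
            // unsplit phrases do not share a bigram.
            for (int i = 0; i + 1 < cps.length; i++) {
                out.add(new String(cps, i, 2));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "CDEFG" and "HIJK" play the role of two adjoining unsplit phrases.
        System.out.println(bigramLongTokens(Arrays.asList("AB", "CDEFG", "HIJK")));
    }
}
```

Running this prints [AB, CD, DE, EF, FG, HI, IJ, JK]. There is no "GH" bigram, because bigrams
are built inside each unsplit phrase and never across the boundary between two of them.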

I am not an expert on the Chinese language or the HMM technology used in the Smart Chinese
toolkit. I created the bigram filter after difficulties attempting to supply a high-quality
search experience for Chinese legal documents. This is a straw-man solution to the problem.
If you know better, please say so and we will iterate.

The patch includes a 'text_zh' field type which includes the bigram filter. The bigram filter
is essential if 'text_zh' is to be the preferred recommendation.
                
> Support Smart Simplified Chinese in Solr - include clean-up bigramming filter
> -----------------------------------------------------------------------------
>
>                 Key: SOLR-3653
>                 URL: https://issues.apache.org/jira/browse/SOLR-3653
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Lance Norskog
>         Attachments: SOLR-3653.patch, SmartChineseType.pdf
>
>
> The "Smart" Simplified Chinese toolkit in lucene/analysis/smartcn has no Solr factories.
> Also, since it is a statistical algorithm, it is not perfect.
> This patch supplies factories and a schema.xml type for the existing Lucene Smart Chinese
> implementation, and includes a "fixup" class to handle the occasional mistake made by the
> Smart Chinese implementation.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

