lucene-dev mailing list archives

From "Lance Norskog (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-3653) Custom bigramming filter for to handle Smart Chinese edge cases
Date Mon, 24 Sep 2012 02:47:08 GMT

    [ https://issues.apache.org/jira/browse/SOLR-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461572#comment-13461572 ]

Lance Norskog commented on SOLR-3653:
-------------------------------------

I ran some counts on a database of 300k Chinese legal documents. The index has a unigram field
based on the StandardAnalyzer, a bigram field based on the CJK analyzer, and a Smart Chinese
field. I pulled the terms for all of them and filtered for Chinese ideograms only. These are
text unigrams, with 

* The unigram field had 55k terms. 
* The bigram field had 1.8 million terms. 
* The Smart Chinese field had 417k terms:
** unigrams: 9.6k
** bigrams: 40k
** trigrams: 14.6k
** four: 5.6k
** five: 300
** six: 70
** seven: 51
** eight: 19
** nine: 7
** ten: 2
** eleven: 3
** twelve: 2
** thirteen: 3
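
A breakdown like the one above could be reproduced along these lines. This is only a sketch,
not the script actually used: it assumes "Chinese ideogram" means the CJK Unified Ideographs
block (U+4E00 to U+9FFF), and the function names are made up for illustration.

```python
from collections import Counter


def is_cjk_ideograph(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts;
    # the extension blocks are ignored in this sketch.
    return "\u4e00" <= ch <= "\u9fff"


def length_histogram(terms):
    """Count ideogram-only terms by their length in characters."""
    counts = Counter()
    for term in terms:
        # Skip empty terms and anything containing a non-ideogram character.
        if term and all(is_cjk_ideograph(ch) for ch in term):
            counts[len(term)] += 1
    return counts


# Terms would come from the index's term dictionary; these are stand-ins.
sample_terms = ["中", "中国", "法律", "abc", "中a", "人民法院"]
histogram = length_histogram(sample_terms)
for length in sorted(histogram):
    print(length, histogram[length])
```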

The 4+ ngrams are essentially parsing failures by the Smart Chinese tokenizer. I have attached
three files with Google Translate versions of the longer ngrams. 'translations_first_500.trigrams.txt'
and 'translations_first_500.quad.txt' contain the most common 3-ideogram and 4-ideogram terms.
They include many phrases which should have been split. 'translations_450.five2thirteen.txt'
contains 450 ngrams which are 5 ideograms or longer. The longer ones include many formal geographical
names, government organization names and official propaganda phrases, and more so as the length increases.


For this corpus, based on the above breakdown and on other experience:
# The CJK bigram field is a waste of disk space. Bigrams introduce a ton of noise.
# Unigrams might work well if you only do strict phrase searches. But searching for A, B,
and C separately when given ABC is useless.
# If you search for raw country names, Smart Chinese lets you down when the document uses
the formal name. 

Smart Chinese really does need to be split into bigrams. To cut bigram noise, I would take
the database of bigrams that it generates, and then use these to guide splitting 3+ grams
into bigrams. That is, if it ever generates AB, then the splitter turns ABCD into (AB CD).
BC would be considered 'bigram noise'. Similarly, if Smart Chinese generates EF, then DEFG
would become (D EF G).
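
The guided split could look roughly like the sketch below. It is one interpretation rather
than the attached patch: it scans left to right, emits a bigram whenever Smart Chinese has
produced it somewhere in the corpus, and emits whatever is left between known bigrams unchanged.
The greedy left-to-right choice and the handling of leftovers are assumptions picked so that
both examples above come out as written. The function name is hypothetical.

```python
def split_with_known_bigrams(term, known_bigrams):
    """Split a 3+ ideogram term using bigrams already seen as whole terms."""
    if len(term) < 3:
        return [term]  # unigrams and bigrams are left alone
    pieces = []
    pending = ""  # characters not covered by any known bigram so far
    i = 0
    while i < len(term):
        candidate = term[i:i + 2]
        if len(candidate) == 2 and candidate in known_bigrams:
            if pending:
                pieces.append(pending)
                pending = ""
            pieces.append(candidate)
            i += 2
        else:
            pending += term[i]
            i += 1
    if pending:
        pieces.append(pending)
    return pieces


print(split_with_known_bigrams("ABCD", {"AB"}))  # ['AB', 'CD']
print(split_with_known_bigrams("DEFG", {"EF"}))  # ['D', 'EF', 'G']
```

A term with no known bigram inside it comes back unsplit, so some other rule would still be
needed for those.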

However, a good fallback would be to have two fields, Smart Chinese and unigrams, with Smart
Chinese boosted upwards and unigrams only with strict phrase search. With a high term count,
bigrams are not helpful. You might even want to search Smart Chinese first, and then do unigram
loose phrase search only if the recall is too low or the user is unhappy with the Smart Chinese
results.
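
The two-stage variant could be sketched as follows. Everything named here is hypothetical:
the field names, the 'search' callable (standing in for whatever client actually runs the
query), the slop value, and the 'min_hits' threshold used as a crude stand-in for "recall
is too low".

```python
def search_with_fallback(query, search, min_hits=10, slop=2):
    """Query the Smart Chinese field first; fall back to a loose unigram
    phrase search only when too few documents come back.

    'search' is a hypothetical callable: search(field, query, phrase_slop)
    returns a list of document ids. phrase_slop=None means a plain term
    query, an integer means a phrase query with that much slop.
    """
    hits = search("text_smartcn", query, phrase_slop=None)
    if len(hits) >= min_hits:
        return hits
    # Too few results: add unigram phrase matches, keeping the Smart Chinese
    # hits first so they stay ranked above the fallback matches.
    seen = set(hits)
    extra = search("text_unigram", query, phrase_slop=slop)
    return hits + [doc for doc in extra if doc not in seen]
```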

                
> Custom bigramming filter for to handle Smart Chinese edge cases
> ---------------------------------------------------------------
>
>                 Key: SOLR-3653
>                 URL: https://issues.apache.org/jira/browse/SOLR-3653
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Lance Norskog
>         Attachments: SmartChineseType.pdf, SOLR-3653.patch
>
>
> The "Smart" Simplified Chinese toolkit in lucene/analysis/smartcn does not work in some
> edge cases. It fails to split certain words which were not part of the dictionary or training
> corpus.
> This patch supplies a bigramming class to handle these occasional mistakes. The algorithm
> creates bigrams out of all "words" longer than two ideograms.

