lucene-dev mailing list archives

From "Lance Norskog (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-3653) Support Smart Simplified Chinese in Solr - include clean-up bigramming filter
Date Fri, 20 Jul 2012 16:35:35 GMT

    [ https://issues.apache.org/jira/browse/SOLR-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419326#comment-13419326 ]

Lance Norskog commented on SOLR-3653:
-------------------------------------

bq. Actually there are factories in contrib/analysis-extras.
You're right, I was thinking of a previous project.
bq. I am not sure on this: if someone wants to mix an n-gram technique with a word model,
they can just use two fields? If they want to limit the n-gram field to only longer terms,
they should use LengthFilter.

Is this the design?
{code}
Word-based field: 
    SmartChineseWordTokenFilter -> LengthFilter accept 1-3 letters
Bigram-based field:
    SmartChineseWordTokenFilter -> LengthFilter accept 4 or longer -> Chinese-only bigrams
{code}
This works if the user searches for simple words, as on a consumer site. On a legal document
site, people block-copy 60-word document titles and expect to find the matching title first
on the list. This requires a phrase search in which zero variation in position gives the exact
title. If the two classes of terms are in two different fields, will that work? I did not
think parsers did 

Also, this design needs to allow for mixed-language text, such as year numbers and English
words. Are the existing Lucene filters flexible enough to do this?
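
A minimal sketch of what "Chinese-only" bigramming could mean for a mixed-language token stream. It is plain Python, not Lucene code: the function names, the hand-segmented token list, and the choice to treat only the basic CJK Unified Ideographs block as Chinese are all assumptions made for illustration.

```python
def is_han(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts as Chinese.
    return "\u4e00" <= ch <= "\u9fff"


def chinese_only_bigrams(tokens, min_len=4):
    """Bigram all-Han tokens of min_len or more characters; pass the rest through."""
    out = []
    for tok in tokens:
        if len(tok) >= min_len and all(is_han(c) for c in tok):
            # Overlapping character bigrams of the long Chinese token.
            out.extend(tok[i:i + 2] for i in range(len(tok) - 1))
        else:
            # Year numbers, English words, and short Chinese words are untouched.
            out.append(tok)
    return out


# Hypothetical, hand-segmented input standing in for a segmenter's output.
mixed = ["2012", "年", "个人所得税", "Solr"]
result = chinese_only_bigrams(mixed)
```

Here `result` keeps "2012", "年", and "Solr" as they are, and replaces only the five-character Chinese token with its four bigrams.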

bq. The word you are upset about (中华人民共和国) is in the smartcn dictionary. As
I understand, this word basically means "PRC". This is a single concept and makes sense as
an indexing unit. Why do we care how long it is in characters?

Because parts of it are also words, which should be searchable. Here are two more failed words:
"个人所得税" (personal/individual "income tax") and "社会保险" (National Congress,
political body). I can imagine Congress would be in the dictionary, but "personal income tax"?
If you search for income tax ("所得税"), you will not find personal income tax. This points
up a flaw: the bigram trick will not find this trigram.
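
The miss can be sketched in a few lines of Python. This is a toy model, not the smartcn code: it assumes the segmenter emits "个人所得税" and "所得税" each as a single token, and that tokens are routed by length into a word field (1-3 characters) and a bigram field (4 or more), as in the design above. All function and variable names are hypothetical.

```python
def bigrams(term):
    # Overlapping character bigrams of one term.
    return [term[i:i + 2] for i in range(len(term) - 1)]


def analyze(tokens):
    """Route tokens by length into the two hypothetical fields."""
    word_field = [t for t in tokens if 1 <= len(t) <= 3]
    bigram_field = [b for t in tokens if len(t) >= 4 for b in bigrams(t)]
    return word_field, bigram_field


def matches(query_terms, indexed_terms):
    # An empty query matches nothing; otherwise every term must be indexed.
    return bool(query_terms) and all(t in indexed_terms for t in query_terms)


# Assumed segmenter output: each string comes back as one dictionary word.
doc_words, doc_bigrams = analyze(["个人所得税"])
query_words, query_bigrams = analyze(["所得税"])

word_hit = matches(query_words, doc_words)        # trigram vs. empty word field
bigram_hit = matches(query_bigrams, doc_bigrams)  # empty query vs. bigram field

# For comparison: bigramming the query regardless of its length.
unrouted_hit = matches(bigrams("所得税"), doc_bigrams)
```

Under these assumptions both `word_hit` and `bigram_hit` are false: the document contributes nothing to the word field, and the three-character query contributes nothing to the bigram field. `unrouted_hit` is true, which suggests the miss comes from the length-based routing rather than from bigramming itself.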

How do you know what's in the dictionary? The files are in a .mem format. I can't find a main
program for them.



                
> Support Smart Simplified Chinese in Solr - include clean-up bigramming filter
> -----------------------------------------------------------------------------
>
>                 Key: SOLR-3653
>                 URL: https://issues.apache.org/jira/browse/SOLR-3653
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Lance Norskog
>         Attachments: SOLR-3653.patch, SmartChineseType.pdf
>
>
> The "Smart" Simplified Chinese toolkit in lucene/analysis/smartcn has no Solr factories.
> Also, since it is a statistical algorithm, it is not perfect.
> This patch supplies factories and a schema.xml type for the existing Lucene Smart Chinese
> implementation, and includes a "fixup" class to handle the occasional mistake made by the
> Smart Chinese implementation.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

