lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1728) Move SmartChineseAnalyzer & resources to own contrib project
Date Tue, 21 Jul 2009 09:20:14 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733544#action_12733544 ]

Robert Muir commented on LUCENE-1728:
-------------------------------------

Simon, I agree with you, there is a ton of work to be done. 

I also did not particularly like my method of moving everything into one package to hide the
internals... and I 100% agree that a "correct" refactoring is quite a bit of work. 

I don't want to sound like a complainer since I don't have a patch to fix these things, but
I want to list some things that I would like to fix/refactor also.
* remove the GB2312 dictionary dependency, which limits functionality to Simplified Chinese.
* use Unicode categories (the Java Character class, etc.) instead of Utility.getCharType().
* support code points outside the BMP; this is necessary to support Traditional Chinese.
* a little more flexibility in tokenization. Honestly, I'm really not sold on indexing "words"
for Chinese in the first place, but words + bigrams (overlapping tokens) would be nice.
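
To make a few of these points concrete, here is a rough sketch (not a patch, and not code from
SmartChineseAnalyzer) of what classifying characters with the JDK's Character class and emitting
overlapping bigrams over code points could look like. The class name CodePointSketch, the Kind
enum, and both method names are made up for illustration, and the choice of which Unicode blocks
count as "Han" is an assumption that a real implementation would want to revisit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names exist in Lucene.
class CodePointSketch {

    // Coarse character kinds, loosely analogous to what a char-type utility might return.
    enum Kind { HAN, LETTER, DIGIT, OTHER }

    // Classify one code point (not one char) using only JDK Unicode data.
    static Kind classify(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        // Assumption: these four blocks are treated as Han; Extension B lies outside the BMP.
        if (block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
                || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
                || block == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS) {
            return Kind.HAN;
        }
        if (Character.isDigit(codePoint)) {
            return Kind.DIGIT;
        }
        if (Character.isLetter(codePoint)) {
            return Kind.LETTER;
        }
        return Kind.OTHER;
    }

    // Walk the string by code point so surrogate pairs are seen as one character.
    static List<Kind> classifyAll(String text) {
        List<Kind> kinds = new ArrayList<Kind>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            kinds.add(classify(cp));
            i += Character.charCount(cp);
        }
        return kinds;
    }

    // Emit overlapping bigrams of adjacent Han code points; any non-Han
    // character breaks the run. A lone Han character yields no bigram here.
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<String>();
        int prev = -1; // previous Han code point, or -1 if the run was broken
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (classify(cp) == Kind.HAN) {
                if (prev >= 0) {
                    out.add(new StringBuilder()
                            .appendCodePoint(prev)
                            .appendCodePoint(cp)
                            .toString());
                }
                prev = cp;
            } else {
                prev = -1;
            }
            i += Character.charCount(cp);
        }
        return out;
    }
}
```

The point of iterating with codePointAt() and charCount() instead of charAt() is that a character
such as U+20000 is stored as two chars in a Java String but is still classified and paired as a
single unit. Combining these bigrams with dictionary "words" in one token stream is left out of
the sketch.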

In the future it would be nice to add support for Traditional Chinese, and there is frequency
data out there (libtabe: BSD license, etc.), but we need to refactor first.

As far as what to do for 2.9... I really don't know either, just let me know if you need a
new patch :)


> Move SmartChineseAnalyzer & resources to own contrib project
> ------------------------------------------------------------
>
>                 Key: LUCENE-1728
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1728
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1728.txt, LUCENE-1728.txt, LUCENE-1728.txt
>
>
> SmartChineseAnalyzer depends on a large dictionary that causes the analyzer jar to grow
> to as much as 3MB. The dictionary is quite big compared to all the other resources / class
> files contained in that jar.
> Having a separate analyzer-cn contrib project enables footprint-sensitive users (e.g.
> using lucene on a mobile phone) to include analyzer.jar without getting into trouble with
> disk space.
> Moving SmartChineseAnalyzer to a separate project could also include a small refactoring.
> As Robert mentioned in [LUCENE-1722|https://issues.apache.org/jira/browse/LUCENE-1722], several
> classes should be package protected, members and classes could be final, and commented-out
> syserr and logging code should be removed, etc.
> I set this issue target to 2.9 - if we cannot make it by then, feel free to move it
> to 3.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

