lucene-dev mailing list archives

From Michał Dybizbański (JIRA) <j...@apache.org>
Subject [jira] [Updated] (LUCENE-2341) explore morfologik integration
Date Tue, 21 Jun 2011 21:17:53 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michał Dybizbański updated LUCENE-2341:
---------------------------------------

    Attachment: LUCENE-2341.diff

Thank you guys for the suggestions :)

I've changed the diff to include them:

1. Implemented MorfologikFilter.reset, which resets stemsAcc, and added a test case that
fails without that implementation; it reproduces the behaviour mentioned by
Robert.

2. Updated modules/analysis/NOTICE.txt and modules/analysis/LICENSE.txt - Robert, is that
what you meant, or do they need to include more information?

3. MorfologikFilter now uses an explicit pointer, so stemsAcc is no longer modified on each pass -
Dawid, do you think it's reasonable to optimize further and use the list returned by
IStemmer.lookup directly (instead of copying it with addAll)? My concern is that (at least in the
current DictionaryLookup implementation) that list seems to be shared between distinct invocations
of the lookup method, which would make a given IStemmer instance unusable in thread-safe
code.

4. Removed the explicit call to getStem().toString().
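
To make points 1 and 3 concrete, here is a minimal, self-contained sketch of the idea; it is not the code from the attached diff. SketchStemmer, SharedBufferStemmer and StemCursorSketch are hypothetical names made up for illustration and are not the Morfologik or Lucene APIs. The sketch assumes a stemmer whose lookup reuses one list across calls, and shows why the copy plus an explicit cursor keeps the accumulated stems stable, and what reset has to clear.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for a stemmer interface; NOT the Morfologik IStemmer API. */
interface SketchStemmer {
    List<String> lookup(String word);
}

/** Hypothetical stemmer that reuses one list across calls, mimicking the sharing concern. */
final class SharedBufferStemmer implements SketchStemmer {
    private final List<String> shared = new ArrayList<>();

    @Override
    public List<String> lookup(String word) {
        shared.clear(); // every call overwrites what the previous call returned
        String lower = word.toLowerCase(Locale.ROOT);
        shared.add(lower);
        shared.add(lower + "#alt");
        return shared;
    }
}

/** Hypothetical accumulator: copies stems once per token, then walks them with a cursor. */
final class StemCursorSketch {
    private final SketchStemmer stemmer;
    private final List<String> stemsAcc = new ArrayList<>();
    private int pos = 0; // explicit pointer; stemsAcc itself is not modified while iterating

    StemCursorSketch(SketchStemmer stemmer) {
        this.stemmer = stemmer;
    }

    /** Looks up a new token and copies the result, so later lookups cannot change it. */
    void load(String word) {
        stemsAcc.clear();
        stemsAcc.addAll(stemmer.lookup(word));
        pos = 0;
    }

    /** Returns the next stem, or null when the current token's stems are exhausted. */
    String next() {
        return pos < stemsAcc.size() ? stemsAcc.get(pos++) : null;
    }

    /** Drops any leftover stems so that a reused instance starts from a clean state. */
    void reset() {
        stemsAcc.clear();
        pos = 0;
    }
}
```

If load returned the stemmer's list directly instead of copying it, a second lookup on the same stemmer (for example from another thread) would silently replace the stems still being iterated, which is the concern raised in point 3.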


As for the new Morfologik version, I've been thinking it would be better to alter the constructors
of MorfologikAnalyzer and MorfologikFilter to accept concrete IStemmer implementations, instead
of a languageCode String as they do now. This way, the org.apache.lucene.analysis.morfologik package
wouldn't depend on the current implementations of IStemmer (only on the interface), and it would also
allow future ones to be used without changing the package. What do you think?
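
A rough sketch of what I mean by the constructor change, using made-up names: StemmerLike and InjectedStemmerSketch are hypothetical and only stand in for IStemmer and for the analyzer/filter, and falling back to the token itself when nothing is found is an assumption of this sketch, not a claim about the diff.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

/** Hypothetical stand-in for the stemmer interface; NOT the Morfologik IStemmer API. */
interface StemmerLike {
    List<String> lookup(CharSequence word);
}

/** Hypothetical analyzer-like class that depends only on the interface. */
final class InjectedStemmerSketch {
    private final StemmerLike stemmer;

    /** The caller chooses the implementation; no language code is resolved here. */
    InjectedStemmerSketch(StemmerLike stemmer) {
        this.stemmer = Objects.requireNonNull(stemmer, "stemmer");
    }

    /** Returns a copy of the stems, or the token itself if none were found (assumed fallback). */
    List<String> stems(CharSequence token) {
        List<String> found = stemmer.lookup(token);
        if (found == null || found.isEmpty()) {
            return Collections.singletonList(token.toString());
        }
        return new ArrayList<>(found);
    }
}
```

A client would then pass in whatever stemmer it has built itself, including one that does not exist yet, and the package would compile against the interface only.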

That could also solve the case of a custom attribute for POS tags (MorfologikPOSAttribute?):
since a client would instantiate their IStemmer explicitly, they would know the meaning
of the attribute's value. That doesn't take into account the DICTIONARY.COMBINED stemmer,
but the same seems to apply to the Morfologik library itself (I mean, for a specific WordData
from IStemmer.lookup there is no information on which of the internal concrete DictionaryLookup
instances it comes from). Dawid - what do you think of that issue?


> explore morfologik integration
> ------------------------------
>
>                 Key: LUCENE-2341
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2341
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Robert Muir
>            Assignee: Dawid Weiss
>         Attachments: LUCENE-2341.diff, LUCENE-2341.diff, morfologik-stemming-1.5.0.jar
>
>
> Dawid Weiss mentioned on LUCENE-2298 that there is another Polish stemmer available:
> http://sourceforge.net/projects/morfologik/
> This works differently than LUCENE-2298, and ideally would be another option for users.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

