lucene-dev mailing list archives

From "Dawid Weiss (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-2341) explore morfologik integration
Date Tue, 21 Jun 2011 08:21:48 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052421#comment-13052421 ]

Dawid Weiss commented on LUCENE-2341:
-------------------------------------

I did some analyses on both dictionaries.
{noformat}
Number of lines (distinct surface forms):

  3.662.366 morfologik.utf8
  5.086.141 sgjp.utf8

Distinct words (not in both):

  2.729.334 unique.utf8

  - upper/lower case (morfologik has upper case forms, morfeusz only lower case surface forms)
    
    acerze
    Acerze

  - very rare or jargon;

    abszminka
    abszytowałem
    acetobakteria
    acetarsolowi
    niebombiasto
    hakatystce
    hakatystycznościach
    warzże

  - differences in spelling;

    abelard
    abélard

  - acronyms and super-short stuff

    aap
    aar

Distinct normalized (lowercase):

  2.564.366 lowered.utf8

  Most of these are very infrequent words or inflection forms. There are minor differences or
  missing surface forms in both dictionaries, as shown here (mz - morfeusz, mk - morfologik):

mz> hakersko
mz> hakerskość
mz> hakerskości
mz> hakerskością
mz> hakerskościach
mz> hakerskościami
mz> hakerskościom
mk> hakerstw
mk> hakerstwa
...
mk> hakowałyśmy
mk> hakowań
mk> hakowaniach
mk> hakowaniami
mk> hakowaniom
mz> hakowatość
mz> hakowatości
mz> hakowatością
mz> hakowatościach
mz> hakowatościami
mz> hakowatościom
{noformat}
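
For anyone who wants to repeat this kind of comparison, a minimal sketch follows. It is not the script used for the numbers above. It assumes each file holds one surface form per line, and it takes the "normalized" step to mean lowercasing both sides before comparing, which is only one plausible reading. The function names load_forms and compare_forms are made up for illustration.

```python
def load_forms(path):
    """Read one surface form per line (assumed layout), skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def compare_forms(mk_forms, mz_forms):
    """Return (unique, lowered, tagged) for two sets of surface forms."""
    # Forms present in exactly one of the two dictionaries.
    unique = mk_forms ^ mz_forms
    # Case-insensitive comparison: lowercase both sides first, then diff.
    mk_low = {form.lower() for form in mk_forms}
    mz_low = {form.lower() for form in mz_forms}
    lowered = mk_low ^ mz_low
    # Tag each remaining form with the dictionary that contains it.
    tagged = [("mk" if form in mk_low else "mz", form) for form in sorted(lowered)]
    return unique, lowered, tagged
```

Calling compare_forms(load_forms("morfologik.utf8"), load_forms("sgjp.utf8")) would give the two difference sets plus a tagged listing in the mz>/mk> style. Note that sorted() orders by code point, not by Polish collation.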

So... the conclusion is pretty consistent with Zipf's law: the two dictionaries differ
considerably in coverage, even though both are quite large. We don't have a frequency dictionary
for Polish, but I assume most of these surface forms are purely theoretical and occur very rarely
in practice. That said, I think we should use BOTH dictionaries -- after all, there's no harm
done if we overdo the lemmatization process a little, is there?

So... my proposal would be this: I'll integrate Morfeusz's dictionary in Morfologik (as an
alternative dictionary one can load and use). 

Eventually it would probably be sensible to limit the automaton used in Lucene to surface
forms and lemmas only (no POS tags) and to merge both dictionaries into a single automaton...
but this can be a future improvement.
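
To make the merge idea concrete, here is a rough sketch of the data it would produce. It assumes tab-separated entries in the order surface form, lemma, tag, which may not match the actual layout of either dictionary. A plain Python dict stands in for the automaton, and parse_entries and merge_dictionaries are hypothetical names, not part of Morfologik or Lucene.

```python
from collections import defaultdict


def parse_entries(lines, sep="\t"):
    """Yield (surface, lemma) pairs; further columns such as POS tags are dropped."""
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        # Skip malformed lines that lack a surface form or a lemma.
        if len(fields) >= 2 and fields[0] and fields[1]:
            yield fields[0], fields[1]


def merge_dictionaries(*sources):
    """Merge several entry streams into one surface form -> set of lemmas mapping."""
    merged = defaultdict(set)
    for source in sources:
        for surface, lemma in parse_entries(source):
            # A set keeps one copy of a lemma that both dictionaries agree on.
            merged[surface].add(lemma)
    return dict(merged)
```

Each argument can be any iterable of lines, such as an open file. Entries that differ only in their tags collapse into a single surface form and lemma pair, which is where the size reduction from dropping POS tags would come from.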



> explore morfologik integration
> ------------------------------
>
>                 Key: LUCENE-2341
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2341
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Robert Muir
>            Assignee: Dawid Weiss
>         Attachments: LUCENE-2341.diff, morfologik-stemming-1.5.0.jar
>
>
> Dawid Weiss mentioned on LUCENE-2298 that there is another Polish stemmer available:
> http://sourceforge.net/projects/morfologik/
> This works differently than LUCENE-2298, and ideally would be another option for users.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

