lucene-dev mailing list archives

From "Robert Muir (Commented) (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3883) Analysis for Irish
Date Tue, 20 Mar 2012 03:27:44 GMT


Robert Muir commented on LUCENE-3883:

Thanks for updating the patch, Jim!

One concern, after some very rudimentary testing:

We have special lowercasing for situations like nAthair -> n-athair, which the Snowball rules then strip:

define initial_morph as (
  [substring] among (
    'h-' 'n-' 't-' //nAthair -> n-athair, but alone are problematic
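
To make the interaction concrete, here is a rough plain-Python sketch of the two steps (not the patch's Java code and not the Snowball-generated stemmer). The function names, the vowel set, and the exact rule "lowercase n or t followed by an uppercase vowel gets a hyphen" are assumptions for illustration only.

```python
# Hypothetical sketch; names and the exact rule are assumptions, not the patch's code.
IRISH_UPPER_VOWELS = "AEIOUÁÉÍÓÚ"


def irish_lowercase(token):
    """Lowercase a token, hyphenating a lowercase n/t prefix before an uppercase vowel."""
    if len(token) > 1 and token[0] in "nt" and token[1] in IRISH_UPPER_VOWELS:
        # nAthair -> n-athair
        return token[0] + "-" + token[1:].lower()
    return token.lower()


def strip_initial_morph(token):
    """Strip a leading 'h-', 'n-' or 't-', the prefixes listed in the quoted rule."""
    for prefix in ("h-", "n-", "t-"):
        if token.startswith(prefix):
            return token[len(prefix):]
    return token


lowered = irish_lowercase("nAthair")       # "n-athair"
stemmed = strip_initial_morph(lowered)     # "athair"
```

Note that both steps work on a single token, so they only help if the tokenizer hands over nAthair (or n-athair) in one piece.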

The problem is that if the input already arrives as n-athair, the Unicode word-break rules
split it on the hyphen into two tokens, {n, athair}. You can visualize this at

This means we can add many spurious 'n' tokens to the index...
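
A quick way to see the effect, sketched in plain Python: the regex below is only a stand-in for the Unicode break rules (it treats the hyphen as a separator, which is the behavior described above), and the sample text is made up.

```python
import re


def tokenize(text):
    # Stand-in for the Unicode word-break tokenizer: runs of word
    # characters become tokens, so a hyphen acts as a separator.
    return re.findall(r"\w+", text)


# The capitalized form survives as a single token...
print(tokenize("ár nAthair"))        # ['ár', 'nAthair']

# ...but the already-hyphenated form is split, leaving stray 'n' tokens.
tokens = tokenize("ár n-athair agus a n-ainm")
print(tokens)                        # ['ár', 'n', 'athair', 'agus', 'a', 'n', 'ainm']
print(tokens.count("n"))             # 2
```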

So we have two potential solutions to this:
# We can simply add 'n', 'h', 't', etc. to the stopwords list. This is the simplest solution. Would this be too aggressive?
# We can add a CharFilter for IrishAnalyzer to prevent this splitting from happening. This is more complex.
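
The two options can be compared with a rough plain-Python sketch. Nothing here is Lucene API: the tokenizer is a regex stand-in, and PREFIX_STOPWORDS, option1_stopwords, and option2_char_filter are hypothetical names. For option 2 the sketch assumes one possible rewrite (n-athair back to nAthair before tokenizing, so the later lowercasing step sees one token); a real CharFilter could do this differently.

```python
import re

PREFIX_STOPWORDS = {"h", "n", "t"}  # hypothetical stopword additions

# Lowercase h/n/t at a word start, then a hyphen, then a vowel (assumed rule).
PREFIX_HYPHEN = re.compile(r"\b([hnt])-([aeiouáéíóú])")


def tokenize(text):
    # Stand-in tokenizer: the hyphen acts as a separator.
    return re.findall(r"\w+", text)


def option1_stopwords(text):
    """Option 1: tokenize as usual, then drop the stray prefix letters."""
    return [t for t in tokenize(text.lower()) if t not in PREFIX_STOPWORDS]


def option2_char_filter(text):
    """Option 2: rewrite n-athair to nAthair before tokenizing, so it is not split."""
    rewritten = PREFIX_HYPHEN.sub(lambda m: m.group(1) + m.group(2).upper(), text)
    return tokenize(rewritten)


print(option1_stopwords("ár n-athair agus a n-ainm"))  # ['ár', 'athair', 'agus', 'a', 'ainm']
print(option1_stopwords("Model T"))                     # ['model']
print(option2_char_filter("ár n-athair"))               # ['ár', 'nAthair']
```

The "Model T" case shows the cost of option 1: any genuine single-letter 'n', 'h' or 't' token disappears too. Option 2 avoids that but has to run before tokenization and needs its own rule for what counts as a prefix.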

> Analysis for Irish
> ------------------
>                 Key: LUCENE-3883
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Jim Regan
>            Priority: Trivial
>              Labels: analysis, newbie
>         Attachments: LUCENE-3883.patch, irish.sbl
> Adds analysis for Irish.
> The stemmer is generated from a snowball stemmer. I've sent it to Martin Porter, who
> says it will be added during the week.
