lucene-dev mailing list archives

From "Robert Muir (Commented) (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3883) Analysis for Irish
Date Tue, 20 Mar 2012 04:13:43 GMT


Robert Muir commented on LUCENE-3883:

To make matters worse: this exact example of splitting on hyphen for this Irish case is 
actually mentioned on

From there, it seems the right thing to do is to heuristically convert to
U+2011 (non-breaking hyphen), but this only affects Unicode line-break rules,
not word-break rules :(

So it seems like the least hackish workaround would be for a charfilter to 
convert n-athair -> nAthair (to prevent the tokenizer from splitting it up),
since the IrishLowerCaseFilter will convert it back and stem it anyway.
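
A minimal sketch of that conversion, in plain Java and deliberately not using
Lucene's CharFilter API. The class name, method name, and the choice of
prefixes (n, t) and vowels are assumptions made for illustration, not part of
the patch. A real charfilter would also have to correct offsets, since dropping
the hyphen shortens the text by one character.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IrishHyphenSketch {
    // Hypothetical pattern: a prothetic "n" or "t" that is not preceded by a
    // letter or digit, then a hyphen, then a lowercase vowel.
    // \u00e1 \u00e9 \u00ed \u00f3 \u00fa are the acute-accented vowels.
    private static final Pattern PROTHETIC = Pattern.compile(
        "(?<![\\p{L}\\p{N}])([nt])-([aeiou\u00e1\u00e9\u00ed\u00f3\u00fa])");

    // Rewrites e.g. "n-athair" to "nAthair" so that a tokenizer which splits
    // on hyphens keeps the word in one piece.
    public static String joinProthetic(String text) {
        Matcher m = PROTHETIC.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Keep the prefix, drop the hyphen, uppercase the vowel.
            String replacement = m.group(1) + m.group(2).toUpperCase(Locale.ROOT);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinProthetic("a n-athair"));
    }
}
```

Ordinary hyphenated words are left alone because the prefix must stand at the
start of a word; the uppercase vowel is what a later Irish-aware lowercase
step would need to recognise in order to restore the original form.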

I'll see if I can hack something up.
> Analysis for Irish
> ------------------
>                 Key: LUCENE-3883
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Jim Regan
>            Priority: Trivial
>              Labels: analysis, newbie
>         Attachments: LUCENE-3883.patch, irish.sbl
> Adds analysis for Irish.
> The stemmer is generated from a snowball stemmer. I've sent it to Martin Porter, who says it will be added during the week.
