lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-906) Elision filter for simple french analyzing
Date Wed, 13 Jun 2007 07:54:25 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12504152 ]

Hoss Man commented on LUCENE-906:
---------------------------------

i don't know much about french, but a few comments...

1) "stopwords" seems like an odd name for what i would think of as a "prefix" .. you may want
an example in the javadocs to make it clear.

2) are elisions always lowercase?  I imagine there should be an ignoreCase option just like
StopFilter has.  (note that toLowerCase() is hardcoded in the next() method, but nothing ensures
that the stopwords list is lowercased)

3) are there any other characters that can appear between an elision and its root word besides
'\'' ? (i'm particularly wondering about other unicode characters that look like byte 0x27
but are not actually 0x27)

4) this probably doesn't need to be in its own contrib.  contrib/analyzers should be fine
.... if elisions are specific to french, then contrib/analyzers/src/java/org/apache/lucene/analysis/fr/
makes the most sense, otherwise it might make sense to add a new subpackage under analysis
... "linguistics" perhaps (in contrast to the existing "ngram") ?
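
To make points 2 and 3 concrete, here is a minimal standalone sketch in plain Java. It uses no Lucene classes and is not the code from elision.patch. It lowercases the candidate prefix before the lookup and accepts a few apostrophe look-alikes. The class name ElisionSketch, the stripElision method, the prefix list, and the choice of look-alike characters (U+2019, U+02BC, U+FF07) are all assumptions made for illustration:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class ElisionSketch {
    // Characters treated as an apostrophe (an assumed, non-exhaustive list):
    // the ASCII apostrophe, right single quotation mark, modifier letter
    // apostrophe, and fullwidth apostrophe.
    private static final String APOSTROPHES = "'\u2019\u02BC\uFF07";

    // Hypothetical prefix list, stored lowercased once so that the
    // lookup below can lowercase the candidate and compare safely.
    private static final Set<String> PREFIXES =
        new HashSet<>(Arrays.asList("l", "d", "j", "m", "t", "s", "n", "c", "qu"));

    // Returns the term without its elided prefix, or the term unchanged
    // when the text before the first apostrophe is not a known prefix.
    public static String stripElision(String term) {
        for (int i = 0; i < term.length(); i++) {
            if (APOSTROPHES.indexOf(term.charAt(i)) >= 0) {
                String prefix = term.substring(0, i).toLowerCase(Locale.ROOT);
                if (PREFIXES.contains(prefix) && i + 1 < term.length()) {
                    return term.substring(i + 1);
                }
                return term; // only the first apostrophe is considered
            }
        }
        return term; // no apostrophe at all
    }

    public static void main(String[] args) {
        System.out.println(stripElision("l'avion"));
        System.out.println(stripElision("L\u2019avion"));
        System.out.println(stripElision("aujourd'hui"));
    }
}
```

Storing the prefixes lowercased and lowercasing only the candidate prefix avoids the mismatch described in point 2, where a hardcoded toLowerCase() meets a list that was never normalized.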

> Elision filter for simple french analyzing
> ------------------------------------------
>
>                 Key: LUCENE-906
>                 URL: https://issues.apache.org/jira/browse/LUCENE-906
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Mathieu Lecarme
>         Attachments: elision.patch
>
>
> If you don't want to use stemming, StandardAnalyzer misses some French peculiarities like elision.
> "l'avion", which means "the plane", must be tokenized as "avion" (plane).
> This filter could be used with other Latin languages where elision exists.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

