lucene-dev mailing list archives

From "Ivan Provalov (JIRA)" <>
Subject [jira] [Commented] (LUCENE-7321) Character Mapping
Date Thu, 09 Jun 2016 04:33:21 GMT


Ivan Provalov commented on LUCENE-7321:

Koji, this one works at the token level, which allows things like prefix and suffix manipulation.
The graph generator and collapser also make it user-friendly when dealing with a large number
of mappings (please see the attached description file).

> Character Mapping
> -----------------
>                 Key: LUCENE-7321
>                 URL:
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.6.1, 6.0, 5.4.1, 6.0.1
>            Reporter: Ivan Provalov
>            Priority: Minor
>              Labels: patch
>             Fix For: 6.0.1
>         Attachments: CharacterMappingComponent.pdf, LUCENE-7321.patch
> One of the challenges in search is recall of an item with a common typing variant.  These
cases can be as simple as lower/upper case in most languages or accented characters, or as
complex as morphological phenomena like prefix omission or constructing a character with a
combining mark.  This component addresses cases that are not covered by the ASCII folding
component or that are more complex to design with other tools.  The idea is that a linguist
could provide the mappings in a tab-delimited file, which can then be used directly by Solr.
> The mappings are maintained in a tab-delimited file, which could simply be copied and pasted
from an Excel spreadsheet.  This lets linguists create the mappings and developers include
them in the Solr configuration.  In a few cases, when the mappings grow complex, some
additional debugging may be required.  A mapping can take any sequence of characters to any
other sequence of characters.
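
To make the tab-delimited format concrete, here is a minimal Python sketch. It is not the code from the attached patch. It assumes one mapping per line, with source and target separated by a single tab, and it rewrites a token by greedy longest match from left to right. The names `parse_mappings` and `apply_mappings` and the sample rows are hypothetical and chosen only for illustration.

```python
def parse_mappings(text):
    """Parse tab-delimited 'source<TAB>target' lines into a dict.

    Lines without a tab or with an empty source are skipped. This is an
    assumption; the file format used by the actual patch may differ.
    """
    mappings = {}
    for line in text.splitlines():
        if "\t" not in line:
            continue
        source, target = line.split("\t", 1)
        if source:
            mappings[source] = target
    return mappings


def apply_mappings(token, mappings):
    """Rewrite a token using greedy longest-match replacement, left to right."""
    if not mappings:
        return token
    max_len = max(len(source) for source in mappings)
    out = []
    i = 0
    while i < len(token):
        # Try the longest candidate first so "sz" wins over a shorter "s" rule.
        for length in range(min(max_len, len(token) - i), 0, -1):
            chunk = token[i:i + length]
            if chunk in mappings:
                out.append(mappings[chunk])
                i += length
                break
        else:
            # No mapping starts here: copy the character unchanged.
            out.append(token[i])
            i += 1
    return "".join(out)


# Illustrative rows only, as they might be pasted from a spreadsheet.
SAMPLE = "ё\tе\nł\tl\nsz\tsh\n"
MAPPINGS = parse_mappings(SAMPLE)
```

With these sample rows, `apply_mappings("szum", MAPPINGS)` replaces the two-character sequence `sz` as a unit, and characters with no mapping pass through unchanged.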
> Some of the cases I discuss in the detailed document are handling the voiced vowels for Japanese;
common typing substitutions for Korean, Russian, Polish; transliteration for Polish, Arabic;
prefix removal for Arabic; suffix folding for Japanese.  In the appendix, I give an example
of implementing a Russian lightweight stemmer using this component.
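
Prefix removal and suffix-based light stemming only make sense at the token level, since they depend on where a token starts and ends. The following Python sketch illustrates that idea under stated assumptions: the function `strip_affixes`, its `min_stem` guard, and the affix lists in the usage note are hypothetical and are not taken from the attached document or patch.

```python
def strip_affixes(token, prefixes=(), suffixes=(), min_stem=2):
    """Remove at most one listed prefix and one listed suffix from a token.

    Longer affixes are tried first, and an affix is removed only if at
    least `min_stem` characters would remain afterwards.
    """
    for prefix in sorted(prefixes, key=len, reverse=True):
        if prefix and token.startswith(prefix) \
                and len(token) - len(prefix) >= min_stem:
            token = token[len(prefix):]
            break
    for suffix in sorted(suffixes, key=len, reverse=True):
        if suffix and token.endswith(suffix) \
                and len(token) - len(suffix) >= min_stem:
            token = token[:-len(suffix)]
            break
    return token
```

For example, `strip_affixes("الكتاب", prefixes=("ال",))` drops the leading definite article, and `strip_affixes("книгами", suffixes=("ами", "ы", "а"))` reduces the token to `книг`. The `min_stem` guard keeps very short tokens such as `да` intact.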
