commons-issues mailing list archives

From "Ben Kazez (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CODEC-248) language.DaitchMokotoffSoundex gives overly broad results for tokens containing RS
Date Mon, 06 Aug 2018 23:57:00 GMT

    [ https://issues.apache.org/jira/browse/CODEC-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16570937#comment-16570937 ]

Ben Kazez commented on CODEC-248:
---------------------------------

================ Ben

Right now GIERSZLIK is matching GOTSALK. This is because GIERSZLIK
codes to 548500 or 594850, and GOTSALK codes to 548500. According to
Morse's site, GIERSZLIK should code to just 594850. I believe the
confusion is about this:

> When adjacent sounds can combine to form a larger sound, they are given the code number
> of the larger sound

SZ and RS are both "larger sounds." One listing of rules I found
online says that the tokens must be matched in order, which means that
"RSZ" would be interpreted as "R SZ" instead of "RS Z". That makes
sense to me, but I didn't find any mention of that on Avotaynu after
some brief searching.
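
To make the two readings concrete, here is a minimal Java sketch. It is not the Commons Codec implementation: the class name, the `encode` helper, and the small code table are hypothetical, and the table covers only the letters in these two names. The tokens are segmented by hand, and the code for RS is passed in explicitly so both alternatives (94 and 4) can be compared. It assumes that adjacent identical codes collapse into one and that uncoded vowels break adjacency.

```java
import java.util.List;
import java.util.Map;

public class DmSegmentationSketch {
    // Hypothetical subset of the D-M table, limited to the sounds in
    // GIERSZLIK and GOTSALK. Non-initial vowels are not coded ("").
    static final Map<String, String> CODES = Map.ofEntries(
        Map.entry("G", "5"), Map.entry("K", "5"), Map.entry("L", "8"),
        Map.entry("R", "9"), Map.entry("S", "4"), Map.entry("Z", "4"),
        Map.entry("SZ", "4"), Map.entry("TS", "4"),
        Map.entry("IE", ""), Map.entry("I", ""),
        Map.entry("O", ""), Map.entry("A", ""));

    // Encodes an already-segmented name. rsCode is the code to use for an
    // "RS" token ("94" or "4"). Tokens outside the table are not handled.
    static String encode(List<String> tokens, String rsCode) {
        StringBuilder out = new StringBuilder();
        String last = "";
        for (String token : tokens) {
            String code = token.equals("RS") ? rsCode : CODES.get(token);
            // Skip uncoded sounds and repeats of the previous code.
            if (!code.isEmpty() && !last.endsWith(code)) {
                out.append(code);
            }
            last = code; // an uncoded vowel resets this to ""
        }
        while (out.length() < 6) {
            out.append('0'); // pad to six digits
        }
        return out.substring(0, 6);
    }

    public static void main(String[] args) {
        List<String> rThenSz = List.of("G", "IE", "R", "SZ", "L", "I", "K");
        List<String> rsThenZ = List.of("G", "IE", "RS", "Z", "L", "I", "K");
        List<String> gotsalk = List.of("G", "O", "TS", "A", "L", "K");

        System.out.println(encode(rThenSz, "94")); // 594850 ("R SZ")
        System.out.println(encode(rsThenZ, "94")); // 594850 ("RS Z", RS as 94)
        System.out.println(encode(rsThenZ, "4"));  // 548500 ("RS Z", RS as 4)
        System.out.println(encode(gotsalk, "94")); // 548500
    }
}
```

Under these assumptions, only the reading that treats RS as a single sound coded 4 produces 548500, which is the code that collides with GOTSALK.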

Is there some official standard for D-M rules? What does it say about
when two "larger sound" interpretations are possible?

Many thanks!

Ben

================ Gary

Ben:

I would drop RS from the table. Randy Daitch created the table and I cannot think of any language
where RS is pronounced "S" (4).

Gary


> language.DaitchMokotoffSoundex gives overly broad results for tokens containing RS
> ----------------------------------------------------------------------------------
>
>                 Key: CODEC-248
>                 URL: https://issues.apache.org/jira/browse/CODEC-248
>             Project: Commons Codec
>          Issue Type: Bug
>            Reporter: Ben Kazez
>            Priority: Minor
>
> I am using Apache commons codec in Elasticsearch (via Lucene).
> # GIERSZLIK codes to 548500 or 594850
> # GOTSALK codes to 548500
> # These names don't sound alike, but the matching codes mean a search for one returns
> the other.
> Solution: I exchanged emails with Gary Mokotoff, co-creator of the algorithm, who said:
> {quote}I would drop RS from the table. ... I cannot think of any language where RS is
> pronounced "S" (4).{quote}
>  
>   



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
