commons-issues mailing list archives

From "Matthew Pocock (Commented) (JIRA)"
Subject [jira] [Commented] (CODEC-132) BeiderMorseEncoder OOM issues
Date Wed, 07 Mar 2012 19:18:57 GMT


Matthew Pocock commented on CODEC-132:


Limiting the size of the set of intermediate phonemes considered is probably a good thing
for this kind of random-string testing, and may well have no discernible negative impact in
normal use. The rules are not really intended to apply to random strings, and words from real
languages (names in particular) are very much not random.
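
To illustrate the idea of capping the intermediate set, here is a minimal, self-contained
sketch. It is not the code from CODEC-132.patch and does not use the Commons Codec API: the
class BoundedPhonemeExpansion, the method expand, and the parameter maxPhonemes are all
hypothetical names, and the "rules" are reduced to appending each alternative to each partial
phoneme. It only shows how a cap stops the set from growing exponentially with input length.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class BoundedPhonemeExpansion {

    // Hypothetical helper (not part of Commons Codec): appends every alternative
    // to every partial phoneme, but stops adding results once maxPhonemes
    // distinct strings are held.
    static Set<String> expand(Set<String> partials, List<String> alternatives, int maxPhonemes) {
        Set<String> next = new LinkedHashSet<>();
        outer:
        for (String partial : partials) {
            for (String alt : alternatives) {
                if (next.size() >= maxPhonemes) {
                    break outer; // cap reached: drop the remaining combinations
                }
                next.add(partial + alt);
            }
        }
        return next;
    }

    public static void main(String[] args) {
        Set<String> phonemes = new LinkedHashSet<>();
        phonemes.add("");
        // Assumption for illustration: three alternatives at every input position.
        List<String> alternatives = Arrays.asList("a", "b", "c");
        for (int position = 0; position < 30; position++) {
            phonemes = expand(phonemes, alternatives, 20);
        }
        // Uncapped, this would be 3^30 strings; with the cap it stays at 20.
        System.out.println(phonemes.size());
    }
}
```

The obvious trade-off is that combinations beyond the cap are silently dropped, which is why
the effect on real names would need checking.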

I've not run a corpus of real names through this code to estimate the normal range of this
phoneme set size. If we start seeing incomplete or strange results after this change, it
would be worth taking that measurement.


> BeiderMorseEncoder OOM issues
> -----------------------------
>                 Key: CODEC-132
>                 URL:
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.6
>            Reporter: Robert Muir
>         Attachments: CODEC-132.patch, CODEC-132_test.patch
> In Lucene/Solr, we integrated this encoder into the latest release.
> Our tests use a variety of random strings, and we have recent jenkins failures
> from some input streams (of length <= 10), using huge amounts of memory (e.g. >
> resulting in OOM.
> I've created a test case (length is 30 here) that will OOM with -Xmx256M. 
> I haven't dug into this much as to what's causing it, but I suspect there might be a
> revolving around certain punctuation characters: we didn't see this happening until
> we beefed up our random string generation to start producing "html-like" strings.