commons-issues mailing list archives

From "Marc Pompl (JIRA)" <>
Subject [jira] Commented: (CODEC-107) Enhance documentation for Language Encoders
Date Sat, 05 Feb 2011 00:35:30 GMT


Marc Pompl commented on CODEC-107:

Do you mean I should provide a patch for the documentation? Where do I find its "source"?
Is it somewhere in the repository? I would give it a try.

Regarding your assumption about performance comparisons, I agree with you if, and only if, you
have already been pointed to the "best encoder for your encoding needs".
If you have to respect speed (or, let's say, responsiveness) as a key business value, then
you sometimes have to trade off speed against accuracy. As you know, your encoders are used in
search engines like SOLR in a basic manner. The critical aspect is the indexing of searchable
data. If your business case handles large data sets that change frequently, slow indexing could
really hurt the search performance.
So, in my opinion, it would be nice if the documentation gave a rough idea of how fast each
encoder performs. Otherwise, every application performance engineer has to write a
tiny test scenario, as I did.
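
A tiny test scenario of that kind could look roughly like the sketch below. It is only an illustration under stated assumptions: the class name EncoderSpeedCheck, the method encodingsPerMillis, and the warm-up count are hypothetical and not part of Commons Codec, and the encoder is a stand-in function so that the example runs with the JDK alone. It times with System.nanoTime() and divides as floating point, so a very short run does not end in a division by zero or a truncated result.

```java
import java.util.function.Function;

/** Hypothetical timing harness for illustration; not part of Commons Codec. */
final class EncoderSpeedCheck {

    /** Untimed calls made first so the JIT has a chance to compile the hot path. */
    private static final int WARMUP_CALLS = 10_000;

    /** Written once per measurement so the encoder results are actually used. */
    private static volatile int sink;

    /**
     * Returns the measured throughput in encodings per millisecond.
     * The encoder is passed as a plain function (assumed to return non-null)
     * so the sketch depends on nothing beyond the JDK.
     */
    static double encodingsPerMillis(Function<String, String> encoder, String input, int repeats) {
        if (repeats <= 0) {
            throw new IllegalArgumentException("repeats must be positive");
        }
        int checksum = 0;
        for (int i = 0; i < WARMUP_CALLS; i++) {
            checksum += encoder.apply(input).length();
        }
        long start = System.nanoTime();
        for (int i = 0; i < repeats; i++) {
            checksum += encoder.apply(input).length();
        }
        // Clamp to 1 ns so an extremely fast run still yields a finite rate.
        long elapsedNanos = Math.max(1L, System.nanoTime() - start);
        sink = checksum;
        return repeats / (elapsedNanos / 1_000_000.0);
    }

    public static void main(String[] args) {
        // Stand-in encoder for illustration only; a real encoder's
        // String-to-String encode method would take its place.
        Function<String, String> standIn = s -> s.toUpperCase();
        double rate = encodingsPerMillis(standIn, "Schmitt", 1_000_000);
        System.out.printf("stand-in encodings per msec: %.1f%n", rate);
    }
}
```

The numbers from such a loop are only a rough indication; they depend on the JVM, the hardware, and the input string, so they are best read as relative values between encoders measured in the same run.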

> Enhance documentation for Language Encoders
> -------------------------------------------
>                 Key: CODEC-107
>                 URL:
>             Project: Commons Codec
>          Issue Type: Improvement
>    Affects Versions: 1.4
>            Reporter: Marc Pompl
>            Priority: Minor
>             Fix For: 1.5
>   Original Estimate: 1h
>  Remaining Estimate: 1h
> The current userguide ( just lists four Language Encoders, but there are five at the moment. CODEC-106 implements a sixth one.
> It would be a good idea to complete the documentation.
> Additionally, I suggest extending the userguide in order to show a simple performance
> _SNIP_
> org.apache.commons.codec.language.Metaphone encodings per msec: 327
> org.apache.commons.codec.language.DoubleMetaphone encodings per msec: 224
> org.apache.commons.codec.language.Soundex encodings per msec: 904
> org.apache.commons.codec.language.RefinedSoundex encodings per msec: 637
> org.apache.commons.codec.language.Caverphone encodings per msec: 5
> org.apache.commons.codec.language.ColognePhonetic encodings per msec: 289
> So, Soundex is the fastest encoder. Caverphone is much slower than any other algorithm. All others show nearly the same performance.
> Checked with the following code:
> {code:java}
>   private static final int REPEATS = 1000000;
>   public void checkSpeed() throws Exception {
>     checkSpeedEncoding(new Metaphone(), "easgasg", REPEATS);
>     checkSpeedEncoding(new DoubleMetaphone(), "easgasg", REPEATS);
>     checkSpeedEncoding(new Soundex(), "easgasg", REPEATS);
>     checkSpeedEncoding(new RefinedSoundex(), "easgasg", REPEATS);
>     checkSpeedEncoding(new Caverphone(), "Carlene", 100000);
>     checkSpeedEncoding(new ColognePhonetic(), "Schmitt", REPEATS);
>   }
>   private void checkSpeedEncoding(Encoder encoder, String toBeEncoded, int repeats) throws Exception {
>     long start = System.currentTimeMillis();
>     for (int i = 0; i < repeats; i++) {
>       encoder.encode(toBeEncoded);
>     }
>     long duration = System.currentTimeMillis() - start;
>     System.out.println(encoder.getClass().getName() + " encodings per msec: " + (repeats / duration));
>   }
> {code}
> _SNAP_
