commons-dev mailing list archives

From Gary Gregory <GGreg...@seagullsoftware.com>
Subject [codec] Large test data set!
Date Tue, 25 Jan 2011 20:01:50 GMT
Hi All:

I just found a data set that I would like to integrate with [codec] to test the language package:

http://sourceforge.net/projects/familynamephon/

The test data is a 37 MB text file containing 837K German names along with their encodings in
Cham (?) phonetics, Cologne phonetics, Metaphone, and Soundex.

I have no idea how long it would take to run our language encoders against this, so
I imagine making it an optional unit test. How do you do THAT in Maven?
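
One possible shape for such an opt-in check, as a rough sketch only: the run is gated on a system property, which a Maven profile could pass to the test JVM through Surefire. The property name `codec.largeTests`, the class name, and the tab-separated "name, expected encoding" line layout are all assumptions made for illustration; the actual layout of the data file has not been checked, and the uppercase "encoder" is just a stand-in for a real [codec] encoder.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

class LargeDataCheck {

    // Hypothetical property name; nothing in [codec] defines it.
    static final String ENABLE_PROPERTY = "codec.largeTests";

    // True only when the JVM was started with -Dcodec.largeTests=true.
    static boolean largeTestsEnabled() {
        return Boolean.getBoolean(ENABLE_PROPERTY);
    }

    // Counts lines whose expected encoding differs from the encoder's result.
    // Assumed line layout: name, a tab, expected encoding.
    static int countMismatches(List<String> lines, UnaryOperator<String> encoder) {
        int mismatches = 0;
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length < 2) {
                continue; // skip blank or malformed lines
            }
            if (!fields[1].equals(encoder.apply(fields[0]))) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        if (!largeTestsEnabled()) {
            System.out.println("Skipped; enable with -D" + ENABLE_PROPERTY + "=true");
            return;
        }
        // Stand-in data and encoder; a real run would stream the 37 MB file
        // and call one of the [codec] language encoders instead.
        List<String> sample = Arrays.asList("Meyer\tMEYER", "Schmidt\tSCHMIDT");
        UnaryOperator<String> encoder = s -> s.toUpperCase(Locale.ROOT);
        System.out.println("Mismatches: " + countMismatches(sample, encoder));
    }
}
```

With this approach a default build skips the slow check, and only a build that sets the property pays the cost of running it.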

The data is covered (I think, I do not read German) by this license: http://www.opendatacommons.org/licenses/odbl/1.0/

Thoughts?
Gary Gregory
Senior Software Engineer
Rocket Software
3340 Peachtree Road, Suite 820 * Atlanta, GA 30326 * USA
Tel: +1.404.760.1560
Email: ggregory@seagullsoftware.com
Web: http://www.seagull.rocketsoftware.com/


