incubator-ooo-dev mailing list archives

From "Olivier R." <olivier.nore...@gmail.com>
Subject Hunspell dictionaries are not just words lists (+ other matters)
Date Mon, 07 Nov 2011 11:05:16 GMT
Hello everyone,

I don’t like mailing lists, so I have subscribed here just to explain a 
few things about dictionaries. Then I’ll vanish.

Rob Weir wrote:
> Just make sure that you explain what a spell checking dictionary is.
> Otherwise any legal types will be confused.  This is not a dictionary
> like Webster's, with words and definitions, where the definitions are
> creative content.  A spell checking dictionary is more of a word list.
>   I'm not sure what the creative expression is in a list of all common
> words in a language and how that could be copyrighted.  Of course, I
> am not a lawyer.

Few dictionaries are just word lists. Most of them are lists of words 
tagged with flags, and these flags are described in an affixation file 
which specifies the rules for generating inflections. This affixation 
file can be quite simple or very complex, and writing it can be a 
difficult matter.
   It looks easy at first, but once you go deeper, there are often a lot 
of issues to handle. Creating a proper affixation file can be really 
hard work. And even when the difficulty is not high, it can be a very 
long job.
   So, no, Hunspell dictionaries are not just word lists.
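
To make the "words tagged with flags" idea concrete, here is a minimal Python sketch of how one flagged entry can expand into several forms. It is a toy model, not Hunspell's actual algorithm or file syntax: the `SuffixRule` class, the `expand` function, the flags `S` and `X`, and the two rules are all assumptions made up for illustration.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SuffixRule:
    """Toy suffix rule (hypothetical model, not the real .aff syntax)."""
    flag: str       # flag a dictionary entry uses to request this rule
    strip: str      # characters removed from the end of the stem ("" = none)
    add: str        # characters appended after stripping
    condition: str  # regex fragment the end of the stem must match


def expand(entry: str, rules: list[SuffixRule]) -> set[str]:
    """Return the stem plus every form produced by the rules its flags select."""
    stem, _, flags = entry.partition("/")
    forms = {stem}
    for rule in rules:
        if rule.flag not in flags:
            continue  # the entry did not ask for this rule
        if not re.search(rule.condition + "$", stem):
            continue  # the stem does not end the way the rule requires
        if not stem.endswith(rule.strip):
            continue  # nothing to strip, so the rule cannot apply
        base = stem[: len(stem) - len(rule.strip)]
        forms.add(base + rule.add)
    return forms


# Two illustrative rules: a regular plural and an "-al" to "-aux" plural.
RULES = [
    SuffixRule(flag="S", strip="", add="s", condition="[^sxz]"),
    SuffixRule(flag="X", strip="al", add="aux", condition="al"),
]

for entry in ("chat/S", "cheval/X", "nez"):
    print(entry, "->", sorted(expand(entry, RULES)))
```

Even in this toy version, the person writing the rules has to decide which flag each word takes and which conditions guard each rule. A real affixation file multiplies those decisions by thousands of rules.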

For example, it took me one year and countless hours of work to rewrite 
the affixation file of the French dictionaries from scratch. Even after 
that, there were still a lot of bugs (not spelling mistakes). For one 
year, I had to patch the affixation file regularly. Even after a few 
years, there is still sometimes something to fix. The French 
dictionaries contain approximately 13000 rules.
   Here is an example of one of the most complex flags:
http://www.dicollecte.org/affixes.php?prj=fr&flag=c2

(AFAIK, there is only one dictionary which has a more complex affixation 
file, the Hungarian one.)

I also tagged the affixation file in order to generate 4 different 
dictionaries with a script, so that users can write according to their 
preferences regarding the optional and controversial French spelling 
reform of 1990.
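
The message does not show how the tagging is done, so the following Python sketch only illustrates the general idea of building several dictionaries from one tagged source. Everything in it is hypothetical: the `#@` marker, the four variant names, the `select_lines` function and the sample lines.

```python
# Hypothetical marker that introduces the list of variants a line belongs to.
TAG = "#@"

# Hypothetical variant names; the real four dictionaries may be named differently.
VARIANTS = ("classic", "modern", "reform", "all")


def select_lines(tagged_lines, variant):
    """Keep untagged lines, plus tagged lines that list the requested variant."""
    selected = []
    for line in tagged_lines:
        content, marker, tags = line.partition(TAG)
        if marker and variant not in {t.strip() for t in tags.split(",")}:
            continue  # line is tagged, but not for this variant
        selected.append(content.rstrip())
    return selected


# Toy source: one shared line and two lines that differ by spelling preference.
LINES = [
    "chat/S",
    "oignon/S  #@ classic,modern,all",
    "ognon/S  #@ reform,all",
]

for variant in VARIANTS:
    print(variant, "->", select_lines(LINES, variant))
```

The point of such a design is that the shared lines are maintained once, and only the lines that depend on a spelling preference carry a tag.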

Besides, 99 % of entries have been manually grammatically tagged.
   Several contributors did a tremendous job by adding lexical tags, 
adding many words, moving entries into different subdictionaries 
according to our policy, handling special cases, reporting mistakes and 
issues. Spelling matters are much more complex than you think, 
especially if you want to use your dictionary for grammar checking.
   We often have to handle old, new or variant spellings of a single 
word, and there are decisions to make about what to do with special 
cases, which are actually very numerous. Managing dictionaries is not a 
trivial task.
   Here is the "bugtracker" where we work on the French dictionaries.
http://www.dicollecte.org/propositions.php?prj=fr&tab=E [fr]
   (This bugtracker also allows us to commit changes to the dictionary 
in the database.)
   The changelog:
http://www.dicollecte.org/log.php?prj=fr

This dictionary is used by both French grammar checkers.

What you said about copyright could be right for a list generated by 
script from a corpus, but that’s not true for dictionaries which are 
designed by humans with their knowledge, their work and their choices.


> But we'll never resolve this on legal grounds.  At Apache we would not
> bundle a dictionary under a legal theory if the compiler of the
> dictionary did not want us to.  I think we should respect the
> dictionary compiler's wishes and intent,
> _even if legally we're not obligated to_.

Wow... That’s really not encouraging for people who might consider 
changing the license of their work... Does IBM think the same way?
   A few years ago, when I began to contribute to FLOSS, I thought the 
less restrictive licenses were the better ones, only because I didn’t 
care and was ignorant about licensing and political matters.
   As time goes by, I think the opposite more and more. And when I read 
you, I’m beginning to think I was still too soft on that topic.


> 3) We could contact the compilers of the dictionary and ask if they
> would make them available under a difference license.   Generally
> people make things available under an OSS license because they want to
> see other projects use them.  If we tell them that a leading
> application like OpenOffice can no longer user their dictionary, this
> might persuade them to change their license.

Here is the situation for the French dictionaries:

1. The Hunspell spelling dictionaries
   Licenses: MPL/LGPL/GPL

   As I am the sole author of the affixation file, and as I 
grammatically tagged about 90 % of all entries myself (without copying 
another lexicon with a script), I can say for sure that I do not intend 
to replace these licenses with the Apache one.
   When I built Dicollecte, my goal was to encourage people to 
contribute for everyone's benefit and to give back the improvements they 
made. Switching to the Apache license would contradict everything I did.

   By the way, these dictionaries _require_ Hunspell. They won’t work 
properly with Myspell. I have seen a lot of people assume that Hunspell 
dictionaries will work with Myspell. That’s a wrong assumption. Hunspell 
can use Myspell dictionaries, but Hunspell also offers a lot of new 
features which make it possible to improve the structure of dictionaries.
   And Myspell does not recognize double suffixation or double 
prefixation, cannot handle duplicate lemmas, does not handle 
morphological tags, has a limited number of flags, does not recognize 
Hunspell compound commands, etc. (I am not even sure that Myspell can 
use UTF-8 files.)
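
As an illustration of what "double suffixation" means in practice, here is a small Python sketch in which a suffix rule can hand further flags to the form it produces, so that a second suffix may be stacked on the first. It is a simplified model under stated assumptions: the `Suffix` class, the `next_flags` field, the `expand` function and the flags `A` and `B` are invented for this example and do not reproduce Hunspell's file format or internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suffix:
    """Toy suffix rule that can pass flags on to the form it creates."""
    flag: str             # flag that selects this rule
    add: str              # characters appended to the word
    next_flags: str = ""  # flags carried by the derived form ("" = none)


def expand(stem, flags, rules, max_depth):
    """Generate forms, stacking at most max_depth suffixes on the stem."""
    forms = {stem}
    frontier = [(stem, flags)]
    for _ in range(max_depth):
        next_frontier = []
        for word, word_flags in frontier:
            for rule in rules:
                if rule.flag in word_flags:
                    derived = word + rule.add
                    forms.add(derived)
                    # The derived form may itself accept another suffix.
                    next_frontier.append((derived, rule.next_flags))
        frontier = next_frontier
    return forms


RULES = [
    Suffix(flag="A", add="eur", next_flags="B"),  # chant -> chanteur
    Suffix(flag="B", add="s"),                    # chanteur -> chanteurs
]

print(sorted(expand("chant", "A", RULES, max_depth=1)))  # one suffix level
print(sorted(expand("chant", "A", RULES, max_depth=2)))  # two suffix levels
```

With only one level available, "chanteurs" would need its own entry or its own combined rule, which is why a dictionary structured around two levels cannot simply be handed to a checker that supports one.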

   Fortunately for you, AFAIK, many dictionaries still have a Myspell 
structure. But not the French ones, nor some others.


2. The thesaurus
   The initial and main author released it under the LGPL license.
   He is now dead. AFAIK, there is no way to change the license before 
his work is considered public domain, and there have also been several 
improvements to the initial work.
   At the moment, I am working on it to transform it into a list of 
"synsets" which could be used to generate a better thesaurus. A list of 
synsets would be a far better basis to work on. I don’t know if I will 
succeed. This is a difficult matter and it requires a lot of work.


3. Hyphenation rules
   License: LGPL
   This is a dictionary converted from the hyphenation rules for TeX, 
modified in some way to handle several issues.
   I did nothing on it. I’m just packaging it in the extensions for 
OOo/LibO. You'll have to contact the people who created it.


> 4) We could convert another word list or dictionary, one that has a
> better license,  into Hunspell format.

Hmmm...
   You may generate affixation rules for Myspell with a script… but 
then, these dictionaries will probably be such a mess that you’ll be 
very lucky if you find someone with enough abnegation to improve them. 
The main issues of dictionaries are:
   - if you just create a list of words, you may only improve it with a 
text parser or other lexicons, but it will be hard and annoying to 
improve it manually, as the list will be very, very long, and it will be 
a waste of memory. And each time you regenerate it with your script, 
you’ll have to manually fix again what you fixed before.
   - if you create an affixation file with a script, your dictionary 
will be a mess, with no easy way to improve it, as the dictionary 
structure will not be intuitive for a human being. And again, you cannot 
really mix improvements made by scripting and improvements made by a 
human being.
   The best way is to find a good lexicon, already tagged, under a 
non-restrictive license. Even then, you’ll have to manually write a 
proper affixation file… and then, you may discover it is not the easy 
task you may think it is, unless your language is somehow very logical, 
with neither exceptions nor weird stuff…


Regards,
Olivier R.
