incubator-ooo-dev mailing list archives

From "Olivier R." <>
Subject Re: Hunspell dictionaries are not just words lists (+ other matters)
Date Tue, 08 Nov 2011 10:12:30 GMT
Hello all,

On 08/11/2011 01:58, Rob Weir wrote:

> Spell checking dictionaries are just compilations of facts that are
> constrained by the preexisting external facts of the language.  The
> compiler of the dictionary does not create these facts.  He  merely
> encodes them.  The particular dictionary might be copyrightable as a
> specific selection, coordination and arrangement of these facts, but
> fair use would allow me to extract the  same facts from the
> dictionaries, via reverse engineering, and make my own selection,
> coordination and arrangement of these same facts and distribute them
> as my own dictionary.  In other words, you might be able to protect
> the compilation of facts, but you cannot protect the underlying facts,
> or prevent people from copying your encoding of these facts and
> distributing a different arrangement of them.  Copyright protection on
> a compilation of facts is extremely thin.  It is that simple.

I am no expert on legal matters, and I think you might get different 
legal answers in different countries.

So I’ll try to stay on technical ground.

Let’s assume that someone wants to create a Hunspell dictionary from 
scratch. He finds a huge lexicon of well-organized information about 
his language, a proper list of words with morphological data, tags, etc. 
Let’s assume this is just a compilation of facts.

(Actually, even the claim that this lexicon is a mere compilation of 
facts is arguable, because it can also contain a lot of specific 
classification, personal tags, interpretation data, etc. Otherwise, we 
wouldn’t have had so many arguments when we tagged the French 
dictionary. But let’s assume so.)

Would this list _tell_ him to create an affixation file? No.
Would this list _help_ him to create an affixation file? No.
Is there just one way to create an affixation file from this list? No.

Actually, even if I had had such a lexicon of all the facts about the 
French language when I began work on the affixation file, it would have 
required as much time, as much reflection, and as many personal choices.

Creating an affixation file is on a higher level than just collecting 
data. It’s not a way of classifying, tagging, or selecting data.

So, what is an affixation file? It is a description of a compression 
algorithm: a human-understandable logic for factorizing the data of a 
specific language.

The lexicon could have been compressed with zip, rar, 7z or whatever 
algorithm. In the same way, there are many ways to factorize a lexicon 
with a human-understandable logic.
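
To make this concrete, here is a rough toy sketch in Python. It is not Hunspell's real file format or algorithm: the flags, rules and entries are invented for illustration, and only simple suffix rules are modelled. It shows two different factorizations that expand to exactly the same set of word forms.

```python
# Toy model of an affix-compressed lexicon. Hypothetical data; this is not
# the Hunspell .aff/.dic syntax, only the idea behind it.

def expand(entries, rules):
    """Expand (stem, flags) entries into the full set of word forms.

    rules maps a flag to a list of (strip, add) suffix rules: remove
    `strip` from the end of the stem, then append `add`.
    """
    forms = set()
    for stem, flags in entries:
        forms.add(stem)
        for flag in flags:
            for strip, add in rules[flag]:
                if stem.endswith(strip):
                    forms.add(stem[: len(stem) - len(strip)] + add)
    return forms

# Factorization A: one rule, four entries.
rules_a = {"S": [("", "s")]}
entries_a = [("chanter", ""), ("chante", "S"), ("chantons", ""), ("chantez", "")]

# Factorization B: four rules, a single entry.
rules_b = {"V": [("r", ""), ("r", "s"), ("er", "ons"), ("er", "ez")]}
entries_b = [("chanter", "V")]

forms_a = expand(entries_a, rules_a)
forms_b = expand(entries_b, rules_b)
print(sorted(forms_a))
print(forms_a == forms_b)
```

Both factorizations describe the same data; which one to write is a choice made by the author of the affixation file.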

When I created the French affixation file, one already existed, but I 
was really not satisfied with it, so I rewrote it.
With the previous French dictionary, there were approximately 600 rules 
in the affixation file, and 92,000 entries in the word list.
After one year of work on the new affixation file, there were 
approximately 12,000 rules and 60,000 entries, but this new dictionary 
generates more inflexions than the previous one, and also far fewer 
mistakes (because affixation files can also have a lot of side effects 
and can generate a lot of wrong inflexions).
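
The side effect mentioned above can be sketched in Python as follows. This is a toy example with invented rules and a three-word list, not real Hunspell syntax: a single loose rule produces wrong plurals, while a rule set with a condition on the word ending does not.

```python
# Toy sketch: a loose rule over-generates, a conditioned rule set does not.
# Hypothetical rules and word list; not real Hunspell syntax.

def plural(word, rules):
    """Apply the first rule whose condition (a required ending) matches."""
    for condition, strip, add in rules:
        if word.endswith(condition):
            return word[: len(word) - len(strip)] + add
    return word

words = ["chat", "cheval", "journal"]

loose_rules = [("", "", "s")]           # always append "s"
strict_rules = [("al", "al", "aux"),    # -al becomes -aux
                ("", "", "s")]          # otherwise append "s"

loose = [plural(w, loose_rules) for w in words]
strict = [plural(w, strict_rules) for w in words]
print(loose)   # contains the wrong forms "chevals" and "journals"
print(strict)
```

Even the stricter rule set would still be wrong for exceptions such as "bal" (plural "bals"), which is the kind of case where the author of the affixation file has to make further choices.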

Even now, the compression method could be really different from what it 
is. But the data set would be the same. And, actually, I’m considering 
modifying it to better fit the grammar checker, which retrieves these 
data from Hunspell.

So, can a very specific compression algorithm description for language 
data be copyrighted? I don’t know, but I think this is a creative matter.


