incubator-ooo-dev mailing list archives

From Rob Weir <robw...@apache.org>
Subject Re: Hunspell dictionaries are not just words lists (+ other matters)
Date Mon, 07 Nov 2011 11:51:13 GMT
On Mon, Nov 7, 2011 at 6:05 AM, Olivier R. <olivier.noreply@gmail.com> wrote:
> Hello everyone,
>
> I don’t like mailing-lists, so I have subscribed here just to explain a few
> things about dictionaries. Then I’ll vanish.
>
> Rob Weir wrote:
>>
>> Just make sure that you explain what a spell checking dictionary is.
>> Otherwise any legal types will be confused.  This is not a dictionary
>> like Webster's, with words and definitions, where the definitions are
>> creative content.  A spell checking dictionary is more of a word list.
>>  I'm not sure what the creative expression is in a list of all common
>> words in a language and how that could be copyrighted.  Of course, I
>> am not a lawyer.
>
> A few dictionaries are just word lists, but most of them are lists of words
> tagged with flags described in an affixation file, which specifies the
> rules for generating inflections. This affixation file can be quite simple or
> very complex. And this can be a difficult matter.
>  It looks easy at first, but when you begin to get deeper in this matter,
> there are often a lot of issues to handle. Creating a proper affixation file
> can really be hard work. And even if the difficulty is
> not high, this can be a very long job.
>  So, no, Hunspell dictionaries are not just words lists.
>
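
For readers unfamiliar with the flag-and-affix mechanism described in the
quoted text, here is a minimal sketch of how a flagged entry can expand into
several word forms. It is only an illustration under simplifying assumptions:
the rule table, the flag "S", and the function name expand are hypothetical,
and real Hunspell affix files support much more (prefixes, cross products,
compounding, and so on). Each rule loosely mirrors the shape of an SFX line in
an .aff file: characters to strip, characters to add, and a condition on the
ending of the stem.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SuffixRule:
    strip: str      # characters removed from the end of the stem ("" for none)
    add: str        # characters appended after stripping
    condition: str  # regular expression the stem must match


# Hypothetical rule table keyed by flag; not taken from any real dictionary.
RULES: Dict[str, List[SuffixRule]] = {
    "S": [
        SuffixRule(strip="y", add="ies", condition=r"[^aeiou]y$"),
        SuffixRule(strip="", add="s", condition=r"[aeiou]y$"),
        SuffixRule(strip="", add="s", condition=r"[^sxzhy]$"),
        SuffixRule(strip="", add="es", condition=r"[sxzh]$"),
    ],
}


def expand(entry: str) -> List[str]:
    """Expand a 'stem/FLAGS' entry into the stem plus its generated forms."""
    stem, _, flags = entry.partition("/")
    forms = [stem]
    for flag in flags:  # assumes one character per flag
        for rule in RULES.get(flag, []):
            if re.search(rule.condition, stem) and stem.endswith(rule.strip):
                base = stem[: len(stem) - len(rule.strip)]
                forms.append(base + rule.add)
    return forms


if __name__ == "__main__":
    for entry in ["pony/S", "day/S", "box/S", "word"]:
        print(entry, "->", expand(entry))
```

Even this toy version needs four rules for one English plural flag, which
gives some idea of why an affixation file with thousands of rules takes so
long to write and debug.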

What you say above does not really make a legal difference.  What
makes something copyright-able is creative expression, not hard work.
You could spend decades collecting data on bird populations, measuring
the positions of stars, recording the names of everyone in your town,
etc., and this could all be very hard work.  But in the end what you
have is just a set of facts.  It is not a creative work.

If you read the Wikipedia link I sent before, you can see how the
courts have rejected the "sweat of the brow" theory for copyright.

> For example, it took me one year and countless hours of work to rewrite the
> affixation file of the French dictionaries from scratch. Even after that,
> there were still a lot of bugs (not spelling mistakes). For one year, I had
> to patch regularly the affixation file. Even after few years, there is still
> sometimes something to fix. The French dictionaries contain approximately
> 13000 rules.

But if you are just expressing the facts of the language, encoding the
well-known affix rules that already exist, then that may not be creative
expression.

I don't mean to diminish the effort.  Here is an analogy.  A
mathematician can work his entire life to prove a new theorem, but
cannot get a patent for that proof, because it is not a patentable
subject matter.  But someone else could come up with a trivial idea
and get a patent for it.  Effort and difficulty are not the primary
criteria for what defines intellectual property.  Sorry.

>  Here is an example of one of the most complex flags:
> http://www.dicollecte.org/affixes.php?prj=fr&flag=c2
>
> (AFAIK, there is only one dictionary which has a more complex affixation
> file, the Hungarian one.)
>
> I also tagged the affixation file in order to generate 4 different
> dictionaries with a script, to offer users the means to write according to
> their preferences towards the optional and controversial French spelling
> reform of 1990.
>
> Besides, 99 % of entries have been manually grammatically tagged.
>  Several contributors did a tremendous job by adding lexical tags, adding
> many words, moving entries in different subdictionaries according to our
> policy, handling special cases, reporting mistakes and issues. Because,
> spelling matters are much more complex than you think,
> especially if you want to use your dictionary for grammar checking.
>  We often have to handle old, new or variant spelling just for one word, and
> there are decisions to take about what to do with special cases, which are
> actually very numerous. Managing dictionaries is not a trivial task.

No one is saying the effort was trivial.  It would also not be trivial
to catalog the positions of all the visible stars.  But that does not
make it a creative effort that can be protected by copyright.

>  Here is the "bugtracker" where we work on the French dictionaries.
> http://www.dicollecte.org/propositions.php?prj=fr&tab=E [fr]
>  (This bugtracker also allows us to commit in the dictionary in the
> database.)
>  The changelog:
> http://www.dicollecte.org/log.php?prj=fr
>
> This dictionary is used by both French grammar checkers.
>
> What you said about copyright could be right for a list generated by script
> from a corpus, but that’s not true for dictionaries that are conceived by
> humans with their knowledge, their work and their choices.
>

It depends on whether there is creative expression or not.  If it is
just fact collection and encoding with a quality checking process
behind it, then I'm not so sure.

>
>> But we'll never resolve this on legal grounds.  At Apache we would not
>> bundle a dictionary under a legal theory if the compiler of the
>> dictionary did not want us to.  I think we should respect the
>> dictionary compiler's wishes and intent,
>> _even if legally we're not obligated to_.
>
> Wow... That’s really not encouraging for people who may consider to change
> the license of their work... Does IBM think the same way?

Maybe it was unclear what I said here.  I said that even if we thought
a work was not copyright-able, we would still not distribute it if the
dictionary compiler did not want us to.

As for IBM, we have our own spell checking dictionaries, so this is
not an issue.

>  Few years ago, when I began to contribute for FLOSS, I thought the less
> restrictive licenses were the better ones, only because I didn’t care and I
> was ignorant about licensing and political matters.
>  As time goes, I think more and more the opposite. And when I read you, I’m
> beginning to think I was still too soft on that topic.
>
>
>> 3) We could contact the compilers of the dictionary and ask if they
>> would make them available under a different license.   Generally
>> people make things available under an OSS license because they want to
>> see other projects use them.  If we tell them that a leading
>> application like OpenOffice can no longer use their dictionary, this
>> might persuade them to change their license.
>
> Here is the situation for the French dictionaries:
>
> 1. The Hunspell spelling dictionaries
>  Licenses: MPL/LGPL/GPL
>

That's fine.  If the French dictionaries are MPL we can use them.  Thanks!

<snip>

> 2. The thesaurus
>  The initial and main author released it under license LGPL.
>  Now he’s dead. AFAIK, there is no way to change the license before his work


I think you can make a better argument for a thesaurus being a creative work.

> is considered public domain, but there have also been several
> improvements on the initial work.

Copyright lasts many years beyond an author's death.

>  At the moment, I am working on it to transform it as a list of "synsets"
> which could be used to generate a better thesaurus. A list of synsets would
> be a far better basis to work on. I don’t know if I will succeed. This is a
> difficult matter and it requires a lot of work.
>
>
> 3. Hyphenation rules
>  Licence LGPL.
>  This is a dictionary converted from the hyphenation rules for TeX,
> modified somehow to handle several issues.
>  I did nothing on it. I’m just packaging it in the extensions for
> OOo/LibO. You'll have to contact the people who created it.
>
>
>> 4) We could convert another word list or dictionary, one that has a
>> better license,  into Hunspell format.
>
> Hmmm...
>  You may generate affixation rules for Myspell with a script… but then,
> these dictionaries will probably be such a mess that you’ll be very lucky if
> you find someone with enough abnegation to improve it. The main issues of
> dictionaries are:
>  - if you just create a list of words, you may only improve it with text

<snip>

Yes, it would be difficult.  That is why I put it last on the list.
It is not our first preference.

-Rob
