incubator-ooo-dev mailing list archives

From Pedro Giffuni <...@apache.org>
Subject Re: Hunspell dictionaries are not just words lists (+ other matters)
Date Mon, 07 Nov 2011 11:37:06 GMT
For the record,

I respect that this type of work takes a *lot* of time and
hard work, and that people do have the right to make their
work copyleft.

There is, however, for practical purposes, a huge difference
for us between MPL/LGPL (the French case) and GPL-only (the
Italian case).

Pedro.

--- On Mon, 11/7/11, Olivier R. <olivier.noreply@gmail.com> wrote:


> Hello everyone,
> 
> I don’t like mailing lists, so I have subscribed here
> just to explain a few things about dictionaries. Then I’ll
> vanish.
> 
> Rob Weir wrote:
> > Just make sure that you explain what a spell checking dictionary is.
> > Otherwise any legal types will be confused.  This is not a dictionary
> > like Webster's, with words and definitions, where the definitions are
> > creative content.  A spell checking dictionary is more of a word list.
> > I'm not sure what the creative expression is in a list of all common
> > words in a language and how that could be copyrighted.  Of course, I
> > am not a lawyer.
> 
> A few dictionaries are just word lists, but most of them are
> lists of words tagged with flags. The flags are described in an
> affixation file, which specifies the rules for generating
> inflections. This affixation file can be quite simple or very
> complex, and this can be a difficult matter.
>   It looks easy at first, but when you dig deeper into it,
> there are often a lot of issues to handle. Creating a proper
> affixation file can be really hard work. And even when the
> difficulty is not high, it can be a very long job.
>   So, no, Hunspell dictionaries are not just word lists.
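
To make the mechanism above concrete, here is a minimal Python sketch of how a flagged entry expands into inflected forms. It assumes a heavily simplified rule format loosely modeled on Hunspell's SFX lines (flag, characters to strip, characters to add, ending condition). The two rules, the flags S and X, and the sample entries are hypothetical and are not taken from the French dictionaries.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SuffixRule:
    flag: str       # flag character that dictionary entries refer to
    strip: str      # characters removed from the end of the stem ("" for none)
    add: str        # characters appended after stripping
    condition: str  # regex that the end of the stem must match


# Hypothetical rules, for illustration only.
RULES = [
    SuffixRule("S", "", "s", r"[^sxz]"),  # plain plural: chat -> chats
    SuffixRule("X", "al", "aux", r"al"),  # cheval -> chevaux
]


def expand(entry, rules=RULES):
    """Return the stem followed by every form generated by the entry's flags."""
    stem, _, flags = entry.partition("/")
    forms = [stem]
    for rule in rules:
        if rule.flag not in flags:
            continue
        if not re.search(rule.condition + "$", stem):
            continue  # the rule does not apply to this ending
        base = stem[: len(stem) - len(rule.strip)] if rule.strip else stem
        forms.append(base + rule.add)
    return forms


if __name__ == "__main__":
    for entry in ["chat/S", "cheval/X", "nez/S", "maison"]:
        print(entry, "->", expand(entry))
```

In this sketch the condition is what keeps "nez/S" from producing a wrong plural. Real affixation files have to encode a great many such conditions and exceptions, which is where the effort described above goes.
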
> 
> For example, it took me one year and countless hours of
> work to rewrite the affixation file of the French
> dictionaries from scratch. Even after that, there were still
> a lot of bugs (not spelling mistakes). For one year, I had
> to patch the affixation file regularly. Even after a few
> years, there is still sometimes something to fix. The French
> dictionaries contain approximately 13000 rules.
>   Here is an example of one of the most complex flags:
> http://www.dicollecte.org/affixes.php?prj=fr&flag=c2
> 
> (AFAIK, there is only one dictionary which has a more
> complex affixation file, the Hungarian one.)
> 
> I also tagged the affixation file in order to generate 4
> different dictionaries with a script, to give users the
> means to write according to their preference regarding the
> optional and controversial French spelling reform of 1990.
> 
> Besides, 99 % of the entries have been grammatically tagged
> by hand.
>   Several contributors did a tremendous job by adding
> lexical tags, adding many words, moving entries into different
> subdictionaries according to our policy, handling special
> cases, and reporting mistakes and issues. Spelling
> matters are much more complex than you think,
> especially if you want to use your dictionary for grammar
> checking.
>   We often have to handle old, new or variant spellings
> of a single word, and there are decisions to make about
> what to do with special cases, which are actually very
> numerous. Managing dictionaries is not a trivial task.
>   Here is the "bugtracker" where we work on the French
> dictionaries:
> http://www.dicollecte.org/propositions.php?prj=fr&tab=E [fr]
>   (This bugtracker also allows us to commit changes to the
> dictionary in the database.)
>   The changelog:
> http://www.dicollecte.org/log.php?prj=fr
> 
> This dictionary is used by both French grammar
> checkers.
> 
> What you said about copyright could be right for a list
> generated by a script from a corpus, but that’s not true for
> dictionaries that are designed by humans, with their
> knowledge, their work and their choices.
> 
> 
> > But we'll never resolve this on legal grounds.  At Apache we would not
> > bundle a dictionary under a legal theory if the compiler of the
> > dictionary did not want us to.  I think we should respect the
> > dictionary compiler's wishes and intent,
> > _even if legally we're not obligated to_.
> 
> Wow... That’s really not encouraging for people who may
> consider changing the license of their work... Does IBM
> think the same way?
>   A few years ago, when I began to contribute to FLOSS,
> I thought the least restrictive licenses were the best
> ones, only because I didn’t care and I was ignorant about
> licensing and political matters.
>   As time goes by, I think more and more the opposite.
> And when I read what you wrote, I’m beginning to think I
> was still too soft on that topic.
> 
> 
> > 3) We could contact the compilers of the dictionary and ask if they
> > would make them available under a different license.  Generally
> > people make things available under an OSS license because they want to
> > see other projects use them.  If we tell them that a leading
> > application like OpenOffice can no longer use their dictionary, this
> > might persuade them to change their license.
> 
> Here is the situation for the French dictionaries:
> 
> 1. The Hunspell spelling dictionaries
>   Licenses: MPL/LGPL/GPL
> 
>   As I am the sole author of the affixation file, and as I
> grammatically tagged about 90 % of all entries myself
> (without copying another lexicon with a script), I can say
> for sure that I do not intend to replace these licenses with
> the Apache one.
>   When I built Dicollecte, my goal was to encourage
> people to contribute for everyone’s benefit and to give back
> the improvements they made. Switching to the Apache license
> would contradict everything I did.
> 
>   By the way, these dictionaries _require_ Hunspell.
> They won’t work properly with Myspell. I have seen a lot of
> people assume that Hunspell dictionaries will work with Myspell.
> That’s a wrong assumption. Hunspell can use Myspell
> dictionaries, but Hunspell also offers a lot of new features
> which make it possible to improve the structure of dictionaries.
>   Myspell does not recognize double suffixation or
> double prefixation, cannot handle duplicate lemmas, does not
> handle morphological tags, has a limited number of flags,
> does not recognize Hunspell compound commands, etc. (I am
> not even sure that Myspell can use UTF-8 files.)
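
As a rough illustration of what double suffixation means, the Python sketch below lets a suffix rule carry continuation flags, so that a generated form can itself take a further suffix. The rule table is a simplified assumption modeled on the drink / drinkable / drinkables pattern. It is not the actual Hunspell or Myspell implementation, and the `max_depth` parameter is a hypothetical stand-in for how many suffix levels an engine supports.

```python
# Simplified rule table (an assumption for illustration):
# flag -> list of (suffix, continuation_flags).
RULES = {
    "X": [("able", "Y")],  # forms made with -able may continue with flag Y
    "Y": [("s", "")],      # plain plural, no further continuation
}


def expand(stem, flags, max_depth):
    """Generate forms, following continuation flags up to max_depth levels."""
    forms = [stem]
    if max_depth == 0:
        return forms
    for flag in flags:
        for suffix, continuation in RULES.get(flag, []):
            forms.extend(expand(stem + suffix, continuation, max_depth - 1))
    return forms


if __name__ == "__main__":
    print(expand("drink", "X", max_depth=1))  # one suffix level only
    print(expand("drink", "X", max_depth=2))  # two suffix levels
```

With a single level, "drinkables" is never generated, so a dictionary written around two-level rules loses word forms when read by an engine that only understands one level.
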
> 
>   Fortunately for you, AFAIK, many dictionaries still
> have a Myspell structure, but not the French ones and some
> others.
> 
> 
> 2. The thesaurus
>   The initial and main author released it under
> the LGPL.
>   He has since died. AFAIK, there is no way to change
> the license before his work enters the public domain,
> but there have also been several improvements to the initial
> work.
>   At the moment, I am working on transforming it into
> a list of "synsets" which could be used to generate a better
> thesaurus. A list of synsets would be a far better basis to
> work on. I don’t know if I will succeed. This is a
> difficult matter and it requires a lot of work.
> 
> 
> 3. Hyphenation rules
>   License: LGPL
>   This is a dictionary converted from the hyphenation
> rules for TeX, modified somewhat to handle several issues.
>   I did nothing on it. I’m just packaging it in the
> extensions for OOo/LibO. You’ll have to contact the people
> who created it.
> 
> 
> > 4) We could convert another word list or dictionary, one that has a
> > better license, into Hunspell format.
> 
> Hmmm...
>   You may generate affixation rules for Myspell with a
> script… but then these dictionaries will probably be such
> a mess that you’ll be very lucky if you find someone with
> enough self-denial to improve them. The main issues with
> dictionaries are:
>   - if you just create a list of words, you may only
> improve it with a text parser or other lexicons, but it will
> be hard and annoying to improve it manually, as the list
> will be very, very long, and it will waste memory. And
> each time you regenerate it with your script, you’ll
> have to redo manually the fixes you made before.
>   - if you create an affixation file with a script, your
> dictionary will be a mess, with no easy way to improve it,
> as the dictionary structure will not be intuitive for a human
> being. And again, you cannot really mix improvements made by
> scripts with improvements made by humans.
>   The best way is to get, from somewhere, a good lexicon that
> is already tagged and has a non-restrictive license. Even then,
> you’ll have to write a proper affixation file manually…
> and then you may discover it is not the easy task you may
> think it is, unless your language is somehow very logical,
> with neither exceptions nor weird stuff…
> 
> 
> Regards,
> Olivier R.
> 
