Message-ID: <4EB7BB6C.10304@gmail.com>
Date: Mon, 07 Nov 2011 12:05:16 +0100
From: "Olivier R."
To: ooo-dev@incubator.apache.org
Subject: Hunspell dictionaries are not just words lists (+ other matters)

Hello everyone,

I don’t like mailing lists, so I have subscribed here just to explain a
few things about dictionaries. Then I’ll vanish.

Rob Weir wrote:

> Just make sure that you explain what a spell checking dictionary is.
> Otherwise any legal types will be confused. This is not a dictionary
> like Webster's, with words and definitions, where the definitions are
> creative content. A spell checking dictionary is more of a word list.
> I'm not sure what the creative expression is in a list of all common
> words in a language and how that could be copyrighted. Of course, I
> am not a lawyer.

A few dictionaries are just word lists, but most of them are lists of
words tagged with flags. The flags are described in an affixation file,
which specifies the rules used to generate inflections. This affixation
file can be quite simple or very complex, and writing it can be a
difficult matter. It looks easy at first, but when you get deeper into
it, there are often a lot of issues to handle. Creating a proper
affixation file can be really hard work. And even if the difficulty is
not high, it can be a very long job. So, no, Hunspell dictionaries are
not just word lists.

For example, it took me one year and countless hours of work to rewrite
the affixation file of the French dictionaries from scratch. Even after
that, there were still a lot of bugs (not spelling mistakes).
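
The mechanism described above (entries tagged with flags, with the rules
for each flag kept in a separate affixation file) can be illustrated with
a minimal Python sketch. Everything in it is an assumption made for
illustration: the names SuffixRule, parse_suffix_rules and expand are
hypothetical, the three toy entries and two rules are not taken from the
French dictionaries, and only single-character flags and plain suffix
(SFX) rules are handled. Prefixes, cross-products, double suffixation and
the other Hunspell features are left out.

```python
import re
from dataclasses import dataclass

# Toy affixation rules (hypothetical, NOT taken from the French dictionaries).
# Header line: SFX <flag> <cross-product Y/N> <rule count>
# Rule line:   SFX <flag> <strip> <add> <condition>
AFF = """\
SFX S Y 1
SFX S 0 s [^sxz]
SFX X Y 1
SFX X al aux al
"""


@dataclass
class SuffixRule:
    strip: str      # characters removed from the end of the word ("" for none)
    add: str        # characters appended after stripping
    condition: str  # regex-like pattern that the end of the word must match


def parse_suffix_rules(aff_text):
    """Parse simplified SFX lines into a {flag: [SuffixRule, ...]} mapping."""
    rules = {}
    for line in aff_text.splitlines():
        parts = line.split()
        # Rule lines have 5 fields; the 4-field header lines are skipped.
        if len(parts) != 5 or parts[0] != "SFX":
            continue
        _, flag, strip, add, condition = parts
        rule = SuffixRule("" if strip == "0" else strip,
                          "" if add == "0" else add,
                          condition)
        rules.setdefault(flag, []).append(rule)
    return rules


def expand(entry, rules):
    """Return the sorted forms generated by a 'word/FLAGS' dictionary entry."""
    word, _, flags = entry.partition("/")
    forms = {word}
    for flag in flags:  # assumption: one character per flag
        for rule in rules.get(flag, []):
            if not word.endswith(rule.strip):
                continue
            if not re.search(rule.condition + "$", word):
                continue
            stem = word[:len(word) - len(rule.strip)]
            forms.add(stem + rule.add)
    return sorted(forms)


if __name__ == "__main__":
    rules = parse_suffix_rules(AFF)
    for entry in ["chat/S", "cheval/X", "nez"]:
        print(entry, "->", expand(entry, rules))
```

In this toy case three entries stand for five surface forms. A real
affixation file has to get every condition and exception right, which is
where the long and difficult work lies.
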
For a year, I had to patch the affixation file regularly. Even after a
few years, there is still sometimes something to fix.

The French dictionaries contain approximately 13000 rules. Here is an
example of one of the most complex flags:
http://www.dicollecte.org/affixes.php?prj=fr&flag=c2
(AFAIK, only one dictionary has a more complex affixation file: the
Hungarian one.)

I also tagged the affixation file so that a script can generate 4
different dictionaries, giving users the means to write according to
their preference regarding the optional and controversial French
spelling reform of 1990.

Besides, 99 % of entries have been grammatically tagged by hand. Several
contributors did a tremendous job by adding lexical tags, adding many
words, moving entries between subdictionaries according to our policy,
handling special cases, and reporting mistakes and issues. Spelling
matters are much more complex than you think, especially if you want to
use your dictionary for grammar checking. We often have to handle old,
new or variant spellings of a single word, and there are decisions to
make about what to do with special cases, which are actually very
numerous. Managing dictionaries is not a trivial task.

Here is the "bugtracker" where we work on the French dictionaries:
http://www.dicollecte.org/propositions.php?prj=fr&tab=E [fr]
(This bugtracker also allows us to commit changes to the dictionary in
the database.)

The changelog: http://www.dicollecte.org/log.php?prj=fr

This dictionary is used by both French grammar checkers.

What you said about copyright could be right for a list generated by a
script from a corpus, but it is not true for dictionaries that are
designed by humans, with their knowledge, their work and their choices.

> But we'll never resolve this on legal grounds. At Apache we would not
> bundle a dictionary under a legal theory if the compiler of the
> dictionary did not want us to.
> I think we should respect the
> dictionary compiler's wishes and intent,
> _even if legally we're not obligated to_.

Wow... That’s really not encouraging for people who may be considering
changing the license of their work... Does IBM think the same way?

A few years ago, when I began to contribute to FLOSS, I thought the less
restrictive licenses were the better ones, only because I didn’t care
and was ignorant about licensing and political matters. As time goes by,
I think more and more the opposite. And when I read you, I’m beginning
to think I was still too soft on that topic.

> 3) We could contact the compilers of the dictionary and ask if they
> would make them available under a difference license. Generally
> people make things available under an OSS license because they want to
> see other projects use them. If we tell them that a leading
> application like OpenOffice can no longer user their dictionary, this
> might persuade them to change their license.

Here is the situation for the French dictionaries:

1. The Hunspell spelling dictionaries

Licenses: MPL/LGPL/GPL

As I am the sole author of the affixation file, and as I grammatically
tagged about 90 % of all entries myself (without copying another lexicon
with a script), I can say for sure that I do not intend to change the
licenses to the Apache one. When I built Dicollecte, my goal was to
encourage people to contribute for everyone and to give back the
improvements they made. Switching to the Apache license would contradict
everything I did.

By the way, these dictionaries _require_ Hunspell. They won’t work
properly with Myspell. I have seen a lot of people assume that Hunspell
dictionaries will work with Myspell. That assumption is wrong. Hunspell
can use Myspell dictionaries, but Hunspell also offers a lot of new
features which make it possible to improve the structure of
dictionaries.
Myspell does not recognize double suffixation or double prefixation,
cannot handle duplicate lemmas, does not handle morphological tags, has
a limited number of flags, does not recognize Hunspell compound
commands, etc. (I am not even sure that Myspell can use UTF-8 files.)
But, good for you, AFAIK, many dictionaries still have a Myspell
structure. Not the French ones, though, nor some others.

2. The thesaurus

The initial and main author released it under the LGPL license. He has
since died. AFAIK, there is no way to change the license before his work
enters the public domain, and there have also been several improvements
to the initial work.

At the moment, I am working on transforming it into a list of "synsets"
which could be used to generate a better thesaurus. A list of synsets
would be a far better basis to work on. I don’t know if I will succeed.
This is a difficult matter and it requires a lot of work.

3. Hyphenation rules

License: LGPL

This is a dictionary converted from the hyphenation rules for TeX,
modified somewhat to handle several issues. I did nothing on it; I am
just packaging it in the extensions for OOo/LibO. You'll have to contact
the people who created it.

> 4) We could convert another word list or dictionary, one that has a
> better license, into Hunspell format.

Hmmm... You may generate affixation rules for Myspell with a script…
but then these dictionaries will probably be such a mess that you’ll be
very lucky to find someone with enough abnegation to improve them.

The main issues with dictionaries are:

- if you just create a list of words, you can only improve it with a
  text parser or other lexicons. It will be hard and annoying to improve
  it manually, as the list will be very, very long, and it will waste
  memory. And each time you regenerate it with your script, you’ll have
  to redo manually the fixes you made before.
- if you create an affixation file with a script, your dictionary will
  be a mess, with no easy way to improve it, as the dictionary structure
  will not be intuitive for a human being. And again, you cannot really
  mix improvements made by scripts and improvements made by humans.

The best way is to find a good lexicon that is already tagged and has a
non-restrictive license. Even then, you’ll have to write a proper
affixation file by hand… and then you may discover that it is not the
easy task you thought it was, unless your language is somehow very
logical, with neither exceptions nor weird stuff…

Regards,
Olivier R.