lucene-dev mailing list archives

From "Thomas Peuss (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1166) A tokenfilter to decompose compound words
Date Wed, 06 Feb 2008 17:35:10 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566220#action_12566220 ]

Thomas Peuss commented on LUCENE-1166:
--------------------------------------

bq. Looking at http://offo.sourceforge.net/hyphenation/licenses.html, which seems to be the
same information as in the off-hyphenation.zip file you attached to this issue, the license
issue may be a problem - the hyphenation data is covered by different licenses on a per-language
basis. For example, there are two German data files, and both are licensed under a LaTeX license,
as is the Danish file, and these two languages are the most likely targets for your TokenFilter.
IANAL, but unless Apache licenses can be secured for this data, I don't think the files can
be incorporated directly into an Apache project.

This is true, and that is why I uploaded the two files without the ASF license grant. The FOP
project does not include the files in its code base either, because of the licensing problem.

bq. Also, I don't see Swedish among the hyphenation data licenses - is it covered in some
other way?
OFFO has no Swedish grammar file, but we can generate one from the LaTeX grammar files. I
will have a look into this tonight.

All other hyphenation implementations I have found so far use them either directly or in a
converted variant, like the FOP code. What we can do, of course, is ask the authors of the
LaTeX files whether they would license their work under the ASF license as well. It is worth
a try, but I suspect that many of the email addresses in the LaTeX files are no longer in use.
I will try to contact the authors of the German grammar files tomorrow.

BTW: an example for those who don't want to try the patch:
+Input token stream:+
Rindfleischüberwachungsgesetz Drahtschere abba

+Output token stream:+
(Rindfleischüberwachungsgesetz,0,29)
(Rind,0,4,posIncr=0)
(fleisch,4,11,posIncr=0)
(überwachung,11,22,posIncr=0)
(gesetz,23,29,posIncr=0)
(Drahtschere,30,41)
(Draht,30,35,posIncr=0)
(schere,35,41,posIncr=0)
(abba,42,46)
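For those curious how the sub-tokens above come about: below is a minimal, self-contained Java sketch of a greedy, dictionary-based decomposition that produces the same (term, startOffset, endOffset) triples. It is an illustration only; the actual patch drives the split with FOP's hyphenation code rather than a word list, and the `DecompoundSketch` class and its `decompose` method are hypothetical names, not part of the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch: mimics the (term, start, end) triples in the
// example output. The real filter finds break points via hyphenation;
// the dictionary lookup here is a stand-in for that step.
public class DecompoundSketch {

    // Decompose 'word' into known parts; offsets are relative to 'base',
    // the start offset of the compound in the original text.
    static List<String> decompose(String word, int base, Set<String> dict) {
        List<String> out = new ArrayList<>();
        // the unmodified compound is emitted first, at its full offsets
        out.add("(" + word + "," + base + "," + (base + word.length()) + ")");
        String lower = word.toLowerCase(Locale.GERMAN);
        int pos = 0;
        while (pos < lower.length()) {
            int matchEnd = -1;
            // try the longest dictionary match starting at 'pos' first
            for (int end = lower.length(); end > pos; end--) {
                if (dict.contains(lower.substring(pos, end))) {
                    matchEnd = end;
                    break;
                }
            }
            if (matchEnd < 0) {
                pos++; // skip linking elements like the "s" in "...überwachungs-gesetz"
                continue;
            }
            out.add("(" + word.substring(pos, matchEnd) + ","
                    + (base + pos) + "," + (base + matchEnd) + ",posIncr=0)");
            pos = matchEnd;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("draht", "schere");
        System.out.println(decompose("Drahtschere", 30, dict));
        // prints [(Drahtschere,30,41), (Draht,30,35,posIncr=0), (schere,35,41,posIncr=0)]
    }
}
```

Note the posIncr=0 on every sub-token: it makes each part occupy the same position as the compound itself, which is what lets a query for "Schiff" match a document containing only "Donaudampfschiff".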

> A tokenfilter to decompose compound words
> -----------------------------------------
>
>                 Key: LUCENE-1166
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1166
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Thomas Peuss
>         Attachments: CompoundTokenFilter.patch, de.xml, hyphenation.dtd
>
>
> A tokenfilter to decompose compound words found in many Germanic languages (like German,
> Swedish, ...) into single tokens.
> An example: Donaudampfschiff would be decomposed to Donau, dampf, schiff so that you
> can find the word even when you only enter "Schiff".
> I use the hyphenation code from the Apache XML project FOP (http://xmlgraphics.apache.org/fop/)
> to do the first step of decomposition. Currently I use the FOP jars directly. I only use a
> handful of classes from the FOP project.
> My question now:
> Would it be OK to copy these classes over to the Lucene project (renaming the packages,
> of course), or should I stick with the dependency on the FOP jars? The FOP code uses the ASF
> V2 license as well.
> What do you think?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

