lucene-dev mailing list archives

From "Thomas Peuss (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1166) A tokenfilter to decompose compound words
Date Mon, 03 Mar 2008 10:34:50 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574418#action_12574418 ]

Thomas Peuss commented on LUCENE-1166:
--------------------------------------

bq. Thomas, I think that might work for Chinese - going through the "string" of Chinese characters,
one at a time, and looking up a dictionary after each additional character. Once you find a
dictionary match, you look at one more character. If that matches a dictionary entry, keep
doing that as long as you keep matching dictionary entries (in order to grab the longest
dictionary-matching string of characters). If the next character does not match, then the
previous/last character was the end of the dictionary entry. That would work, no?

I have started to look into this. I will add the constructor parameter "onlyLongestMatch"
(default is false).
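
The scan described in the quote above can be sketched roughly as follows. This is only an illustration, not the code in the attached patch: the class name LongestMatchSegmenter and the method segment are hypothetical, the dictionary is a plain in-memory set, and the single-character fallback for unmatched text is an assumption. It also differs slightly from the quoted description: instead of stopping at the first character that fails to match, it checks every length up to the longest dictionary entry, so a long entry is still found when one of its prefixes is not itself an entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LongestMatchSegmenter {

    /**
     * Splits text left to right, taking the longest dictionary entry that
     * starts at the current position. Works on Java chars, so characters
     * outside the Basic Multilingual Plane are not handled specially.
     */
    public static List<String> segment(String text, Set<String> dictionary) {
        // No candidate longer than the longest dictionary entry can match.
        int maxLen = 1;
        for (String entry : dictionary) {
            maxLen = Math.max(maxLen, entry.length());
        }

        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            // Assumed fallback: emit a single character if nothing matches.
            int bestEnd = start + 1;
            int limit = Math.min(text.length(), start + maxLen);
            for (int end = start + 1; end <= limit; end++) {
                if (dictionary.contains(text.substring(start, end))) {
                    bestEnd = end; // remember the longest match seen so far
                }
            }
            tokens.add(text.substring(start, bestEnd));
            start = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dictionary =
                new HashSet<>(Arrays.asList("ice", "icecream", "cream", "cone"));
        System.out.println(segment("icecreamcone", dictionary)); // [icecream, cone]
    }
}
```

A Latin-script example is used only to keep the sketch readable; the same loop applies to a run of Chinese characters with a suitable dictionary.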

bq. As for the license info, I think you could take the approach where the required libraries
are not included in the contribution in the ASF repo, but are downloaded on the fly, at build
time, much like some other contributions. Could you do that?

I already pull the grammar files for the tests. But I don't know if it makes sense to pull
them at build time, because the end user can easily download them. I need the XML versions
now, so the jar file from SourceForge does not help anymore (I have included the needed classes
from the FOP project; they use the ASF license as well).

> A tokenfilter to decompose compound words
> -----------------------------------------
>
>                 Key: LUCENE-1166
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1166
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Thomas Peuss
>         Attachments: CompoundTokenFilter.patch, CompoundTokenFilter.patch, CompoundTokenFilter.patch, de.xml, hyphenation.dtd
>
>
> A tokenfilter to decompose compound words you find in many Germanic languages (like German,
> Swedish, ...) into single tokens.
> An example: Donaudampfschiff would be decomposed to Donau, dampf, schiff so that you
> can find the word even when you only enter "Schiff".
> I use the hyphenation code from the Apache XML project FOP (http://xmlgraphics.apache.org/fop/)
> to do the first step of decomposition. Currently I use the FOP jars directly. I only use a
> handful of classes from the FOP project.
> My question now:
> Would it be OK to copy these classes over to the Lucene project (renaming the packages
> of course) or should I stick with the dependency on the FOP jars? The FOP code uses the ASF
> V2 license as well.
> What do you think?
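
To make the Donaudampfschiff example concrete, here is a minimal dictionary-only sketch. It is not the approach in the patch, which uses the FOP hyphenation code for the first step; it simply tries every substring against a word list. The names CompoundDecomposer, decompose and minSubwordLength are hypothetical, and the meaning given to onlyLongestMatch (keep at most the longest dictionary word per start offset) is an assumption about what the flag is intended to do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class CompoundDecomposer {

    /**
     * Returns the dictionary words found inside a compound word, lowercased.
     * With onlyLongestMatch set, at most one subword (the longest) is kept
     * for each start offset; otherwise every match is returned.
     */
    public static List<String> decompose(String word, Set<String> dictionary,
                                         int minSubwordLength, boolean onlyLongestMatch) {
        String lower = word.toLowerCase(Locale.ROOT);
        List<String> subwords = new ArrayList<>();
        for (int start = 0; start + minSubwordLength <= lower.length(); start++) {
            String longest = null;
            for (int end = start + minSubwordLength; end <= lower.length(); end++) {
                String candidate = lower.substring(start, end);
                if (dictionary.contains(candidate)) {
                    if (onlyLongestMatch) {
                        longest = candidate; // later matches are always longer
                    } else {
                        subwords.add(candidate);
                    }
                }
            }
            if (longest != null) {
                subwords.add(longest);
            }
        }
        return subwords;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(
                Arrays.asList("donau", "dampf", "schiff", "dampfschiff"));
        // All matches: [donau, dampf, dampfschiff, schiff]
        System.out.println(decompose("Donaudampfschiff", dictionary, 3, false));
        // Longest match per start offset: [donau, dampfschiff, schiff]
        System.out.println(decompose("Donaudampfschiff", dictionary, 3, true));
    }
}
```

Trying every substring is quadratic in the word length, which is acceptable for single tokens but is one reason a hyphenation pass that proposes candidate split points first can be attractive.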

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

