lucene-dev mailing list archives

From "Grant Ingersoll (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1166) A tokenfilter to decompose compound words
Date Wed, 30 Apr 2008 01:20:57 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593198#action_12593198 ]

Grant Ingersoll commented on LUCENE-1166:
-----------------------------------------

This looks pretty good, Thomas.  I think the last thing worth adding is a start-to-finish
usage example in the package docs, much like the one in the test case.  You might also want
to explain a little bit about where to get the hyphenation files, etc. (if I am understanding
them correctly).

I think if we can finish that up, we can look to commit.

The other interesting thing here, as an aside, is that the Ternary Tree might be worth pulling
up to a "util" package (no need to do so now, just thinking out loud), as it could be used
for other interesting things.  For instance, see http://www.javaworld.com/javaworld/jw-02-2001/jw-0216-ternary.html
The version we have needs a little work, but I have been thinking about how it might be
used to improve spelling, etc.
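
For readers unfamiliar with the data structure mentioned above, here is a minimal sketch of a
ternary search tree in Java. It is not the implementation from the patch or from FOP; the class
name TernarySearchTreeSketch and its methods are hypothetical and exist only for illustration.
Each node holds one character plus three children: "lo" for smaller characters, "hi" for larger
ones, and "eq" for the next character of the same word.

```java
/** Illustrative ternary search tree storing a set of non-empty strings. */
class TernarySearchTreeSketch {

    /** One character per node, with three children. */
    private static final class Node {
        final char ch;
        Node lo;            // words whose current character is smaller than ch
        Node eq;            // continuation of words that match ch at this position
        Node hi;            // words whose current character is larger than ch
        boolean isWordEnd;  // true if an inserted word ends at this node

        Node(char ch) {
            this.ch = ch;
        }
    }

    private Node root;

    /** Adds a word to the tree. Empty or null words are rejected. */
    void insert(String word) {
        if (word == null || word.isEmpty()) {
            throw new IllegalArgumentException("word must be non-empty");
        }
        root = insert(root, word, 0);
    }

    private Node insert(Node node, String word, int i) {
        char c = word.charAt(i);
        if (node == null) {
            node = new Node(c);
        }
        if (c < node.ch) {
            node.lo = insert(node.lo, word, i);
        } else if (c > node.ch) {
            node.hi = insert(node.hi, word, i);
        } else if (i < word.length() - 1) {
            node.eq = insert(node.eq, word, i + 1);
        } else {
            node.isWordEnd = true;
        }
        return node;
    }

    /** Returns true only if exactly this word was inserted earlier. */
    boolean contains(String word) {
        if (word == null || word.isEmpty()) {
            return false;
        }
        Node node = root;
        int i = 0;
        while (node != null) {
            char c = word.charAt(i);
            if (c < node.ch) {
                node = node.lo;
            } else if (c > node.ch) {
                node = node.hi;
            } else if (i == word.length() - 1) {
                return node.isWordEnd;
            } else {
                node = node.eq;
                i++;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TernarySearchTreeSketch tree = new TernarySearchTreeSketch();
        tree.insert("donau");
        tree.insert("dampf");
        tree.insert("schiff");
        System.out.println(tree.contains("dampf")); // prints true
        System.out.println(tree.contains("damp"));  // prints false (only a prefix)
    }
}
```

A structure like this supports exact lookups in the sketch as written; the prefix or near-match
queries that spelling support would need are left out to keep it short.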

> A tokenfilter to decompose compound words
> -----------------------------------------
>
>                 Key: LUCENE-1166
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1166
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Thomas Peuss
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: CompoundTokenFilter.patch, CompoundTokenFilter.patch, CompoundTokenFilter.patch,
CompoundTokenFilter.patch, CompoundTokenFilter.patch, CompoundTokenFilter.patch, CompoundTokenFilter.patch,
CompoundTokenFilter.patch, de.xml, hyphenation.dtd
>
>
> A tokenfilter to decompose compound words you find in many Germanic languages (like German,
Swedish, ...) into single tokens.
> An example: Donaudampfschiff would be decomposed to Donau, dampf, schiff so that you
can find the word even when you only enter "Schiff".
> I use the hyphenation code from the Apache XML project FOP (http://xmlgraphics.apache.org/fop/)
to do the first step of decomposition. Currently I use the FOP jars directly. I only use a
handful of classes from the FOP project.
> My question now:
> Would it be OK to copy these classes over to the Lucene project (renaming the packages
of course) or should I stick with the dependency on the FOP jars? The FOP code uses the ASF
V2 license as well.
> What do you think?
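
To make the Donaudampfschiff example from the issue description concrete, here is a rough,
self-contained Java sketch of the decomposition idea. It deliberately swaps in a plain
dictionary lookup over all substrings and does NOT use the FOP hyphenation step the patch
relies on, nor any Lucene API. The class name CompoundSplitSketch, the decompose method, and
the minPartLength parameter are all hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative dictionary-based compound splitter (not the patch's algorithm). */
class CompoundSplitSketch {

    /**
     * Returns every substring of the token (at least minPartLength characters long)
     * whose lowercase form appears in the dictionary. Parts keep the casing of the
     * original token and are returned in order of their start offset.
     * The dictionary is assumed to contain lowercase words only.
     */
    static List<String> decompose(String token, Set<String> dictionary, int minPartLength) {
        // Lowercase character by character so offsets stay aligned with the token.
        char[] chars = token.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = Character.toLowerCase(chars[i]);
        }
        String lower = new String(chars);

        List<String> parts = new ArrayList<>();
        for (int start = 0; start + minPartLength <= lower.length(); start++) {
            for (int end = start + minPartLength; end <= lower.length(); end++) {
                if (dictionary.contains(lower.substring(start, end))) {
                    parts.add(token.substring(start, end));
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("donau", "dampf", "schiff"));
        // prints [Donau, dampf, schiff]
        System.out.println(decompose("Donaudampfschiff", dictionary, 3));
    }
}
```

Indexing the parts next to the original token is what would let a query for "Schiff" find
the compound. The brute-force substring scan is quadratic in the token length, which is
acceptable for single words but is only meant as a sketch.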

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



