lucene-java-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: German decompounding/tokenization with Lucene?
Date Sat, 16 Sep 2017 10:46:29 GMT
Sorry, I would if I were on GitHub, but I am not.

Thanks again!
Markus

-----Original message-----
> From:Uwe Schindler <uwe@thetaphi.de>
> Sent: Saturday 16th September 2017 12:45
> To: java-user@lucene.apache.org
> Subject: RE: German decompounding/tokenization with Lucene?
> 
> Send a pull request. :)
> 
> Uwe
> 
> On 16 September 2017 at 12:42:30 CEST, Markus Jelsma <markus.jelsma@openindex.io> wrote:
> >Hello Uwe,
> >
> >Thanks for getting rid of the compounds. The dictionary could be
> >smaller, though: it still contains about 1500 duplicates, and it is
> >also unsorted.
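[Editor's note: that cleanup is only a few lines in a scripting language. A minimal sketch in Python; the file name is a placeholder for the published dictionary file:]

```python
# Sort a plain-text word list and drop duplicate lines, in place.
# "dictionary-de.txt" is a placeholder path; adjust as needed.
from pathlib import Path

def clean_wordlist(path):
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Keep only non-blank lines, deduplicate via a set, then sort.
    unique_sorted = sorted(set(line for line in lines if line.strip()))
    Path(path).write_text("\n".join(unique_sorted) + "\n", encoding="utf-8")
    return len(lines), len(unique_sorted)
```

Running this over the published file would remove the duplicates and leave the list sorted.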
> >
> >Regards,
> >Markus
> >
> >
> >-----Original message-----
> >> From:Uwe Schindler <uwe@thetaphi.de>
> >> Sent: Saturday 16th September 2017 12:16
> >> To: java-user@lucene.apache.org
> >> Subject: RE: German decompounding/tokenization with Lucene?
> >> 
> >> Hi,
> >> 
> >> I published my work on GitHub:
> >> 
> >> https://github.com/uschindler/german-decompounder
> >> 
> >> Have fun. I am not yet 100% sure about the license of the data file.
> >> The original author (Björn Jacke) did not publish any license, but
> >> LibreOffice publishes his files under the LGPL. So to be safe, I
> >> applied the same license to my own work.
> >> 
> >> Uwe
> >> 
> >> -----
> >> Uwe Schindler
> >> Achterdiek 19, D-28357 Bremen
> >> http://www.thetaphi.de
> >> eMail: uwe@thetaphi.de
> >> 
> >> > -----Original Message-----
> >> > From: Uwe Schindler [mailto:uwe@thetaphi.de]
> >> > Sent: Saturday, September 16, 2017 9:49 AM
> >> > To: java-user@lucene.apache.org
> >> > Subject: RE: German decompounding/tokenization with Lucene?
> >> > 
> >> > Hi Michael,
> >> > 
> >> > I ran into this issue just yesterday. I have done this several
> >> > times and built a good dictionary in the meantime.
> >> > 
> >> > I have an example for Solr and Elasticsearch with the same data. It
> >> > uses the HyphenationCompoundWordTokenFilter, but with the
> >> > hyphenation rules ZIP file *and* a dictionary (it's important to
> >> > have both). The dictionary-only approach is just too slow and
> >> > creates wrong matches, too.
> >> > 
> >> > The rules file is the one from the OpenOffice hyphenation files.
> >> > Just take it as is (keep in mind that you need to use the "old"
> >> > version of the ZIP file, not the latest version, as the XML format
> >> > was changed). The dictionary is more important: it should contain
> >> > only "single words", no compounds at all. This is hard to get, but
> >> > there is an igerman98 ispell dictionary available
> >> > (https://www.j3e.de/ispell/igerman98/). It comes in several
> >> > variants, one of which contains only the single non-compound words
> >> > (about 17,000 items). This works for most cases. I converted the
> >> > dictionary a bit, merged some files, and finally lowercased it, and
> >> > now I have a working solution.
> >> > 
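[Editor's note: the conversion steps described above — merging source files, lowercasing, deduplicating — could be sketched roughly like this in Python. The function and file handling are illustrative, not the actual scripts used:]

```python
# Merge several word-list files into one dictionary file:
# lowercase every entry, drop blanks and duplicates, and sort.
from pathlib import Path

def merge_wordlists(sources, target):
    words = set()
    for src in sources:
        for line in Path(src).read_text(encoding="utf-8").splitlines():
            word = line.strip().lower()
            if word:
                words.add(word)
    Path(target).write_text("\n".join(sorted(words)) + "\n",
                            encoding="utf-8")
    return len(words)
```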
> >> > The settings for the hyphenation decompounder filter are (Elasticsearch):
> >> > 
> >> >             "german_decompounder": {
> >> >                "type": "hyphenation_decompounder",
> >> >                "word_list_path": "analysis/dictionary-de.txt",
> >> >                "hyphenation_patterns_path": "analysis/de_DR.xml",
> >> >                "only_longest_match": true,
> >> >                "min_subword_size": 4
> >> >             },
> >> > 
> >> > Important is the "only_longest_match" setting, because our
> >> > dictionary is guaranteed to contain only "single words" (plus some
> >> > words that look like compounds but aren't, because they were glued
> >> > together; compare English: "policeman" is not written "police man",
> >> > because it's a word of its own). So the longest match is always
> >> > safe, as we have a well-maintained dictionary.
> >> > 
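[Editor's note: to illustrate why a dictionary of single words plus a longest-match preference is safe, here is a toy greedy decompounder in Python. This is a deliberately simplified sketch, not Lucene's actual algorithm — the real HyphenationCompoundWordTokenFilter additionally restricts split points to hyphenation positions and emits subtokens alongside the original token:]

```python
# Toy illustration of longest-match decompounding: split a compound by
# repeatedly taking the longest dictionary word that prefixes the rest.
def decompound(word, dictionary, min_subword=4):
    parts = []
    rest = word
    while rest:
        match = None
        # Try the longest candidate prefixes first.
        for end in range(len(rest), min_subword - 1, -1):
            if rest[:end] in dictionary:
                match = rest[:end]
                break
        if match is None:
            return [word]  # give up: emit the original token unchanged
        parts.append(match)
        rest = rest[len(match):]
    return parts
```

With `dictionary = {"polizei", "auto"}`, `decompound("polizeiauto", dictionary)` yields `["polizei", "auto"]`, while a word the dictionary cannot cover falls through unchanged.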
> >> > If you are interested I can send you a ZIP file with both files.
> >> > Maybe I should check them into GitHub, but I have to check the
> >> > licenses first.
> >> > 
> >> > Uwe
> >> > 
> >> > -----
> >> > Uwe Schindler
> >> > Achterdiek 19, D-28357 Bremen
> >> > http://www.thetaphi.de
> >> > eMail: uwe@thetaphi.de
> >> > 
> >> > > -----Original Message-----
> >> > > From: Michael McCandless [mailto:lucene@mikemccandless.com]
> >> > > Sent: Saturday, September 16, 2017 12:58 AM
> >> > > To: Lucene Users <java-user@lucene.apache.org>
> >> > > Subject: German decompounding/tokenization with Lucene?
> >> > >
> >> > > Hello,
> >> > >
> >> > > I need to index documents with German text in Lucene, and I'm
> >wondering
> >> > > how
> >> > > people have done this in the past?
> >> > >
> >> > > Lucene already has a DictionaryCompoundWordTokenFilter ... is
> >this what
> >> > > people use?  Are there good, open-source friendly German
> >dictionaries
> >> > > available?
> >> > >
> >> > > Thanks,
> >> > >
> >> > > Mike McCandless
> >> > >
> >> > > http://blog.mikemccandless.com
> >> > 
> >> > 
> >> >
> >> 
> >> 
> >> 
> >> 
> >
> 
> --
> Uwe Schindler
> Achterdiek 19, 28357 Bremen
> https://www.thetaphi.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

