lucene-java-user mailing list archives

From Uwe Schindler <...@thetaphi.de>
Subject RE: German decompounding/tokenization with Lucene?
Date Sat, 16 Sep 2017 10:51:23 GMT
OK, sorting and deduplicating should be easy with a simple command line. The reason is
that the file was created from two files of Björn Jacke's data. I thought I had deduplicated it...
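
Such a command line could be sketched as follows (the input file names here are
hypothetical stand-ins for the two source files, with a few demo words inlined):

```shell
# Stand-ins for the two source files mentioned above (hypothetical names)
printf 'Haus\nBaum\nHaus\n' > part1.txt
printf 'Baum\nZeit\n' > part2.txt
# Merge both files, sort bytewise, and drop duplicate lines in one pass
cat part1.txt part2.txt | LC_ALL=C sort -u > dictionary-de.txt
```

`sort -u` combines sorting and deduplication, so no separate `uniq` step is needed.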

Uwe

On 16 September 2017 at 12:46:29 CEST, Markus Jelsma <markus.jelsma@openindex.io> wrote:
>Sorry, I would if I were on GitHub, but I am not.
>
>Thanks again!
>Markus
>
>-----Original message-----
>> From: Uwe Schindler <uwe@thetaphi.de>
>> Sent: Saturday 16th September 2017 12:45
>> To: java-user@lucene.apache.org
>> Subject: RE: German decompounding/tokenization with Lucene?
>> 
>> Send a pull request. :)
>> 
>> Uwe
>> 
>> On 16 September 2017 at 12:42:30 CEST, Markus Jelsma
>> <markus.jelsma@openindex.io> wrote:
>> >Hello Uwe,
>> >
>> >Thanks for getting rid of the compounds. The dictionary can be
>> >smaller: it still has about 1,500 duplicates. It is also unsorted.
>> >
>> >Regards,
>> >Markus
>> >
>> >
>> >-----Original message-----
>> >> From: Uwe Schindler <uwe@thetaphi.de>
>> >> Sent: Saturday 16th September 2017 12:16
>> >> To: java-user@lucene.apache.org
>> >> Subject: RE: German decompounding/tokenization with Lucene?
>> >> 
>> >> Hi,
>> >> 
>> >> I published my work on Github:
>> >> 
>> >> https://github.com/uschindler/german-decompounder
>> >> 
>> >> Have fun. I am not yet 100% sure about the license of the data file.
>> >> The original author (Björn Jacke) did not publish any license, but
>> >> LibreOffice publishes his files under the LGPL. So, to be safe, I
>> >> applied the same license to my own work.
>> >> 
>> >> Uwe
>> >> 
>> >> -----
>> >> Uwe Schindler
>> >> Achterdiek 19, D-28357 Bremen
>> >> http://www.thetaphi.de
>> >> eMail: uwe@thetaphi.de
>> >> 
>> >> > -----Original Message-----
>> >> > From: Uwe Schindler [mailto:uwe@thetaphi.de]
>> >> > Sent: Saturday, September 16, 2017 9:49 AM
>> >> > To: java-user@lucene.apache.org
>> >> > Subject: RE: German decompounding/tokenization with Lucene?
>> >> > 
>> >> > Hi Michael,
>> >> > 
>> >> > I had this issue just yesterday. I have done this several times
>> >> > and built a good dictionary in the meantime.
>> >> > 
>> >> > I have an example for Solr or Elasticsearch with the same data. It
>> >> > uses the HyphenationCompoundWordTokenFilter, but with the ZIP file
>> >> > *and* a dictionary (it is important to have both). The
>> >> > dictionary-only variant is just too slow, and it creates wrong
>> >> > matches, too.
>> >> > 
>> >> > The rules file is the one from the OpenOffice hyphenation files.
>> >> > Just take it as is (keep in mind that you need to use the "old"
>> >> > version of the ZIP file, not the latest version, as the XML format
>> >> > was changed). The dictionary is more important: it should contain
>> >> > only the "single words", no compounds at all. This is hard to get,
>> >> > but there is an ngerman98.zip file available with an ispell
>> >> > dictionary (https://www.j3e.de/ispell/igerman98/). This dictionary
>> >> > has several variants; one of them contains only the single
>> >> > non-compound words (about 17,000 items). This works for most
>> >> > cases. I converted the dictionary a bit, merged some files, and
>> >> > finally lowercased it, and now I have a working solution.
>> >> > 
>> >> > The settings for the hyphcompound filter are (Elasticsearch):
>> >> > 
>> >> >             "german_decompounder": {
>> >> >                "type": "hyphenation_decompounder",
>> >> >                "word_list_path": "analysis/dictionary-de.txt",
>> >> >                "hyphenation_patterns_path": "analysis/de_DR.xml",
>> >> >                "only_longest_match": true,
>> >> >                "min_subword_size": 4
>> >> >             },
>> >> > 
>> >> > The "only_longest_match" setting is important, because our
>> >> > dictionary for sure contains only "single words" (plus some words
>> >> > that look like compounds but aren't, as they were glued together;
>> >> > compare English, where "policeman" is not written "police man"
>> >> > because it is a word on its own). So the longest match is always
>> >> > safe, as we have a well-maintained dictionary.
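
Spelled out as a complete index-settings sketch, the filter from the message
above could be wired into an analyzer like this (the analyzer name and the
tokenizer/lowercase chain are illustrative additions, not from the original
message):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "german_decompounder": {
          "type": "hyphenation_decompounder",
          "word_list_path": "analysis/dictionary-de.txt",
          "hyphenation_patterns_path": "analysis/de_DR.xml",
          "only_longest_match": true,
          "min_subword_size": 4
        }
      },
      "analyzer": {
        "german_decompound": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "german_decompounder"]
        }
      }
    }
  }
}
```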
>> >> > 
>> >> > If you are interested, I can send you a ZIP file with both files.
>> >> > Maybe I should check them into GitHub, but I have to check the
>> >> > licenses first.
>> >> > 
>> >> > Uwe
>> >> > 
>> >> > -----
>> >> > Uwe Schindler
>> >> > Achterdiek 19, D-28357 Bremen
>> >> > http://www.thetaphi.de
>> >> > eMail: uwe@thetaphi.de
>> >> > 
>> >> > > -----Original Message-----
>> >> > > From: Michael McCandless [mailto:lucene@mikemccandless.com]
>> >> > > Sent: Saturday, September 16, 2017 12:58 AM
>> >> > > To: Lucene Users <java-user@lucene.apache.org>
>> >> > > Subject: German decompounding/tokenization with Lucene?
>> >> > >
>> >> > > Hello,
>> >> > >
>> >> > > I need to index documents with German text in Lucene, and I'm
>> >> > > wondering how people have done this in the past.
>> >> > >
>> >> > > Lucene already has a DictionaryCompoundWordTokenFilter ... is
>> >> > > this what people use? Are there good, open-source-friendly
>> >> > > German dictionaries available?
>> >> > >
>> >> > > Thanks,
>> >> > >
>> >> > > Mike McCandless
>> >> > >
>> >> > > http://blog.mikemccandless.com
>> >> > 
>> >> > 
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >> 
>> >> 
>> >> 
>> >> 
>> >
>> 
>> --
>> Uwe Schindler
>> Achterdiek 19, 28357 Bremen
>> https://www.thetaphi.de
>

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de