lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: German decompounding/tokenization with Lucene?
Date Sat, 16 Sep 2017 11:03:56 GMT
Hi,

I deduped it. Thanks for the hint!

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Uwe Schindler [mailto:uwe@thetaphi.de]
> Sent: Saturday, September 16, 2017 12:51 PM
> To: java-user@lucene.apache.org
> Subject: RE: German decompounding/tokenization with Lucene?
> 
> Ok, sorting and deduping should be easy with a simple command line. The
> reason is that it was created from 2 files of Björn Jacke's data. I thought
> that I had deduped it...
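Such a command line can be as simple as `sort -u`. A small self-contained demo of the dedupe step (the file name and sample words are made up for illustration):

```shell
# Hypothetical demo of the sort-and-dedupe step: build a small sample
# word list containing duplicates, then sort it and drop the duplicates.
printf 'zug\nhaus\nzug\nbahn\nhaus\n' > words.txt

# sort -u sorts and dedupes in one pass; LC_ALL=C gives a stable byte order.
LC_ALL=C sort -u words.txt -o words.txt

cat words.txt   # -> bahn, haus, zug (each word exactly once)
```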
> 
> Uwe
> 
> Am 16. September 2017 12:46:29 MESZ schrieb Markus Jelsma
> <markus.jelsma@openindex.io>:
> >Sorry, I would if I were on GitHub, but I am not.
> >
> >Thanks again!
> >Markus
> >
> >-----Original message-----
> >> From:Uwe Schindler <uwe@thetaphi.de>
> >> Sent: Saturday 16th September 2017 12:45
> >> To: java-user@lucene.apache.org
> >> Subject: RE: German decompounding/tokenization with Lucene?
> >>
> >> Send a pull request. :)
> >>
> >> Uwe
> >>
> >> Am 16. September 2017 12:42:30 MESZ schrieb Markus Jelsma
> ><markus.jelsma@openindex.io>:
> >> >Hello Uwe,
> >> >
> >> >Thanks for getting rid of the compounds. The dictionary can be smaller:
> >> >it still has about 1500 duplicates. It is also unsorted.
> >> >
> >> >Regards,
> >> >Markus
> >> >
> >> >
> >> >-----Original message-----
> >> >> From:Uwe Schindler <uwe@thetaphi.de>
> >> >> Sent: Saturday 16th September 2017 12:16
> >> >> To: java-user@lucene.apache.org
> >> >> Subject: RE: German decompounding/tokenization with Lucene?
> >> >>
> >> >> Hi,
> >> >>
> >> >> I published my work on Github:
> >> >>
> >> >> https://github.com/uschindler/german-decompounder
> >> >>
> >> >> Have fun. I am not yet 100% sure about the license of the data file.
> >> >> The original author (Björn Jacke) did not publish any license, but
> >> >> LibreOffice publishes his files under the LGPL. So to be safe, I
> >> >> applied the same license to my own work.
> >> >>
> >> >> Uwe
> >> >>
> >> >> -----
> >> >> Uwe Schindler
> >> >> Achterdiek 19, D-28357 Bremen
> >> >> http://www.thetaphi.de
> >> >> eMail: uwe@thetaphi.de
> >> >>
> >> >> > -----Original Message-----
> >> >> > From: Uwe Schindler [mailto:uwe@thetaphi.de]
> >> >> > Sent: Saturday, September 16, 2017 9:49 AM
> >> >> > To: java-user@lucene.apache.org
> >> >> > Subject: RE: German decompounding/tokenization with Lucene?
> >> >> >
> >> >> > Hi Michael,
> >> >> >
> >> >> > I had this issue just yesterday. I did that several times and I
> >> >> > built a good dictionary in the meantime.
> >> >> >
> >> >> > I have an example for Solr or Elasticsearch with the same data. It
> >> >> > uses the HyphenationCompoundWordTokenFilter, but with the ZIP file
> >> >> > *and* the dictionary (it's important to have both). The
> >> >> > dictionary-only approach is just too slow and creates wrong
> >> >> > matches, too.
> >> >> >
> >> >> > The rules file is the one from the OpenOffice hyphenation files.
> >> >> > Just take it as is (keep in mind that you need to use the "old"
> >> >> > version of the ZIP file, not the latest version, as the XML format
> >> >> > was changed). The dictionary is more important: it should only
> >> >> > contain the "single words", no compounds at all. This is hard to
> >> >> > get, but there is an ngerman98.zip file available with an ispell
> >> >> > dictionary (https://www.j3e.de/ispell/igerman98/). This dictionary
> >> >> > has several variants, one of which only contains the single
> >> >> > non-compound words (about 17,000 items). This works for most cases.
> >> >> > I converted the dictionary a bit, merged some files, and finally
> >> >> > lowercased it, and now I have a working solution.
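The conversion described above (merge several files, lowercase, dedupe) can be sketched as one pipeline. The input file names here are invented for the example, not the real igerman98 file names:

```shell
# Hypothetical sketch of the dictionary preparation: merge several
# word-list files, lowercase the words, and dedupe the merged result.
printf 'Haus\nAuto\n' > list1.txt
printf 'haus\nBahn\n' > list2.txt

# tr lowercases (ASCII letters only), and sort -u merges and dedupes
# the stream into the final dictionary file.
cat list1.txt list2.txt | tr '[:upper:]' '[:lower:]' | LC_ALL=C sort -u > dictionary-de.txt
```

Note that `tr` with the POSIX classes only lowercases ASCII letters; for real German data with Ä/Ö/Ü a locale-aware step (for example GNU awk's `tolower()` under a UTF-8 locale) is needed instead.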
> >> >> >
> >> >> > The settings for the hyphenation compound filter are (Elasticsearch):
> >> >> >
> >> >> >             "german_decompounder": {
> >> >> >                "type": "hyphenation_decompounder",
> >> >> >                "word_list_path": "analysis/dictionary-de.txt",
> >> >> >                "hyphenation_patterns_path": "analysis/de_DR.xml",
> >> >> >                "only_longest_match": true,
> >> >> >                "min_subword_size": 4
> >> >> >             },
> >> >> >
> >> >> > The "only_longest_match" setting is important, because our
> >> >> > dictionary for sure only contains "single words" (plus some words
> >> >> > that look like compounds but aren't, as they were glued together;
> >> >> > compare English: "policeman" is not written "police man", because
> >> >> > it's a word of its own). So the longest match is always safe, as we
> >> >> > have a well-maintained dictionary.
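For context, a token filter defined this way only takes effect once an analyzer references it. A possible custom analyzer wiring the decompounder into a chain could look like the sketch below; the analyzer name and the companion filters are my assumptions, not part of the original mail. Since the dictionary was lowercased, the `lowercase` filter runs before the decompounder:

```json
"analysis": {
  "filter": {
    "german_decompounder": {
      "type": "hyphenation_decompounder",
      "word_list_path": "analysis/dictionary-de.txt",
      "hyphenation_patterns_path": "analysis/de_DR.xml",
      "only_longest_match": true,
      "min_subword_size": 4
    }
  },
  "analyzer": {
    "german_custom": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": ["lowercase", "german_decompounder", "german_normalization"]
    }
  }
}
```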
> >> >> >
> >> >> > If you are interested, I can send you a ZIP file with both files.
> >> >> > Maybe I should check them into GitHub, but I have to check the
> >> >> > licenses first.
> >> >> >
> >> >> > Uwe
> >> >> >
> >> >> > -----
> >> >> > Uwe Schindler
> >> >> > Achterdiek 19, D-28357 Bremen
> >> >> > http://www.thetaphi.de
> >> >> > eMail: uwe@thetaphi.de
> >> >> >
> >> >> > > -----Original Message-----
> >> >> > > From: Michael McCandless [mailto:lucene@mikemccandless.com]
> >> >> > > Sent: Saturday, September 16, 2017 12:58 AM
> >> >> > > To: Lucene Users <java-user@lucene.apache.org>
> >> >> > > Subject: German decompounding/tokenization with Lucene?
> >> >> > >
> >> >> > > Hello,
> >> >> > >
> >> >> > > I need to index documents with German text in Lucene, and I'm
> >> >> > > wondering how people have done this in the past?
> >> >> > >
> >> >> > > Lucene already has a DictionaryCompoundWordTokenFilter ... is
> >> >> > > this what people use?  Are there good, open-source-friendly
> >> >> > > German dictionaries available?
> >> >> > >
> >> >> > > Thanks,
> >> >> > >
> >> >> > > Mike McCandless
> >> >> > >
> >> >> > > http://blog.mikemccandless.com
> >> >> >
> >> >> >
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >
> >>
> >> --
> >> Uwe Schindler
> >> Achterdiek 19, 28357 Bremen
> >> https://www.thetaphi.de
> >
> 
> --
> Uwe Schindler
> Achterdiek 19, 28357 Bremen
> https://www.thetaphi.de


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

