From: Dawid Weiss
Date: Sat, 16 Sep 2017 09:29:33 +0200
Subject: Re: German decompounding/tokenization with Lucene?
To: Lucene Users
Reply-To: java-user@lucene.apache.org

Hi Mike.

Search lucene dev archives. I did write a decompounder with Daniel Naber. The quality was not ideal but perhaps better than nothing. Also, Daniel works on languagetool.org? They should have something in there.

Dawid

On Sep 16, 2017 1:58 AM, "Michael McCandless" wrote:

> Hello,
>
> I need to index documents with German text in Lucene, and I'm wondering how
> people have done this in the past?
>
> Lucene already has a DictionaryCompoundWordTokenFilter ... is this what
> people use? Are there good, open-source friendly German dictionaries
> available?
>
> Thanks,
>
> Mike McCandless
>
> http://blog.mikemccandless.com