From: Pasquale Imbemba
To: java-user@lucene.apache.org
Date: Sat, 23 Sep 2006 10:24:16 +0200
Subject: Re: Analysis/tokenization of compound words
Message-ID: <4514EF30.60907@gmail.com>
In-Reply-To: <4514EDDF.50308@gmail.com>

Otis,

I forgot to mention that I use Lucene to retrieve nouns from the lexicon.

Pasquale

Pasquale Imbemba wrote:
> Hi Otis,
>
> I am completing my bachelor thesis at the Free University of Bolzano
> (www.unibz.it). My project is exactly about what you need: a word
> splitter for German compound words. Raffaella Bernardi, who is reading
> in CC, is my supervisor.
> As someone on the Lucene mailing list has already suggested, I have
> used the lexicon of German nouns extracted from Morphy
> (http://www.wolfganglezius.de/doku.php?id=public:cl:morphy). As for
> the splitting algorithm, I have used the one Maarten de Rijke and
> Christof Monz published in /Shallow Morphological Analysis in
> Monolingual Information Retrieval for Dutch, German and Italian/
> (website here, document here).
> I did some testing and made minor improvements to it (as I needed to
> "adjust" it for the solution I was working on) and could send you my
> thesis paper (currently still in draft state), which contains
> statistical data on correctness.
>
> Let me know
> Pasquale
>
> Otis Gospodnetic wrote:
>> Hi,
>>
>> How do people typically analyze/tokenize text with compounds (e.g.
>> German)? I took a look at GermanAnalyzer hoping to see how one can
>> deal with that, but it turns out GermanAnalyzer doesn't treat
>> compounds in any special way at all.
>>
>> One way to go about this is to have a word dictionary and a tokenizer
>> that processes input one character at a time, looking for a word
>> match in the dictionary after each processed character. Then,
>> CompoundWordLikeThis could be broken down into multiple tokens/words
>> and returned as a set of tokens at the same position. However,
>> somehow this doesn't strike me as a very smart and fast approach.
>> What are some better approaches?
>> If anyone has implemented anything that deals with this problem, I'd
>> love to hear about it.
>>
>> Thanks,
>> Otis
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>

-- 
"As far as the laws of mathematics refer to reality, they are not
certain, as far as they are certain, they do not refer to reality."
(Albert Einstein)
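
For illustration only, the dictionary-driven splitting discussed in this thread could be sketched roughly as below. This is not the code from the thesis, nor the algorithm as published by Monz and de Rijke. It is a minimal recursive split under stated assumptions: the class name CompoundSplitter, the minimum part length of 3, and the handling of a linking "s" between parts are all hypothetical choices, and a plain in-memory Set stands in for the Morphy noun lexicon (which, per the message above, is actually looked up through Lucene).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of a lexicon-based splitter for German compounds. */
public class CompoundSplitter {

    // Assumption: parts shorter than this are never accepted, to limit spurious splits.
    private static final int MIN_PART_LENGTH = 3;

    /**
     * Returns the lexicon words the (lowercased) input decomposes into,
     * or a single-element list holding the lowercased input if no full
     * decomposition exists. The lexicon is assumed to be lowercase.
     */
    public static List<String> split(String word, Set<String> lexicon) {
        String lower = word.toLowerCase(Locale.GERMAN);
        List<String> parts = decompose(lower, lexicon);
        return parts != null ? parts : Collections.singletonList(lower);
    }

    /** Returns the parts of s, or null if s cannot be covered by lexicon entries. */
    private static List<String> decompose(String s, Set<String> lexicon) {
        // A word that is itself in the lexicon is never split further.
        if (lexicon.contains(s)) {
            List<String> single = new ArrayList<>();
            single.add(s);
            return single;
        }
        // Try every split point, shortest head first.
        for (int i = MIN_PART_LENGTH; i <= s.length() - MIN_PART_LENGTH; i++) {
            String head = s.substring(0, i);
            if (!lexicon.contains(head)) {
                continue;
            }
            String tail = s.substring(i);
            List<String> rest = decompose(tail, lexicon);
            // Assumption: allow one linking "s" between parts (e.g. Arbeit-s-zimmer).
            if (rest == null && tail.startsWith("s") && tail.length() > MIN_PART_LENGTH) {
                rest = decompose(tail.substring(1), lexicon);
            }
            if (rest != null) {
                List<String> result = new ArrayList<>();
                result.add(head);
                result.addAll(rest);
                return result;
            }
        }
        return null;
    }
}
```

With a lexicon containing "haus", "boot", "arbeit" and "zimmer", split("Hausboot", lexicon) yields the parts "haus" and "boot", and split("Arbeitszimmer", lexicon) yields "arbeit" and "zimmer". In an analyzer, these parts could then be emitted at the same position as the original token, as Otis suggests. The sketch takes the first split it finds, so an ambiguous compound may be divided differently than a human reader would expect.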