lucene-solr-user mailing list archives

From Walter Underwood <wun...@wunderwood.org>
Subject Re: Lexical analysis tools for German language data
Date Thu, 12 Apr 2012 15:16:25 GMT
German noun decompounding is a little more complicated than it might seem.

There can be transformations or inflections, like the linking "s" in "Weihnachtsbaum" (Weihnachten/Baum).

Internal nouns should be recapitalized, like "Baum" above.

Some compounds probably should not be decompounded, like "Fahrrad" (fahren/Rad). With a dictionary-based
stemmer, you might decide to skip decompounding for any word that is itself in the dictionary.

Verbs get more complicated inflections, and might need to be decapitalized, like "fahren"
above.

Und so weiter.

Note that highlighting gets pretty weird when you are matching only part of a word.

Luckily, a lot of compounds are simple, and you could well get a measurable improvement with
a very simple algorithm. There isn't anything complicated about compounds like Orgelmusik
or Netzwerkbetreuer.
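
As a rough sketch of what such a very simple algorithm might look like: the Python below is an illustration only, not code from Lucene, Solr, or any product. The function name split_compound, the tiny word list, and the protected set are all made up for the example. It assumes two-part noun compounds, an optional linking "s", and a lowercase dictionary of known words.

```python
def split_compound(word, dictionary, protected=frozenset(), min_part=3):
    """Split `word` into two dictionary words, or return [word] unchanged.

    Hypothetical helper for illustration: `dictionary` and `protected`
    are sets of lowercase words supplied by the caller.
    """
    lower = word.lower()
    # Lexicalized compounds such as "Fahrrad" are left alone.
    if lower in protected:
        return [word]
    # Try every split point that leaves at least `min_part` letters per side.
    for i in range(min_part, len(lower) - min_part + 1):
        head, tail = lower[:i], lower[i:]
        if tail not in dictionary:
            continue
        if head in dictionary:
            # Assumes both parts are nouns, so both are recapitalized.
            return [head.capitalize(), tail.capitalize()]
        # Allow a linking "s" between the parts, as in "Weihnachtsbaum".
        if head.endswith("s") and head[:-1] in dictionary:
            return [head[:-1].capitalize(), tail.capitalize()]
    return [word]


# Toy data, assumed for the example only.
WORDS = {"orgel", "musik", "netzwerk", "betreuer", "wind", "jacke",
         "weihnacht", "baum", "rad"}
PROTECTED = {"fahrrad"}

for w in ["Orgelmusik", "Netzwerkbetreuer", "Windjacke", "Weihnachtsbaum", "Fahrrad"]:
    print(w, "->", split_compound(w, WORDS, PROTECTED))
```

Note what it does not do: it recovers the stem "Weihnacht" rather than "Weihnachten", it never decapitalizes verb parts, and it stops at the first two-part split it finds. Those are exactly the harder cases listed above.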

The Basis Technology linguistic analyzers aren't cheap or small, but they work well. 

wunder

On Apr 12, 2012, at 3:58 AM, Paul Libbrecht wrote:

> Bernd,
> 
> can you please say a little more?
> I think it is fine for this list to carry descriptions of commercial
> solutions that answer a request made on the list.
> 
> Is there any product at BASIS Tech that provides a compound analyzer with
> a big dictionary of decomposed compounds in German? If yes, for which
> domain? The Google Search result (I wonder if this is politically correct
> to not have yours ;-)) shows me that a fair amount of work has been done in
> this direction (e.g. Gärten to match Garten), but a precise answer to this
> question would be more helpful!
> 
> paul
> 
> 
> On 12 Apr 2012, at 12:46, Bernd Fehling wrote:
> 
>> 
>> You might have a look at:
>> http://www.basistech.com/lucene/
>> 
>> 
>> On 12.04.2012 at 11:52, Michael Ludwig wrote:
>>> Given an input of "Windjacke" (probably "wind jacket" in English), I'd
>>> like the code that prepares the data for the index (tokenizer etc) to
>>> understand that this is a "Jacke" ("jacket") so that a query for "Jacke"
>>> would include the "Windjacke" document in its result set.
>>> 
>>> It appears to me that such an analysis requires a dictionary-backed
>>> approach, which doesn't have to be perfect at all; a list of the most
>>> common 2000 words would probably do the job and fulfil a criterion of
>>> reasonable usefulness.
>>> 
>>> Do you know of any implementation techniques or working implementations
>>> to do this kind of lexical analysis for German language data? (Or other
>>> languages, for that matter?) What are they, where can I find them?
>>> 
>>> I'm sure there is something out (commercial or free) because I've seen
>>> lots of engines grokking German and the way it builds words.
>>> 
>>> Failing that, what are the proper terms to refer to these techniques so
>>> you can search more successfully?
>>> 
>>> Michael




