lucene-java-user mailing list archives

From Günter Kukies <guenter.kuk...@heuft.com>
Subject Re: German zusammengesetzte Hauptwörter
Date Mon, 19 May 2003 10:07:25 GMT
Hi,

this is my naive TermEnumerator:


import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

/**
 * Subclass of FilteredTermEnum for enumerating all terms that contain the
 * text of the specified filter term as a substring, e.g. "ausweis" in
 * "Mitarbeiterausweishalter".
 * <p>
 * Term enumerations are always ordered by Term.compareTo().  Each term in
 * the enumeration is greater than all that precede it.
 */
public class SingleWordTermEnum extends FilteredTermEnum {
    Term searchTerm;
    String field = "";
    String text = "";

    boolean fieldMatch = false;
    boolean endEnum = false;

    /** Creates new SingleWordTermEnum */
    public SingleWordTermEnum(IndexReader reader, Term term) throws IOException {
        super(reader, term);
        searchTerm = term;
        field = searchTerm.field();
        text = searchTerm.text();
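        // Start at the first term of the field: a term that sorts before the
        // search text can still contain it (e.g. "abendausweis" < "ausweis").
        setEnum(reader.terms(new Term(field, "")));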
    }

    protected final boolean termCompare(Term term) {
        if (field.equals(term.field())) {
            String searchText = term.text();
            return (searchText.indexOf(text)>=0);
        }
        endEnum = true;
        return false;
    }

    public final float difference() {
        return 1.0f;
    }

    public final boolean endEnum() {
        return endEnum;
    }

    public void close() throws IOException {
        super.close();
        searchTerm = null;
        field = null;
        text = null;
    }
}


Günter

----- Original Message -----
From: "Karsten Konrad" <Karsten.Konrad@xtramind.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Friday, May 16, 2003 5:32 PM
Subject: AW: German zusammengesetzte Hauptwörter



Hi,

without any sophisticated linguistic techniques (German word
decomposition is not a simple topic), there are some other options
as well:

(1) Use an analyzer that stores each word both in the normal
direction and backwards at the same token position. I.e., the word
ausweis will be indexed both as 'ausweis' and 'siewsua'. Then, for
every word you find in a query (without boolean operations) that
has a * at the beginning, simply reverse the word and move the *
to the end. Thus, *ausweis becomes siewsua* and will find, for
instance, siewsualanosrep; that is personalausweis backwards.
However, highlighting could be trickier than usual. Another
drawback: the number of tokens in the index doubles.
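
To make the reversal idea in (1) concrete, here is a minimal sketch in plain Java, without any Lucene classes. The class and method names (ReversedTokenSketch, indexForms, rewriteLeadingWildcard) are made up for illustration; in a real setup the index-time step would sit inside a custom analyzer or token filter, and the query rewrite would run before the query string reaches the query parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: shows only the string handling.
public class ReversedTokenSketch {

    /** Returns the token reversed, e.g. "ausweis" becomes "siewsua". */
    static String reverse(String token) {
        return new StringBuilder(token).reverse().toString();
    }

    /** Index-time step: emit the lowercased token and its reversed form. */
    static List<String> indexForms(String token) {
        String lower = token.toLowerCase(Locale.GERMAN);
        List<String> forms = new ArrayList<>();
        forms.add(lower);
        forms.add(reverse(lower));
        return forms;
    }

    /**
     * Query-time step: a term whose only wildcard is a leading '*' is
     * reversed and gets the '*' at the end. All other terms pass through.
     */
    static String rewriteLeadingWildcard(String queryTerm) {
        boolean leadingStarOnly = queryTerm.length() > 1
                && queryTerm.charAt(0) == '*'
                && queryTerm.indexOf('*', 1) < 0;
        if (leadingStarOnly) {
            return reverse(queryTerm.substring(1)) + "*";
        }
        return queryTerm;
    }

    public static void main(String[] args) {
        System.out.println(indexForms("Personalausweis"));
        System.out.println(rewriteLeadingWildcard("*ausweis"));
    }
}
```

The rewritten term siewsua* is an ordinary prefix query, and it matches the reversed form siewsualanosrep that was stored for personalausweis.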

(2) Use a term enumerator. The WildcardTermEnum gives a good
example of how you can iterate over all terms of an index
and find matching terms for any kind of similarity operation.
Thus, you could take your search word and return every term
in which the search word occurs. On a larger index with lots of
tokens, this operation could be expensive with a naive
implementation, but it is worth a try.
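
The scan described in (2) boils down to the following sketch. It uses a plain sorted set of strings as a stand-in for the term dictionary of one field; SubstringTermScan and termsContaining are hypothetical names, not Lucene API. The point is that the loop has to visit every term, which is why the cost grows with the number of distinct terms in the index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical stand-in for a term dictionary, for illustration only.
public class SubstringTermScan {

    /** Visits every term in sorted order and keeps those containing the fragment. */
    static List<String> termsContaining(SortedSet<String> terms, String fragment) {
        List<String> matches = new ArrayList<>();
        for (String term : terms) {   // linear in the number of distinct terms
            if (term.contains(fragment)) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>(Arrays.asList(
                "ausweisinhaber", "halter",
                "mitarbeiterausweishalter", "personalausweis"));
        System.out.println(termsContaining(terms, "ausweis"));
    }
}
```

The matching terms could then be combined into a boolean OR query, which is roughly what the wildcard machinery does with the terms its enumerator returns.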

Regards,

Karsten


-----Original Message-----
From: Test2.Schwab@Linde-LE.com [mailto:Test2.Schwab@Linde-LE.com]
Sent: Friday, May 16, 2003 12:00
To: Lucene Users List
Subject: Re: German zusammengesetzte Hauptwörter



Günter,

As far as I know, you cannot search with wildcards on both sides, like
*ausweis*.
You can search for ausweis*. However, I think you won't get the results you
are looking for, because your words have something in front of "ausweis".
If you have words like
Ausweisinhaber
Ausweiskontrolle
you can search with ausweis*
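
A small sketch of why ausweis* is cheap while *ausweis* is not: in a sorted term list, all terms sharing a prefix sit next to each other, so they can be read as one contiguous range. This is plain Java with a TreeSet standing in for the term dictionary; PrefixTermLookup and termsWithPrefix are made-up names. It assumes the terms are already lowercased and that no term contains the character U+FFFF.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical stand-in for a term dictionary, for illustration only.
public class PrefixTermLookup {

    /** Returns the contiguous range of terms that start with the prefix. */
    static SortedSet<String> termsWithPrefix(SortedSet<String> terms, String prefix) {
        // Lower bound is inclusive, upper bound is exclusive.
        return terms.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>(Arrays.asList(
                "ausweisinhaber", "ausweiskontrolle", "mitarbeiterausweishalter"));
        System.out.println(termsWithPrefix(terms, "ausweis"));
    }
}
```

Mitarbeiterausweishalter is not in that range, because it sorts under "m", which is the limitation described above.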

Regards,
Arsineh



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message