lucene-java-user mailing list archives

From Vladimir Gubarkov <xon...@gmail.com>
Subject Two questions on RussianAnalyzer
Date Thu, 19 Apr 2012 11:26:11 GMT
Hi,

After upgrading to Lucene 3.6 I noticed that the new RussianAnalyzer
tokenizes text differently than before.

Please see this example:

    private List<String> getTokens(Analyzer theAnalyzer, String str) throws IOException {
        final TokenStream tokenStream =
                theAnalyzer.tokenStream(MessageFields.BODY, new StringReader(str));
        final CharTermAttribute termAttribute =
                tokenStream.getAttribute(CharTermAttribute.class);
        final List<String> tokens = new LinkedList<String>();

        tokenStream.reset();
        try {
            while (tokenStream.incrementToken()) {
                // toString() copies only the valid part of the term buffer
                tokens.add(termAttribute.toString());
            }
            tokenStream.end();
        } finally {
            // release the stream even if tokenization fails
            tokenStream.close();
        }
        return tokens;
    }

    @Test
    public void testDots() throws IOException {
        final String str = "aaa.bbb.com:8888 " +
                "a,b;c/d'e$f&g*h+i-j%k/l_m#n@o!p?q>r\"s~t(u`v|z}y\\z";

        System.out.println("New analyzer:");
        System.out.println(getTokens(new RussianAnalyzer(Version.LUCENE_36), str));

        System.out.println("Old analyzer:");
        System.out.println(getTokens(new RussianAnalyzer(Version.LUCENE_30), str));
    }

This prints:

New analyzer:
[aaa.bbb.com, 8888, a, b, c, d'e, f, g, h, i, j, k, l_m, n, o, p, q,
r, s, t, u, v, z, y, z]
Old analyzer:
[aaa, bbb, com, 8888, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p,
q, r, s, t, u, v, z, y, z]

Please note the differences.

What bothers me most about the new behaviour is that I used to search
by subdomain, for example
bbb.com:8888
and get results containing www.bbb.com:8888, aaa.bbb.com:8888 and
so on. Now I get 0 results.
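
To illustrate why the subdomain search stops matching, here is a minimal
plain-Java sketch. It does not use Lucene and does not reproduce the rules
of either tokenizer; the two split methods are crude stand-ins that only
mimic the outputs shown above for the host:port part of the input. The
class and method names are hypothetical, and comparing consecutive tokens
is an assumed simplification of how a multi-token query gets matched.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SubdomainMatchSketch {
    // Stand-in for the old output: every non-alphanumeric character splits.
    static List<String> splitOnPunctuation(String text) {
        return Arrays.asList(text.split("[^\\p{L}\\p{N}]+"));
    }

    // Stand-in for the new output on this input: dots stay inside a token,
    // so the host name is kept whole and only ':' separates the port.
    static List<String> keepDots(String text) {
        return Arrays.asList(text.split("[^\\p{L}\\p{N}.]+"));
    }

    // True if the query tokens occur as a consecutive run in the document tokens.
    static boolean matches(List<String> docTokens, List<String> queryTokens) {
        return Collections.indexOfSubList(docTokens, queryTokens) >= 0;
    }

    public static void main(String[] args) {
        String doc = "aaa.bbb.com:8888";
        String query = "bbb.com:8888";

        // Old-style tokens: [aaa, bbb, com, 8888] contains [bbb, com, 8888]
        System.out.println(matches(splitOnPunctuation(doc), splitOnPunctuation(query)));

        // New-style tokens: [aaa.bbb.com, 8888] does not contain [bbb.com, 8888]
        System.out.println(matches(keepDots(doc), keepDots(query)));
    }
}
```

With the host kept as a single token, "bbb.com" is no longer a token of the
indexed text, so a query built from it has nothing to match against.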

My questions are: 1) is this change by design (not a mistake), and
2) is creating the analyzer with Version.LUCENE_30 the only way to get
the old behaviour back?

The other problem with RussianAnalyzer concerns the letter Yo
(http://en.wikipedia.org/wiki/Yo_(Cyrillic)), which in Russian is often
replaced by the letter Ye (http://en.wikipedia.org/wiki/Ye_(Cyrillic));
words that differ only in this letter are considered the same.
What I want is for a search for a word spelled with yo to also find
the same word spelled with ye (and vice versa).
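
The intended behaviour can be sketched in plain Java, independent of
Lucene: if the same folding is applied to the indexed text and to the
query, both spellings end up as the same term. The class and method names
below are hypothetical, and the letters are written as Unicode escapes
only to keep the source ASCII-safe.

```java
class YoFoldingSketch {
    // Replace yo (U+0451 / U+0401) with ye (U+0435 / U+0415), preserving case.
    static String foldYo(String text) {
        return text.replace('\u0451', '\u0435')   // lowercase yo -> lowercase ye
                   .replace('\u0401', '\u0415');  // uppercase Yo -> uppercase Ye
    }

    // Two words match if they are equal after folding both sides.
    static boolean matches(String indexedWord, String queryWord) {
        return foldYo(indexedWord).equals(foldYo(queryWord));
    }

    public static void main(String[] args) {
        String withYo = "\u0451\u043b\u043a\u0430"; // "yolka" spelled with yo
        String withYe = "\u0435\u043b\u043a\u0430"; // the same word spelled with ye

        System.out.println(matches(withYo, withYe)); // query with ye finds yo
        System.out.println(matches(withYe, withYo)); // query with yo finds ye
    }
}
```

The important part is that the folding is applied on both sides; folding
only at index time or only at query time would still miss one direction.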

What I'm currently doing is roughly the following:

// NOTE: I have to define my class in this package, because the method
// RussianAnalyzer.createComponents is protected
package org.apache.lucene.analysis.ru;

public class RussianAnalyzerImproved extends ReusableAnalyzerBase {
    private RussianAnalyzer russianAnalyzer = new RussianAnalyzer(LuceneVersion.VERSION);

    @Override
    protected Reader initReader(Reader reader) {
        return new YoCharFilter(CharReader.get(reader));
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        return russianAnalyzer.createComponents(fieldName, reader);
    }
}

public class YoCharFilter extends CharFilter {
    public YoCharFilter(CharStream in) {
        super(in);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        final int charsRead = super.read(cbuf, off, len);
        if (charsRead > 0) {
            final int end = off + charsRead;
            while (off < end) {
                if (cbuf[off] == 'ё' || cbuf[off] == 'Ё')
                    cbuf[off] = 'е';
                off++;
            }
        }
        return charsRead;
    }
}

But I'm not sure this is the correct approach.
What do you think?
Would it make sense to add a configuration option to
RussianAnalyzer itself (whether or not to distinguish yo and ye)?


Sincerely yours,
Vladimir
