Date: Thu, 19 Apr 2012 15:26:11 +0400
Subject: Two questions on RussianAnalyzer
From: Vladimir Gubarkov
To: java-user@lucene.apache.org

Hi,

After updating to Lucene 3.6 I noticed that the new RussianAnalyzer tokenizes text differently than before. Here is an example:

    private List<String> getTokens(Analyzer theAnalyzer, String str) throws IOException {
        final TokenStream tokenStream =
                theAnalyzer.tokenStream(MessageFields.BODY, new StringReader(str));
        tokenStream.reset();
        final CharTermAttribute termAttribute =
                tokenStream.getAttribute(CharTermAttribute.class);

        // Collect every term the analyzer emits, in order.
        List<String> tokens = new LinkedList<String>();
        while (tokenStream.incrementToken()) {
            final String term = new String(termAttribute.buffer(), 0, termAttribute.length());
            tokens.add(term);
            // System.out.println(">>" + term);
        }
        return tokens;
    }

    @Test
    public void testDots() throws IOException {
        final String str = "aaa.bbb.com:8888 "
                + "a,b;c/d'e$f&g*h+i-j%k/l_m#n@o!p?q>r\"s~t(u`v|z}y\\z";
        System.out.println("New analyzer:");
        System.out.println(getTokens(new RussianAnalyzer(Version.LUCENE_36), str));
        System.out.println("Old analyzer:");
        System.out.println(getTokens(new RussianAnalyzer(Version.LUCENE_30), str));
    }

This prints:

    New analyzer:
    [aaa.bbb.com, 8888, a, b, c, d'e, f, g, h, i, j, k, l_m, n, o, p, q, r, s, t, u, v, z, y, z]
    Old analyzer:
    [aaa, bbb, com, 8888, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, z, y, z]

Please note the differences: the new analyzer keeps aaa.bbb.com, d'e and l_m as single tokens, where the old one split them at the dot, the apostrophe and the underscore.

The change that hurts me most is this: I used to search by subdomain, e.g. bbb.com:8888, and get results containing www.bbb.com:8888, aaa.bbb.com:8888 and so on. Now I get 0 results.

My questions are:

1) Is this change by design (not a mistake)?
2) Is using Version.LUCENE_30 when creating the analyzer the only way to get the old behaviour back?

The other problem with RussianAnalyzer concerns the letter Yo (http://en.wikipedia.org/wiki/Yo_(Cyrillic)), which in Russian is often replaced by the letter Ye (http://en.wikipedia.org/wiki/Ye_(Cyrillic)); words that differ only in this letter are considered the same. What I want is for a search for a word containing yo to also find the same word written with ye (and vice versa).

What I'm currently doing is roughly this:

    // NOTE: I have to define my class in this package, because the method
    // russianAnalyzer.createComponents is protected
    package org.apache.lucene.analysis.ru;

    public class RussianAnalyzerImproved extends ReusableAnalyzerBase {

        private RussianAnalyzer russianAnalyzer = new RussianAnalyzer(LuceneVersion.VERSION);

        // Replace yo with ye before the text reaches the tokenizer.
        @Override
        protected Reader initReader(Reader reader) {
            return new YoCharFilter(CharReader.get(reader));
        }

        // Everything else is delegated to the stock RussianAnalyzer.
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            return russianAnalyzer.createComponents(fieldName, reader);
        }
    }

    public class YoCharFilter extends CharFilter {

        public YoCharFilter(CharStream in) {
            super(in);
        }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            final int charsRead = super.read(cbuf, off, len);
            if (charsRead > 0) {
                final int end = off + charsRead;
                while (off < end) {
                    // Both lowercase and uppercase yo become lowercase ye.
                    if (cbuf[off] == 'ё' || cbuf[off] == 'Ё')
                        cbuf[off] = 'е';
                    off++;
                }
            }
            return charsRead;
        }
    }

But I'm not sure this is the correct approach. What do you think? Would it make sense to add a configuration option to RussianAnalyzer itself (to distinguish yo and ye or not)?

Sincerely yours,
Vladimir
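
P.S. For anyone who wants to try the yo-to-ye folding without Lucene on the classpath, here is a minimal sketch of the same idea using only java.io. The class name YoFoldingReader and its helper methods fold and foldAll are made up for illustration and are not part of Lucene. Unlike the filter above, this version keeps the case (Ё becomes Е rather than е), which is an arbitrary choice on my side. Because one char is always replaced by exactly one char, character offsets stay unchanged. The decomposed spelling of yo (е followed by the combining diaeresis U+0308) is not handled here.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper, not part of Lucene: folds Cyrillic yo into ye while reading. */
public class YoFoldingReader extends FilterReader {

    public YoFoldingReader(Reader in) {
        super(in);
    }

    /** Maps yo to ye, preserving case; every other char passes through unchanged. */
    public static char fold(char c) {
        switch (c) {
            case '\u0451': return '\u0435'; // lowercase yo -> lowercase ye
            case '\u0401': return '\u0415'; // uppercase yo -> uppercase ye
            default: return c;
        }
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c < 0 ? c : fold((char) c); // pass the end-of-stream marker through
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int charsRead = super.read(cbuf, off, len);
        // The loop body never runs when charsRead is -1 (end of stream).
        for (int i = off; i < off + charsRead; i++) {
            cbuf[i] = fold(cbuf[i]);
        }
        return charsRead;
    }

    /** Convenience wrapper: pushes a whole string through the reader. */
    public static String foldAll(String s) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new YoFoldingReader(new StringReader(s))) {
            char[] buf = new char[64];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two Russian words starting with yo, written with escapes so that
        // the source file encoding does not matter.
        System.out.println(foldAll("\u0451\u043b\u043a\u0430 \u0401\u0436"));
    }
}
```

The same reader would have to be applied both when indexing and when parsing queries, otherwise a query containing yo would still miss documents indexed with ye.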