From: Ilya Zavorin
To: java-user@lucene.apache.org
Subject: Efficient string lookup using Lucene
Date: Fri, 24 Aug 2012 19:48:45 +0000

Hi Everyone,

I have the following task. I have a set of documents in multiple languages. I don't know what these languages are. Any given doc may contain text in several languages mixed up. So to me these are just a bunch of Unicode text files.

What I need is to implement an efficient EXACT string lookup. That is, I need to be able to find ANY Unicode string exactly as it appears. I do not care about language-specific modifications of the string. That is, if I search for a string "run", I do not need to find "ran" but I do want to find it in all of these strings below:

Fox is running fast
!%#^&$run!$!%@&$#
run,run

Is there a way of using StandardAnalyzer or any other analyzer and the corresponding query parser to find these? Again, my queries might be more or less random Unicode sequences and I need to find all their occurrences in the text.

Essentially, what I am trying to do is implement substring matching more efficiently than using Java's standard substring matching methods.

Thanks!

Ilya Zavorin
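
For reference, the baseline behaviour described above can be sketched with plain Java string methods. This is only the linear scan that an index is meant to beat, not a Lucene solution; the class name `NaiveSubstringLookup` and the method `findAll` are hypothetical names chosen for illustration, and offsets are counted in UTF-16 code units.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: the naive baseline, not a Lucene-based implementation.
class NaiveSubstringLookup {

    // Returns the start offset (in UTF-16 code units) of every occurrence
    // of needle in text, including overlapping occurrences.
    public static List<Integer> findAll(String text, String needle) {
        List<Integer> hits = new ArrayList<>();
        if (needle.isEmpty()) {
            return hits; // assumption: an empty query matches nothing
        }
        int from = 0;
        while (true) {
            int at = text.indexOf(needle, from);
            if (at < 0) {
                break;
            }
            hits.add(at);
            from = at + 1; // advance by one so overlapping matches are kept
        }
        return hits;
    }

    public static void main(String[] args) {
        // The three sample strings from the message above.
        String[] docs = { "Fox is running fast", "!%#^&$run!$!%@&$#", "run,run" };
        for (String doc : docs) {
            System.out.println(doc + " -> " + findAll(doc, "run"));
        }
    }
}
```

The match is purely on the character sequence, so "run" is found inside "running" and between punctuation, while "ran" is never returned. Each query scans every document in full, which is the cost an index-based lookup would need to avoid.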