Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (athena.apache.org: local policy)
From: Ilya Zavorin <izavorin@caci.com>
To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
Subject: Efficient string lookup using Lucene
Thread-Topic: Efficient string lookup using Lucene
Thread-Index: Ac2CMXhOot/bT+OdTRyEGD8oonSgTQ==
Date: Fri, 24 Aug 2012 19:48:45 +0000
Message-ID: <A57498EDEC10C64781EA0F7DBA665CEF27C874DF@ex2010mb01-1.caci.com>
Accept-Language: en-US
Content-Language: en-US
Content-Type: multipart/alternative;
	boundary="_000_A57498EDEC10C64781EA0F7DBA665CEF27C874DFex2010mb011caci_"
MIME-Version: 1.0

--_000_A57498EDEC10C64781EA0F7DBA665CEF27C874DFex2010mb011caci_
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

Hi Everyone,

I have the following task. I have a set of documents in multiple languages,
and I don't know which languages they are. Any given document may contain
text in several languages mixed together, so to me these are just a bunch of
Unicode text files.

What I need is to implement an efficient EXACT string lookup. That is, I
need to be able to find ANY Unicode string exactly as it appears. I do not
care about language-specific modifications of the string. For example, if I
search for the string "run", I do not need to find "ran", but I do want to
find it in all of the strings below:

Fox is running fast
!%#^&$run!$!%@&$#
run,run
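
To make the requirement concrete, here is a minimal sketch of the matching
behaviour I mean, written in plain Java rather than Lucene. The class and
method names (ExactMatchDemo, containsExact) are made up for illustration;
the only library call is String.indexOf, which compares the query against
the text exactly, with no stemming or case folding.

```java
public class ExactMatchDemo {

    // Returns true if the query occurs anywhere inside the text,
    // compared exactly (no stemming, no case folding, no tokenization).
    static boolean containsExact(String text, String query) {
        return text.indexOf(query) >= 0;
    }

    public static void main(String[] args) {
        // The three sample strings from this message.
        String[] samples = {
            "Fox is running fast",
            "!%#^&$run!$!%@&$#",
            "run,run"
        };
        for (String sample : samples) {
            System.out.println(containsExact(sample, "run") + "  " + sample);
        }
        // "ran" must not be treated as a match for "run".
        System.out.println(containsExact("Fox ran fast", "run"));
    }
}
```

Whatever analyzer ends up being used, it would need to reproduce this
behaviour for arbitrary Unicode queries.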

Is there a way of using StandardAnalyzer, or any other analyzer, with the
corresponding query parser to find these? Again, my queries might be more or
less random Unicode sequences, and I need to find all their occurrences in
the text.

Essentially, what I am trying to do is implement substring matching more
efficiently than with Java's standard substring matching methods.
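
For reference, the baseline I would like to beat looks roughly like the
sketch below: a linear scan that reports every offset at which the query
occurs. The names (LinearScanBaseline, findAll) are hypothetical, and this is
only an illustration of the standard-library approach, not a proposed
solution. Each query has to scan every document in full, which is what gets
slow on a large collection.

```java
import java.util.ArrayList;
import java.util.List;

public class LinearScanBaseline {

    // Collects the start offset (in UTF-16 code units) of every occurrence
    // of the query in the text, including overlapping occurrences.
    static List<Integer> findAll(String text, String query) {
        List<Integer> hits = new ArrayList<>();
        if (query.isEmpty()) {
            return hits; // an empty query would match everywhere; skip it
        }
        int from = 0;
        while (true) {
            int at = text.indexOf(query, from);
            if (at < 0) {
                break;
            }
            hits.add(at);
            from = at + 1; // step by one so overlapping matches are kept
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] docs = {
            "Fox is running fast",
            "!%#^&$run!$!%@&$#",
            "run,run"
        };
        // Every query has to visit every document from start to end.
        for (String doc : docs) {
            System.out.println(findAll(doc, "run") + "  " + doc);
        }
    }
}
```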

Thanks!

Ilya Zavorin

--_000_A57498EDEC10C64781EA0F7DBA665CEF27C874DFex2010mb011caci_--