lucene-java-user mailing list archives

From Ilya Zavorin
Subject Efficient string lookup using Lucene
Date Fri, 24 Aug 2012 19:48:45 GMT
Hi Everyone,

I have the following task. I have a set of documents in multiple languages. I don't know what
these languages are. Any given doc may contain text in several languages mixed together. So to me
these are just a bunch of Unicode text files.

What I need is to implement an efficient EXACT string lookup. That is, I need to be able to
find ANY Unicode string exactly as it appears. I do not care about language-specific modifications
of the string. That is, if I search for a string "run", I do not need to find "ran" but I
do want to find it in all of these strings below:

Fox is running fast

Is there a way of using StandardAnalyzer or any other analyzer and the corresponding query
parser to find these? Again, my queries might be more or less random Unicode sequences and
I need to find all their occurrences in the text.

Essentially, what I am trying to do is implement substring matching more efficiently than
using Java's standard substring matching methods.
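
To make the goal concrete, here is a minimal sketch in plain Java of one possible approach:
a character-trigram inverted index that narrows the set of candidate documents, followed by
an exact String.contains check on each candidate. It uses no Lucene APIs. The class name
TrigramIndex, its methods, and the second sample string are hypothetical and only for
illustration. Matching is case-sensitive and works on UTF-16 code units, so it makes no
language-specific assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical helper: trigram prefilter plus exact verification. Not a Lucene API. */
public class TrigramIndex {
    private static final int N = 3; // n-gram length, an arbitrary choice for this sketch

    private final List<String> docs = new ArrayList<>();
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    /** Stores the text and indexes every run of N consecutive chars. Returns the doc id. */
    public int add(String text) {
        int id = docs.size();
        docs.add(text);
        for (int i = 0; i + N <= text.length(); i++) {
            postings.computeIfAbsent(text.substring(i, i + N), k -> new TreeSet<>()).add(id);
        }
        return id;
    }

    /** Returns ids of all documents that contain the query exactly as given. */
    public List<Integer> search(String query) {
        List<Integer> hits = new ArrayList<>();
        if (query.length() < N) {
            // Query is shorter than one n-gram: fall back to scanning every document.
            for (int id = 0; id < docs.size(); id++) {
                if (docs.get(id).contains(query)) {
                    hits.add(id);
                }
            }
            return hits;
        }
        // Intersect the posting lists of all n-grams in the query.
        TreeSet<Integer> candidates = null;
        for (int i = 0; i + N <= query.length(); i++) {
            TreeSet<Integer> p = postings.get(query.substring(i, i + N));
            if (p == null) {
                return hits; // some n-gram never occurs, so no document can match
            }
            if (candidates == null) {
                candidates = new TreeSet<>(p);
            } else {
                candidates.retainAll(p);
            }
            if (candidates.isEmpty()) {
                return hits;
            }
        }
        // The n-grams may occur in a candidate without being adjacent, so verify exactly.
        for (int id : candidates) {
            if (docs.get(id).contains(query)) {
                hits.add(id);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TrigramIndex index = new TrigramIndex();
        index.add("Fox is running fast");   // id 0, from the example above
        index.add("The fox ran away");      // id 1, made-up sample text
        System.out.println(index.search("run")); // prints [0]
    }
}
```

The final contains check is still the standard substring method, but it only runs on documents
that share every trigram with the query, which is where any speedup over scanning all files
would come from.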


Ilya Zavorin
