lucene-java-user mailing list archives

From "Jack Krupansky" <>
Subject Re: Efficient string lookup using Lucene
Date Fri, 24 Aug 2012 20:52:11 GMT
I can't speak for any non-Latin languages, but how about simply using the 
StandardAnalyzer plus the EdgeNGramFilter at index time (but not at query 
time)? The latter would allow a query of "run" to match "running".
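
To illustrate the idea, here is a minimal plain-Java sketch of index-time edge 
n-grams. It does not use Lucene at all: the class EdgeNGramSketch, its methods, 
the crude tokenizer, and the gram sizes (1 to 20) are all assumptions made up 
for this example, and the tokenizer is only a rough stand-in for what 
StandardAnalyzer does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not Lucene's API or implementation.
public class EdgeNGramSketch {
    // Indexed term -> ids of documents containing a token that starts with it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Rough stand-in for an analyzer: lowercase, split on non-letters/digits.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Prefixes of the token with lengths minGram..maxGram (in UTF-16 units).
    public static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int upper = Math.min(maxGram, token.length());
        for (int len = minGram; len <= upper; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    // Index time: every token is expanded into its edge n-grams.
    public int add(String text) {
        int id = nextDocId++;
        for (String token : tokenize(text)) {
            for (String gram : edgeNGrams(token, 1, 20)) {
                postings.computeIfAbsent(gram, k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    // Query time: the query is only lowercased, NOT expanded into n-grams.
    public Set<Integer> search(String query) {
        Set<Integer> hits = postings.get(query.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<Integer>emptySet() : hits;
    }

    public static void main(String[] args) {
        EdgeNGramSketch index = new EdgeNGramSketch();
        index.add("Fox is running fast");   // doc 0
        index.add("He ran home");           // doc 1
        index.add("An overrun budget");     // doc 2
        System.out.println(index.search("run")); // prints [0]
    }
}
```

Note that in this sketch "run" matches "running" but not "overrun", because 
edge n-grams are anchored at the start of each token. Matching in the middle of 
a token would need a different scheme, for example n-grams taken from every 
position.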

-- Jack Krupansky

-----Original Message----- 
From: Ilya Zavorin
Sent: Friday, August 24, 2012 3:48 PM
Subject: Efficient string lookup using Lucene

Hi Everyone,

I have the following task. I have a set of documents in multiple languages. 
I don't know what these languages are. Any given doc may contain text in 
several languages mixed up. So to me these are just a bunch of Unicode text.

What I need is to implement an efficient EXACT string lookup. That is, I 
need to be able to find ANY Unicode string exactly as it appears. I do not 
care about language-specific modifications of the string. That is, if I 
search for a string "run", I do not need to find "ran" but I do want to find 
it in all of these strings below:

Fox is running fast

Is there a way of using StandardAnalyzer or any other analyzer and the 
corresponding query parser to find these? Again, my queries might be more or 
less random Unicode sequences and I need to find all their occurrences in 
the text.

Essentially, what I am trying to do is implement substring matching more 
efficiently than using Java's standard substring matching methods.


Ilya Zavorin 
