lucene-java-user mailing list archives

From "Digy" <>
Subject RE: Search in non-linguistic text
Date Thu, 16 Jul 2009 18:03:53 GMT
Another approach could be splitting the text into chars and returning each
char as a token (in a custom analyzer).

For ex: for the document [some text]
Tokens would be [s] [o] [m] [e]       [t] [e] [x] [t] and searches such as
[ome] or [ex] would get hits.

Sample code written in C# is below:
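[A minimal Java sketch of the same idea, standing in for the C# listing; the class and method names are illustrative, not Digy's originals, and it simply drops whitespace rather than tracking position gaps, so it is an illustration of the single-char-token approach rather than a real Lucene analyzer:]

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CharTokenizerSketch {
    // Split text into one token per non-whitespace character,
    // mimicking what a custom single-char analyzer would emit.
    static List<String> charTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (char c : text.toCharArray()) {
            if (!Character.isWhitespace(c)) {
                tokens.add(String.valueOf(c));
            }
        }
        return tokens;
    }

    // A phrase query over char tokens is equivalent to checking that
    // the query's token sequence appears contiguously in the document's.
    static boolean phraseMatch(List<String> docTokens, List<String> queryTokens) {
        return Collections.indexOfSubList(docTokens, queryTokens) >= 0;
    }

    public static void main(String[] args) {
        List<String> doc = charTokens("some text");
        System.out.println(doc);                                  // [s, o, m, e, t, e, x, t]
        System.out.println(phraseMatch(doc, charTokens("ome")));  // true
        System.out.println(phraseMatch(doc, charTokens("ex")));   // true
        System.out.println(phraseMatch(doc, charTokens("xs")));   // false
    }
}
```

[In a real analyzer you would also keep the position increment across the space so phrases cannot match across word boundaries.]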


-----Original Message-----
From: Matthew Hall [] 
Sent: Thursday, July 16, 2009 4:36 PM
Subject: Re: Search in non-linguistic text

Assuming your dataset isn't incredibly large, I think you could.. cheat 
here, and optimize your data for searching.

Am I correct in assuming that BC should also match on ABCD?

If so, then yes, your current thoughts on the problems you face are 
correct: everything you do will turn into a contains search, 
which is, yes.. not the best performance you have ever seen.

However, knowing this, you can manipulate your data in such a way that 
you get around that limitation, and turn everything into a prefix 
(or postfix) search if you so prefer.

So here's what you do:

When you are indexing the term ABCD, you are actually going to add 
several documents into the index (or into various special purpose 
indexes, if you so prefer.. but more on that later on)

Let's say you want to turn everything into a prefix search under the covers.

In the index you would store the following values, all of which point at 
the document "ABCD":

ABCD, BCD, CD, D

Then, when you do your search for the term "BC" you will really be 
searching on "BC*", which will produce a match to the second document.
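[The suffix-expansion trick above can be sketched in plain Java; the map stands in for the index, and the class and method names are made up for illustration:]

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SuffixIndexSketch {
    // For each code, index every suffix; each indexed value points
    // back at the original document ("ABCD").
    static Map<String, String> indexSuffixes(String code) {
        Map<String, String> index = new LinkedHashMap<>();
        for (int i = 0; i < code.length(); i++) {
            index.put(code.substring(i), code);
        }
        return index;
    }

    // A "contains" query now becomes a prefix query over the suffixes:
    // searching "BC" really means searching "BC*".
    static List<String> prefixSearch(Map<String, String> index, String query) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : index.entrySet()) {
            if (e.getKey().startsWith(query)) {
                hits.add(e.getValue());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> index = indexSuffixes("ABCD"); // ABCD, BCD, CD, D
        System.out.println(prefixSearch(index, "BC"));     // [ABCD], matched via suffix BCD
    }
}
```

[In Lucene itself the prefix step would be a PrefixQuery rather than a linear scan, but the indexing shape is the same.]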

Now, a Lucene document can be considered a giant data-holding object: you 
can, and SHOULD, have fields in the document that are not used at search 
time but ARE used at display generation time (or whatever layer feeds 
your display, if you are going in a more OO fashion).

Now, this technique isn't without its drawbacks, of course: you will see 
an increase in your index size, but unless you are playing around with 
some VERY large datasets, that really shouldn't matter.

Now, if I were the one implementing this, I would probably make at least 
two indexes: one for exact, punctuation-relevant data.  The other index 
would contain the data I've described above, with one important 
difference: any and all punctuation (including whitespace) is 
removed, and all of the letters in your codes are collapsed down into a 
single word.  That way you can perform two searches, and ensure that 
exact, punctuation-relevant matches appear higher in your results 
list than non-punctuation-relevant ones.
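[The punctuation-collapsed variant for that second index might look like this; a sketch, with the uppercase folding being an added assumption:]

```java
public class CodeNormalizerSketch {
    // Strip punctuation and whitespace so "ABC/D", "ABC-D" and
    // "ABC D" all collapse to the single word "ABCD".
    static String normalize(String code) {
        StringBuilder sb = new StringBuilder();
        for (char c : code.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                sb.append(Character.toUpperCase(c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("ABC/D")); // ABCD
        System.out.println(normalize("ABC-D")); // ABCD
        System.out.println(normalize("ABC D")); // ABCD
    }
}
```

[Feeding both the exact and the normalized form into their respective indexes is what lets the punctuation-exact search rank above the lax one.]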

Anyhow, that's pretty much it in a nutshell.  I think this technique 
should work for you, after you have decided

JesL wrote:
> Hello,
> Are there any suggestions / best practices for using Lucene for searching
> non-linguistic text?  What I mean by non-linguistic is that it's not
> or any other language, but rather product codes.  This is presenting some
> interesting challenges.  Among them are the need for pretty lax wildcard
> searches.  For example, ABC should match on ABCD, but so should BCD.
> it needs to be agnostic to special characters.  So, ABC/D should match
> as well as ABC-D or "ABC D".
> As I write an analyzer to handle these cases, I seem to be pretty quickly
> degrading into a "like '%blah%' search, with rules to treat all special
> characters as single-character, optional wildcards.  I'm concerned that
> performance of this will be disappointing, though.
> Any help would be much appreciated.  Thanks!
> - Jes

Matthew Hall
Software Engineer
Mouse Genome Informatics
(207) 288-6012

To unsubscribe, e-mail:
For additional commands, e-mail:

