lucene-java-user mailing list archives

From Wulf Berschin <bersc...@dosco.de>
Subject How to index part numbers
Date Fri, 28 Jan 2011 12:05:44 GMT
Hi,

I'm poking in the dark and hope someone has some light...

We need to retrieve part numbers from technical documentation. For now we 
have a (long) regular expression to find them in a string. The part 
numbers consist of letters, digits and (redundant) whitespace. Furthermore, 
authors often use a compressed notation for number ranges with dashes 
or slashes, like A123-56 or A123/4.
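
To make the format concrete, here is a minimal sketch of recognizing such part numbers and stripping the redundant whitespace. The pattern is an assumption made for illustration (one uppercase letter, then digits, then an optional "-digits" or "/digits" suffix); it is not the long regular expression mentioned above, and the class name PartNumberFinder is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PartNumberFinder {
    // Assumed format: one uppercase letter, digits, and an optional
    // "-digits" or "/digits" suffix, with whitespace allowed between pieces.
    private static final Pattern PART = Pattern.compile(
        "\\b[A-Z]\\s*\\d+(?:\\s*[-/]\\s*\\d+)?\\b");

    /** Returns every part number found in the text, with all whitespace removed. */
    public static List<String> find(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = PART.matcher(text);
        while (m.find()) {
            // Normalize by dropping the redundant whitespace inside the match.
            result.add(m.group().replaceAll("\\s+", ""));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(find("Replace A 123-56 and B77 / 8, see page 12."));
    }
}
```

With this assumed pattern, "A 123-56" and "B77 / 8" are recognized and normalized, while the plain number in "page 12" is skipped because it has no letter prefix.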

When searching for part numbers, users should be able to enter specific 
numbers like A126 (then the text "A123-56" should be found too) or 
wildcard searches like "A12?" or "A*". This part number search is a 
separate feature, apart from the regular full-text search.
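
A small sketch of the intended wildcard semantics, assuming that '?' stands for exactly one character and '*' for any run of characters, and that the terms being matched are already-expanded part numbers. It uses plain java.util.regex rather than any Lucene query class, and the class name PartNumberWildcard is hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class PartNumberWildcard {
    /** Converts a wildcard expression ('?' = one char, '*' = any run) to a regex. */
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                // Quote everything else so it is matched literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** Returns the terms that match the wildcard expression as a whole. */
    static List<String> match(String wildcard, List<String> terms) {
        Pattern p = toPattern(wildcard);
        return terms.stream()
                    .filter(t -> p.matcher(t).matches())
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> terms = List.of("A123", "A124", "A130", "B12");
        System.out.println(match("A12?", terms));
        System.out.println(match("A*", terms));
    }
}
```

The point of the sketch is that wildcards only find a text like "A123-56" if the individual numbers it stands for are present as terms.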

As far as I can see, I have to

- add an extra field for storing part numbers

- create a Tokenizer which recognizes just the part numbers and skips 
all other text

- create an Analyzer which expands ranges like A123-56 to A123, A124, 
..., A156 and normalizes numbers by removing whitespace

With this analyzer I hope to get the highlighting to work too (e.g. 
"A123-56" highlighted when "A126" was the search term).
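
The expansion and the highlighting idea can be sketched in plain Java, independent of Lucene. Several points are assumptions made for illustration: the part number format (one uppercase letter plus up to six digits), the reading of both "-" and "/" as an inclusive range whose suffix replaces the last digits of the first number, and the cap on range size. The names PartNumberExpander, Token and MAX_SPAN are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PartNumberExpander {
    /** One index term plus the character span of the source text it came from. */
    public static final class Token {
        public final String term;
        public final int start;
        public final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // Assumed format: letter, up to 6 digits, optional "-digits" or "/digits".
    private static final Pattern PART = Pattern.compile(
        "\\b([A-Z])\\s*(\\d{1,6})(?:\\s*[-/]\\s*(\\d{1,6}))?\\b");

    // Assumed safety cap: larger ranges are treated as a single number.
    private static final int MAX_SPAN = 1000;

    public static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = PART.matcher(text);
        while (m.find()) {
            String first = m.group(2);
            String suffix = m.group(3);
            int from = Integer.parseInt(first);
            int to = from;
            if (suffix != null) {
                // "123-56" means 123..156: a shorter suffix replaces the
                // trailing digits of the first number.
                String last = suffix.length() < first.length()
                    ? first.substring(0, first.length() - suffix.length()) + suffix
                    : suffix;
                to = Integer.parseInt(last);
            }
            if (to < from || to - from > MAX_SPAN) {
                to = from; // implausible range: keep only the first number
            }
            for (int n = from; n <= to; n++) {
                // Keep the original digit width so leading zeros survive.
                String digits = String.format(Locale.ROOT, "%0" + first.length() + "d", n);
                // Every expanded term points back at the whole original span.
                tokens.add(new Token(m.group(1) + digits, m.start(), m.end()));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("see A 123-56 here")) {
            System.out.println(t.term + " [" + t.start + "," + t.end + ")");
        }
    }
}
```

Because every expanded term carries the offsets of the original notation, a hit on A126 can be mapped back to the span covering "A 123-56". In a Lucene TokenFilter the same information would presumably be carried by OffsetAttribute, with the extra terms stacked at the same position via PositionIncrementAttribute.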

Is this the right way? What could I use as a starting point? (I found 
org.apache.lucene.analysis.miscellaneous.PatternAnalyzer, which does much 
more than I need...)

Thanks for all hints!

Wulf

