lucene-java-user mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Indexing punctuation
Date Wed, 29 Jun 2005 14:39:13 GMT
>I do a vaguely similar thing;  I have to strip accents from 
>characters such as e-acute out of both my input data and my incoming 
>search queries to put them into a standard form.  I do this with a 
>custom TokenFilter subclass.  I have an analyzer that includes this 
>filter along with some of the standard ones (LowerCaseFilter, etc). 
>I run the same analyzer on indexing and searching, which has been 
>discussed in other posts.
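
As a rough illustration of the accent stripping described above, here 
is a minimal sketch of the string transformation such a filter might 
apply to each token. It is not the poster's TokenFilter and does not 
use the Lucene API at all; the class name AccentFolding and the method 
fold are hypothetical, and the approach (decompose with 
java.text.Normalizer, then drop combining marks) is an assumption about 
one reasonable way to do it.

```java
import java.text.Normalizer;
import java.util.Locale;

public class AccentFolding {
    // Hypothetical helper: decompose to NFD so that e-acute becomes
    // 'e' followed by a combining acute accent, remove the combining
    // marks, then lowercase the result.
    static String fold(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "Caf\u00e9" is "Cafe" with an e-acute as the last letter.
        System.out.println(fold("Caf\u00e9"));
    }
}
```

The same function would need to run on both the indexed text and the 
incoming query so that the two sides agree.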

For a hard-core approach to this problem, you could try converting 
all text to Unicode first, then use the ICU package to create a level 
0 (primary strength) "sort key" (the C API is ucol_getSortKey). The 
result is a byte sequence suitable for comparison to determine weak 
equality, but you can also encode it and index it as a regular token.
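
To make the sort key idea concrete, here is a small sketch. It uses 
the JDK's java.text.Collator as a stand-in for ICU, which is an 
assumption made to keep the example free of external dependencies; 
ICU's own collator is configured along broadly similar lines but is a 
separate library. The class name PrimarySortKey, the method 
primaryKey, and the hex encoding of the key bytes are all 
hypothetical choices for illustration.

```java
import java.text.Collator;
import java.util.Locale;

public class PrimarySortKey {
    // Hypothetical helper: build a primary-strength collation key and
    // hex-encode its bytes so it can be stored as an ordinary token.
    static String primaryKey(String text, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        // PRIMARY strength ignores case and accent differences.
        collator.setStrength(Collator.PRIMARY);
        // Decompose precomposed characters such as e-acute before comparing.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        byte[] key = collator.getCollationKey(text).toByteArray();
        StringBuilder hex = new StringBuilder(key.length * 2);
        for (byte b : key) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // The two keys printed here should be identical strings.
        System.out.println(primaryKey("Caf\u00e9", Locale.ENGLISH));
        System.out.println(primaryKey("cafe", Locale.ENGLISH));
    }
}
```

The locale argument is where the locale-guessing question comes in: 
the same text can produce different keys under different locales, so 
indexing and searching would have to use the same one.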

There are some subtle issues with locale-specific behavior in the sort 
key generation step: you would have to guess at the right locale to 
use for the conversion. In general, though, that shouldn't matter.

Two other issues are code/data size (ICU can be big) and the 
performance hit while indexing documents.

-- Ken



>Aigner, Thomas wrote:
>
>>Hello all,
>>
>>	I am VERY new to Lucene and we are trying out Lucene to see if
>>it will accomplish the vast majority of our search functions.
>>
>>	I have a question about a good way to index some of our product
>>description codes.  We have description codes like 21-MA-GAB and other
>>punctuation.  Our users need to be able to search for "21 MA GAB" 
>>or "21-MA_GAB" or "21MAGAB".  Is the best way to accomplish this by
>>creating synonyms for the 3 different ways when punctuation is in parts
>>to search for? I know I can stop punctuation in the index but what about
>>grouping the information together or with spaces?
>>
>>Thanks all in advance,
>>Tom
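
For the product codes in the question, one simple possibility (an 
assumption, not something proposed in the thread) is to reduce every 
code to a canonical form by dropping anything that is not a letter or 
digit, and to apply the same reduction to the query. The class name 
ProductCodeKey and the method normalize are hypothetical.

```java
import java.util.Locale;

public class ProductCodeKey {
    // Hypothetical helper: remove every character that is not a letter
    // or digit, then uppercase, so punctuation and spacing variants of
    // a code collapse to one token.
    static String normalize(String code) {
        return code.replaceAll("[^\\p{L}\\p{N}]+", "").toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("21-MA-GAB"));
        System.out.println(normalize("21 MA GAB"));
        System.out.println(normalize("21MAGAB"));
    }
}
```

This only covers whole-code matches; searching for a single part of a 
code, such as "GAB" on its own, would need the parts indexed as 
separate tokens as well.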


-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
