lucene-solr-user mailing list archives

From Naomi Dushay <ndus...@stanford.edu>
Subject Re: [solrmarc-tech] apostrophe / ayn / alif
Date Thu, 24 May 2012 17:41:26 GMT
The alif and ayn can also be used as diacritic-like characters in Korean; this is a known practice. But thanks anyway.

On May 24, 2012, at 9:30 AM, Charles Riley wrote:

> Hi Naomi,
> 
> I don't have a conclusive answer for you on this yet, but let me pick up on a few points.
> 
> First, the apostrophe is probably being handled through ignoring punctuation in the ICUCollationKeyFilterFactory.
 
> 
> Alif isn't a diacritic but a letter, and its character properties would be handled as such, apparently also outside the scope of what the folding filter factory does unless it's tailored.
> 
> From the solrwiki, this looks like a helpful rule of thumb:
> 
> "When To use a CharFilter vs a TokenFilter
> There are several pairs of CharFilters and TokenFilters that have related (ie: MappingCharFilter and ASCIIFoldingFilter) or nearly identical functionality (ie: PatternReplaceCharFilterFactory and PatternReplaceFilterFactory) and it may not always be obvious which is the best choice.
> 
> The ultimate decision depends largely on what Tokenizer you are using, and whether you need to "out smart" it by preprocessing the stream of characters.
> 
> For example, maybe you have a tokenizer such as StandardTokenizer and you are pretty happy with how it works overall, but you want to customize how some specific characters behave.
> 
> In such a situation you could modify the rules and re-build your own tokenizer with javacc, but perhaps its easier to simply map some of the characters before tokenization with a CharFilter."
> 
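The "map some of the characters before tokenization" idea quoted above can be illustrated outside Solr with a rough Python sketch. Everything here is an assumption for illustration: the code points in APOSTROPHE_LIKE are guesses at what the alif and ayn might be stored as (check the actual records), and tokenize() is a crude stand-in for a real tokenizer plus folding filter, not Solr's behavior. In Solr itself the mapping step would be done by a CharFilter such as MappingCharFilter placed ahead of the tokenizer.

```python
import re

# Assumed code points for alif/ayn-like characters; verify against real data.
APOSTROPHE_LIKE = {
    "\u02BC": "'",  # MODIFIER LETTER APOSTROPHE
    "\u02BB": "'",  # MODIFIER LETTER TURNED COMMA
    "\u02BE": "'",  # MODIFIER LETTER RIGHT HALF RING
    "\u02BF": "'",  # MODIFIER LETTER LEFT HALF RING
}
_TABLE = str.maketrans(APOSTROPHE_LIKE)


def map_chars(text):
    """Char-filter step: rewrite characters before any tokenization."""
    return text.translate(_TABLE)


def tokenize(text):
    """Crude stand-in tokenizer: drop ASCII apostrophes, lowercase,
    then split on anything that is not a word character."""
    return re.findall(r"\w+", text.replace("'", "").lower())


def analyze(text):
    """Full chain: character mapping first, then tokenization."""
    return tokenize(map_chars(text))


if __name__ == "__main__":
    with_modifier = "Ch\u02BCoe"   # spelled with a modifier letter
    with_apostrophe = "Ch'oe"      # spelled with an ASCII apostrophe
    print(tokenize(with_modifier), tokenize(with_apostrophe))
    print(analyze(with_modifier), analyze(with_apostrophe))
```

Without the mapping step the two spellings produce different tokens, because the modifier letter is treated as a letter rather than as punctuation. With the mapping step they produce the same token.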
> 
> Charles    
> 
> On Tue, May 15, 2012 at 2:47 PM, Naomi Dushay <ndushay@stanford.edu> wrote:
> We are using the ICUFoldingFilterFactory with great success to fold diacritics so searches with and without the diacritics get the same results.
> 
> We recently discovered we have some Korean records that use an alif diacritic instead of an apostrophe, and this diacritic is NOT getting folded. Has anyone experienced this for alif or ayn characters? Do you have a solution?
> 
> 
> - Naomi
> 
> --
> You received this message because you are subscribed to the Google Groups "solrmarc-tech" group.
> To post to this group, send email to solrmarc-tech@googlegroups.com.
> To unsubscribe from this group, send email to solrmarc-tech+unsubscribe@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/solrmarc-tech?hl=en.
> 
> 
> 
> 
> -- 
> Charles L. Riley
> Catalog Librarian for Africana
> Sterling Memorial Library, Yale University
> <zenodotus@gmail.com>
> 203-432-7566

