lucene-java-user mailing list archives

From Steve Rowe <sar...@gmail.com>
Subject Re: ASCIIFoldingFilterFactory
Date Fri, 06 Jun 2014 00:48:21 GMT
Hi Michael,

Questions about Solr should go to the Solr user mailing list, rather than this list, which
is for Lucene users - see <http://lucene.apache.org/solr/discussion.html> for how to
subscribe.

I’ve never heard of ASCIIFoldingExpansionFilterFactory, but ASCIIFoldingFilterFactory has
a new option “preserveOriginal”, introduced in Lucene/Solr 4.7 by LUCENE-5437 <https://issues.apache.org/jira/browse/LUCENE-5437>,
that should do the trick.

Just add preserveOriginal="true" - see the example in the javadocs (if you copy/paste
it, make sure you change the attribute value from "false", as it is in the example, to
"true"): <http://lucene.apache.org/core/4_8_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilterFactory.html>
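
As a rough sketch of what that could look like in schema.xml (the fieldType name "text_ascii" and the whitespace tokenizer are illustrative assumptions, not taken from your setup):

```xml
<fieldType name="text_ascii" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- preserveOriginal="true" keeps the unfolded token alongside the folded one -->
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
  </analyzer>
</fieldType>
```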

Note, though, that as Ahmet Arslan points out on LUCENE-5437, queries that generate multiple
terms (e.g. prefix and regex queries) will trigger a failure.  You can work around this problem
by defining both "index" and "query" analyzer types for the fieldtype you use with this
field, and setting preserveOriginal="true" only on the "index" analyzer type.
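
A sketch of that split might look like the following. The fieldType name "text_phonetic_folded", the standard tokenizer, the lowercase filter, and PhoneticFilterFactory with the DoubleMetaphone encoder are all assumptions, standing in for whichever tokenizer and phonetic filter you are actually experimenting with:

```xml
<fieldType name="text_phonetic_folded" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: emit both the original and the folded token, then encode each phonetically -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false"/>
  </analyzer>
  <!-- Query time: fold only (preserveOriginal defaults to false), so each query term stays a single token -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false"/>
  </analyzer>
</fieldType>
```

With a setup along these lines, the index would hold phonetic codes for both the diacritic and the folded form of a word in one field, while queries are matched on the folded form only.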

See this page on the Solr Reference Guide for more info about analyzers in Solr: <https://cwiki.apache.org/confluence/display/solr/What+Is+An+Analyzer%3F>.

Steve

On Jun 5, 2014, at 8:05 PM, Michael Tobias <michael@tobias.org.uk> wrote:

> Hi there
> 
> I am a relative newbie Solr user so please be gentle with me.
> 
> I am experimenting with various phonetic filters and the tokens created can
> vary depending on whether the words contain diacritical characters.
> 
> My problem is that the documents being indexed are not always consistent in
> terms of the use of diacritics (sometimes the same word can have diacritics
> and sometimes not) and of course when users submit  queries they may or may
> not use diacritics properly.
> 
> If I wasn't trying to use phonetic matching I would simply use the
> ASCIIFoldingFilterFactory to remove any problem characters and match on
> that.
> 
> What I would like to do is create phonetic tokens for both the
> diacritic-version of the word and the folded-version of the word - but I
> would like to store the tokens in a single phonetic field for querying
> purposes.....
> 
> How can I achieve that????
> 
> I did find a few references online to "ASCIIFoldingExpansionFilterFactory"
> which appears to do what I want - when creating the 'folded' version of a
> word it appears to keep the diacritic version too. I could then apply my
> phonetic filter to those expanded tokens.
> 
> Is there any other way to do this?  Or if ASCIIFoldingExpansionFilterFactory
> is the only or easiest way to do the job can somebody tell me HOW to
> incorporate that into my Solr setup????
> 
> Many thanks!!
> 
> Michael
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 

