lucene-general mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: charFilter
Date Thu, 13 Sep 2012 12:08:49 GMT
Hi,

You must *implement* the protected method correct(int offset) in your own CharFilter. It
should first call super.correct(offset) (this is important if you chain several filters)
and then return an offset corrected for the transformations your own CharFilter made.
If, for example, offset 3 in the filtered data corresponds to offset 5 in the original input,
you must return 5 when the given offset (after calling super) is 3.
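
To make the offset mapping concrete, here is a minimal stand-alone sketch. It does not use any
Lucene classes: OffsetCorrector and addCorrection are hypothetical names chosen for illustration,
and the table of cumulative differences is only an assumption about how a filter might track its
own edits.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Stand-alone illustration of offset correction (hypothetical class, not Lucene API).
 * It maps an offset in the filtered output back to an offset in the original input.
 */
final class OffsetCorrector {
    // Parallel lists: from output offset outOffsets.get(i) onward, add diffs.get(i).
    private final List<Integer> outOffsets = new ArrayList<>();
    private final List<Integer> diffs = new ArrayList<>();

    /**
     * Records that every output offset >= outOffset lies cumulativeDiff characters
     * further along in the original input. Call in increasing outOffset order.
     */
    void addCorrection(int outOffset, int cumulativeDiff) {
        outOffsets.add(outOffset);
        diffs.add(cumulativeDiff);
    }

    /** Returns the original-input offset for the given filtered-output offset. */
    int correct(int outOffset) {
        int diff = 0;
        for (int i = 0; i < outOffsets.size(); i++) {
            if (outOffsets.get(i) > outOffset) {
                break; // later corrections do not apply yet
            }
            diff = diffs.get(i);
        }
        return outOffset + diff;
    }
}
```

Suppose the filter turns "a--bcd" into "abcd" by dropping the two dashes. It would record
addCorrection(1, 2), after which correct(3) returns 5 and correct(0) still returns 0.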

Unrelated to that: catching the IOException and printing it to System.err is a poor way to
implement such a filter. Just make your constructor throw IOException itself, so it bubbles
up to Solr. In the factory you can re-throw it as a SolrException. As written, your code would
silently index nonsense or throw a NullPointerException later.
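
A rough sketch of that error-handling pattern, using only java.io so it runs on its own.
EagerFilter and EagerFilterFactory are hypothetical stand-ins for the filter and its factory, and
UncheckedIOException stands in for the SolrException that a real Solr factory would throw.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Hypothetical filter whose constructor propagates I/O failures instead of swallowing them. */
final class EagerFilter {
    final String text;

    EagerFilter(Reader input) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int len;
        // Any IOException raised here reaches the caller; nothing is printed and ignored.
        while ((len = input.read(buf)) >= 0) {
            sb.append(buf, 0, len);
        }
        text = sb.toString();
    }
}

/** Hypothetical factory: converts the checked exception into an unchecked one. */
final class EagerFilterFactory {
    static EagerFilter create(Reader input) {
        try {
            return new EagerFilter(input);
        } catch (IOException e) {
            // Assumption: in Solr this wrapper would be a SolrException instead.
            throw new UncheckedIOException("Could not read input for EagerFilter", e);
        }
    }
}
```

The point is only that the failure surfaces at creation time with its cause attached, rather
than leaving a half-initialized filter behind.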

In general, a CharFilter should *not* read the whole input up front in its constructor and then
transform it; instead, it should implement the read(...) methods and transform the input on the fly.
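
As an illustration of transforming on the fly, the following sketch wraps a plain java.io.Reader
rather than a Lucene CharStream. DotStrippingReader is a hypothetical class that drops '.'
characters as they are read and records the offset corrections along the way; the dot-stripping
rule is an arbitrary example, not the transformation from the original question.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical streaming filter: removes '.' while reading and tracks offset corrections. */
final class DotStrippingReader extends FilterReader {
    // Each entry is {outputOffset, cumulativeDiff}, appended in increasing output order.
    private final List<int[]> corrections = new ArrayList<>();
    private int outPos = 0;  // characters emitted so far
    private int removed = 0; // characters dropped so far

    DotStrippingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c;
        while ((c = in.read()) == '.') {
            removed++;
            corrections.add(new int[] {outPos, removed});
        }
        if (c != -1) {
            outPos++;
        }
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) {
                break;
            }
            buf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }

    /** Maps an offset in the filtered output back to the original input. */
    int correct(int outOffset) {
        int diff = 0;
        for (int[] c : corrections) {
            if (c[0] > outOffset) {
                break;
            }
            diff = c[1];
        }
        return outOffset + diff;
    }

    /** Convenience helper for the example: drains a reader into a String. */
    static String readAll(Reader r) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4];
        try {
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

Reading "PR2823.A2S81988" through this reader yields the 14-character "PR2823A2S81988", and
correct(14) maps the end offset back to 15, the length of the original text. The sketch ignores
skip() and mark(), which a production filter would also need to handle.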

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Osullivan L. [mailto:L.Osullivan@swansea.ac.uk]
> Sent: Thursday, September 13, 2012 12:43 PM
> To: general@lucene.apache.org
> Subject: RE: charFilter
> 
> Hi Folks,
> 
> I'm getting the following error after using a custom filter:
> 
> SEVERE: org.apache.solr.common.SolrException:
> org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token PR
> 2823.000000 A0.200000 S0.819880 exceeds length of provided text sized 15
> 
> As the error suggests, the input value is PR2823.A2S81988 (15 chars). I have
> been informed that correctOffset() method of the CharFilter class can be used
> to resolve this issue but as far as I can tell, all that does is return the value - it
> doesn't set it.
> 
> I have included some details below.
> 
> Kind Regards,
> 
> Luke
> 
> In my schema I have:
> 
>     <fieldType name="LCNormalized" class="solr.TextField"
> sortMissingLast="true" omitNorms="true">
>         <analyzer>
>           <charFilter class="com.test.solr.analysis.LukesTestCharFilterFactory"/>
>           <tokenizer class="solr.KeywordTokenizerFactory"/>
>         </analyzer>
>     </fieldType>
> 
> and the method is:
> 
> public class LukesTestCharFilterFactory extends BaseCharFilterFactory {
> 
> 	public CharStream create(CharStream input) {
> 		return new LukesTestCharFilter(input);
> 	}
> }
> 
> public final class LukesTestCharFilter extends BaseCharFilter {  ...
>   public LukesTestCharFilter(CharStream input)  {
> 	  super(input);
> 	  try {
>           // Load the whole input into a string
>           StringBuilder sb = new StringBuilder();
>           char[] buf = new char[1024];
> 
>           int len;
>           while ((len = input.read(buf)) >= 0) {
>               sb.append(buf, 0, len);
>           }
> 
>           String original = sb.toString();
>           String modified = getLCShelfkey(original);
>           CharStream result = CharReader.get(new StringReader(modified));
> 
>           this.input = result;
>           this.input.correctOffset(modified.length());
>       } catch (IOException e) {
>           System.err.println("There was a problem parsing input.  Skipping.");
>       }
>   }
>  ...
> }