lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sarah Hunter" <corian...@gmail.com>
Subject StandardAnalyzer Problem with Apostrophes
Date Mon, 13 Nov 2006 21:51:09 GMT
Hi there,
Any ideas you have about the following would be greatly appreciated.

I'd like apostropes to break up a word into two for indexing - ie, the
french l'observatoire would be indexed as two separate tokens, l
observatoire. My understanding from reading documentation and list
archives is that StandardAnalyzer should do this. However, it is not
working that way for me, and l'observatoire is indexing as one word.
Interstingly, l`observatoire (ie, with the other, less common
apostrophe) is indexing properly, as l observatoire .

Here is the test I've written and the output I'm getting.

TEST


import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import java.io.StringReader;
import java.io.IOException;

public class LuceneTokenizingTest {

   static Analyzer analyzer;

   public static void main(String[] args) throws IOException {
       analyzer =  new StandardAnalyzer(new String[] {});

       testTokenizeUnusualApostrophe();
       testTokenizeUsualApostrophe();
   }

       public static void testTokenizeUnusualApostrophe() {
               System.out.println( "Testing how l`observatoire is tokenized." );
               System.out.println( "Expecting: [l] [observatoire] ");
               System.out.println( "Getting: " +
analyze("l`observatoire") + "\n\n");
       }

       public static void testTokenizeUsualApostrophe() {
               System.out.println( "Testing how l'observatoire is tokenized." );
               System.out.println( "Expecting: [l] [observatoire] ");
               System.out.println( "Getting: " + analyze("l'observatoire") );
       }

   private static String analyze(String text) {
       String returnString="";
       try{
               TokenStream stream = analyzer.tokenStream("contents", new
StringReader(text));
               while (true) {
                   Token token = stream.next();
                   if (token == null) break;
                       returnString = returnString + "[" +
token.termText() + "] ";
                   }
       }catch(IOException e){
               System.out.println("Exception: " + e.toString());
       }
       return returnString;
   }
}



OUTPUT

Testing how l`observatoire is tokenized.
Expecting: [l] [observatoire]
Getting: [l] [observatoire]


Testing how l'observatoire is tokenized.
Expecting: [l] [observatoire]
Getting: [l'observatoire]




Thanks so much for your help!

Sarah

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message