lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sarah Hunter" <>
Subject StandardAnalyzer Problem with Apostrophes
Date Mon, 13 Nov 2006 21:51:09 GMT
Hi there,
Any ideas you have about the following would be greatly appreciated.

I'd like apostropes to break up a word into two for indexing - ie, the
french l'observatoire would be indexed as two separate tokens, l
observatoire. My understanding from reading documentation and list
archives is that StandardAnalyzer should do this. However, it is not
working that way for me, and l'observatoire is indexing as one word.
Interstingly, l`observatoire (ie, with the other, less common
apostrophe) is indexing properly, as l observatoire .

Here is the test I've written and the output I'm getting.


import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class LuceneTokenizingTest {

   static Analyzer analyzer;

   public static void main(String[] args) throws IOException {
       analyzer =  new StandardAnalyzer(new String[] {});


       public static void testTokenizeUnusualApostrophe() {
               System.out.println( "Testing how l`observatoire is tokenized." );
               System.out.println( "Expecting: [l] [observatoire] ");
               System.out.println( "Getting: " +
analyze("l`observatoire") + "\n\n");

       public static void testTokenizeUsualApostrophe() {
               System.out.println( "Testing how l'observatoire is tokenized." );
               System.out.println( "Expecting: [l] [observatoire] ");
               System.out.println( "Getting: " + analyze("l'observatoire") );

   private static String analyze(String text) {
       String returnString="";
               TokenStream stream = analyzer.tokenStream("contents", new
               while (true) {
                   Token token =;
                   if (token == null) break;
                       returnString = returnString + "[" +
token.termText() + "] ";
       }catch(IOException e){
               System.out.println("Exception: " + e.toString());
       return returnString;


Testing how l`observatoire is tokenized.
Expecting: [l] [observatoire]
Getting: [l] [observatoire]

Testing how l'observatoire is tokenized.
Expecting: [l] [observatoire]
Getting: [l'observatoire]

Thanks so much for your help!


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message