lucene-java-user mailing list archives

From Milind <mili...@gmail.com>
Subject KeywordAnalyzer still getting tokenized on spaces
Date Tue, 09 Sep 2014 00:04:57 GMT
I thought I could use the KeywordTokenizer to prevent tokenizing on spaces,
so that I could treat certain fields as a single term.  But the query is
still being tokenized on spaces.

In the code below, I'm storing a document with a serial number that contains
spaces.  I want to treat it as a single term, without requiring end users to
turn it into a phrase query by surrounding it with double quotes.  But it
doesn't work the way I expected.  Is there something I need to do
differently?  Shouldn't the keyword tokenizer treat the entire text as one
token?
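To make sure I have the right mental model, here is a plain-Java sketch (no Lucene involved; the class and method names are mine) of the two behaviors I have in mind: a keyword-style tokenizer should emit the whole input as one token, while a whitespace-style tokenizer splits it into three.

```java
import java.util.Arrays;
import java.util.List;

public class TokenizeSketch {
    // Models what I expect KeywordTokenizer + LowerCaseFilter to do:
    // emit the entire input as a single, lowercased token.
    static List<String> keywordTokenize(String text) {
        return Arrays.asList(text.toLowerCase());
    }

    // Models whitespace tokenization: one token per space-separated chunk.
    static List<String> whitespaceTokenize(String text) {
        return Arrays.asList(text.toLowerCase().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(keywordTokenize("1023 4567 8765"));   // [1023 4567 8765]
        System.out.println(whitespaceTokenize("1023 4567 8765")); // [1023, 4567, 8765]
    }
}
```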

------------
This is the custom analyzer class I use.

    private static class LowerCaseKeywordAnalyzer extends Analyzer
    {
        @Override
        protected TokenStreamComponents createComponents(String theFieldName,
                                                         Reader theReader)
        {
            // Emit the entire field value as a single, lowercased token.
            Tokenizer theTokenizer = new KeywordTokenizer(theReader);
            TokenStream theTokenStream =
                new LowerCaseFilter(Version.LUCENE_46, theTokenizer);
            return new TokenStreamComponents(theTokenizer, theTokenStream);
        }
    }

The code that uses the analyzer:

    Version theVersion = Version.LUCENE_46;
    Directory theIndex = new RAMDirectory();

    Analyzer theAnalyzer = new LowerCaseKeywordAnalyzer();
    IndexWriterConfig theConfig =
        new IndexWriterConfig(theVersion, theAnalyzer);

    IndexWriter theWriter = new IndexWriter(theIndex, theConfig);
    Document theDocument = new Document();
    FieldType theFieldType = new FieldType();
    theFieldType.setStored(true);
    theFieldType.setIndexed(true);
    theFieldType.setTokenized(false);  // index the value as a single term
    theDocument.add(new Field("sn", "1023 4567 8765", theFieldType));
    theWriter.addDocument(theDocument);
    theWriter.close();

    String[] theQueryStrings = new String[]
      {
           "\"1023 4567 8765\"",
           "1023 4567 8765"
      };

    QueryParser theParser = new QueryParser(theVersion, "sn", theAnalyzer);
    IndexReader theIndexReader = DirectoryReader.open(theIndex);
    IndexSearcher theSearcher = new IndexSearcher(theIndexReader);
    for (String currQueryStr : theQueryStrings) {
        Query currQuery = theParser.parse("sn:" + currQueryStr);
        System.out.println(currQuery.getClass() + ", " + currQuery);
        TopScoreDocCollector currCollector =
            TopScoreDocCollector.create(10, true);
        theSearcher.search(currQuery, currCollector);
        ScoreDoc[] currHits = currCollector.topDocs().scoreDocs;
        System.out.println("Number of results found for '" + currQueryStr +
                           "': " + currHits.length);
    }
    theIndexReader.close();

The output

    class org.apache.lucene.search.TermQuery, sn:1023 4567 8765
    Number of results found for '"1023 4567 8765"': 1
    class org.apache.lucene.search.BooleanQuery, sn:1023 sn:4567 sn:8765
    Number of results found for '1023 4567 8765': 0
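Looking at the BooleanQuery in the second line of output, my guess is that the query parser splits the unquoted input on whitespace *before* handing each piece to the analyzer, so the analyzer never sees the full string.  A plain-Java model of that guess (not actual Lucene code; names are mine):

```java
import java.util.ArrayList;
import java.util.List;

public class ParserSplitSketch {
    // Models my guess: unquoted whitespace separates clauses first,
    // and only then is each clause run through the analyzer.
    static List<String> parseUnquoted(String query) {
        List<String> clauses = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            clauses.add(analyze(chunk));
        }
        return clauses;
    }

    // The analyzer (keyword tokenizer + lowercase) sees one chunk at a
    // time, so it cannot re-join what the parser already split apart.
    static String analyze(String chunk) {
        return chunk.toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(parseUnquoted("1023 4567 8765")); // [1023, 4567, 8765]
    }
}
```

If that model is right, it would explain why the quoted query (analyzed as one string) becomes a single TermQuery and matches, while the unquoted one becomes three separate term clauses and matches nothing.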

-- 
Regards
Milind
