lucene-java-user mailing list archives

From Jeremy Long <jeremy.l...@gmail.com>
Subject WordDelimiterFilter Question (lucene 4.0)
Date Sun, 23 Dec 2012 16:56:52 GMT
Hello,

I'm having an issue creating a custom analyzer utilizing the
WordDelimiterFilter. I'm attempting to create an index of information
gleaned from JAR manifest files. So if I have "spring-framework" I need the
following tokens indexed: "spring" "springframework" "framework"
"spring-framework". My understanding is that the WordDelimiterFilter is
perfect for this. However, when I introduce the filter to the analyzer I
don't seem to get any documents indexed correctly.
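
To make the target concrete, the desired token set can be sketched in plain Java. This is only an illustration of the expected output for a hyphenated name. It is not how WordDelimiterFilter works internally, and the class and method names (ExpectedTokens, expectedTokens) are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class ExpectedTokens {

    // Hypothetical helper: lists the tokens that should end up in the index
    // for a hyphenated name. Plain string handling, not Lucene code.
    static Set<String> expectedTokens(String name) {
        Set<String> tokens = new LinkedHashSet<String>();
        StringBuilder joined = new StringBuilder();
        for (String part : name.split("-")) {
            if (!part.isEmpty()) {
                tokens.add(part);          // each word part on its own
                joined.append(part);
            }
        }
        tokens.add(joined.toString());     // the parts catenated
        tokens.add(name);                  // the original, unsplit text
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(expectedTokens("spring-framework"));
    }
}
```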

Here is the analyzer:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.util.Version;

public class FieldAnalyzer extends Analyzer {

    private Version version = null;

    public FieldAnalyzer(Version version) {
        this.version = version;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName,
            Reader reader) {

        Tokenizer source = new WhitespaceTokenizer(version, reader);
        TokenStream stream = source;

        stream = new WordDelimiterFilter(stream,
                WordDelimiterFilter.CATENATE_WORDS
                & WordDelimiterFilter.GENERATE_WORD_PARTS
                & WordDelimiterFilter.PRESERVE_ORIGINAL
                & WordDelimiterFilter.SPLIT_ON_CASE_CHANGE
                & WordDelimiterFilter.STEM_ENGLISH_POSSESSIVE, null);

        stream = new LowerCaseFilter(version, stream);
        stream = new StopFilter(version, stream,
                StopAnalyzer.ENGLISH_STOP_WORDS_SET);

        return new TokenStreamComponents(source, stream);
    }
}

//-------------------------------------------------

A very simple test finds zero documents:

        Analyzer analyzer = new FieldAnalyzer(Version.LUCENE_40);
        Directory index = new RAMDirectory();

        String text = "spring-framework";
        String field = "field";

        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40,
                analyzer);
        IndexWriter w = new IndexWriter(index, config);
        Document doc = new Document();
        doc.add(new TextField(field, text, Field.Store.YES));
        w.addDocument(doc);
        w.close();

        String querystr = "spring-framework";
        Query q = new AnalyzingQueryParser(Version.LUCENE_40, field,
                analyzer).parse(querystr);
        int hitsPerPage = 10;

        IndexReader reader = DirectoryReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopScoreDocCollector collector =
                TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get(field));
        }


Any idea what I've done wrong? If I comment out the addition of
WordDelimiterFilter - the search works.
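
One thing worth checking in the analyzer above is how the configuration flags are combined. They are joined with the bitwise AND operator (&). If each flag is a distinct single bit, which is the usual convention for such constants, ANDing them yields 0, so the filter would be constructed with none of the options enabled. The sketch below uses stand-in constants rather than the Lucene class. Their numeric values are an assumption made for illustration, and only the fact that each is a different single bit matters.

```java
public class FlagCheck {

    // Stand-in constants (assumed values, not read from Lucene); each one is
    // a distinct single bit, as is conventional for option flags.
    static final int GENERATE_WORD_PARTS = 1;
    static final int CATENATE_WORDS = 4;
    static final int PRESERVE_ORIGINAL = 32;
    static final int SPLIT_ON_CASE_CHANGE = 64;
    static final int STEM_ENGLISH_POSSESSIVE = 256;

    // Combination as written in the analyzer: bitwise AND.
    static int combinedWithAnd() {
        return CATENATE_WORDS
                & GENERATE_WORD_PARTS
                & PRESERVE_ORIGINAL
                & SPLIT_ON_CASE_CHANGE
                & STEM_ENGLISH_POSSESSIVE;
    }

    // Conventional way to switch several flags on at once: bitwise OR.
    static int combinedWithOr() {
        return CATENATE_WORDS
                | GENERATE_WORD_PARTS
                | PRESERVE_ORIGINAL
                | SPLIT_ON_CASE_CHANGE
                | STEM_ENGLISH_POSSESSIVE;
    }

    public static void main(String[] args) {
        System.out.println("AND: " + combinedWithAnd()); // no bits in common
        System.out.println("OR:  " + combinedWithOr());  // all five bits set
    }
}
```

If that is the cause, replacing & with | in the WordDelimiterFilter constructor call would be the first thing to try. Whether it fully resolves the zero-hit result has not been verified here.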

Thanks in advance,

Jeremy
