lucene-java-user mailing list archives

From Jeremy Long <jeremy.l...@gmail.com>
Subject Re: WordDelimiterFilter Question (lucene 4.0)
Date Sun, 23 Dec 2012 19:30:02 GMT
Have you ever wished you could retract a question sent to a mailing list? And
for anyone who read my question - yes, I do know the difference between a
bitwise "and" and a bitwise "or", and how they should be used when combining
flags... Sorry for the spam.
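
For the archive: the WordDelimiterFilter configuration constants are individual
bit flags, so combining them with bitwise AND (as in the code quoted below)
produces 0 and enables no options at all; they need to be combined with bitwise
OR. A minimal sketch of the corrected construction - untested, but otherwise
identical to the line in the analyzer quoted below:

    stream = new WordDelimiterFilter(stream,
            WordDelimiterFilter.CATENATE_WORDS             // "springframework"
            | WordDelimiterFilter.GENERATE_WORD_PARTS      // "spring", "framework"
            | WordDelimiterFilter.PRESERVE_ORIGINAL        // "spring-framework"
            | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE
            | WordDelimiterFilter.STEM_ENGLISH_POSSESSIVE, null);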

--Jeremy

On Sun, Dec 23, 2012 at 11:56 AM, Jeremy Long <jeremy.long@gmail.com> wrote:

> Hello,
>
> I'm having an issue creating a custom analyzer that uses the
> WordDelimiterFilter. I'm attempting to create an index of information
> gleaned from JAR manifest files. So if I have "spring-framework", I need the
> following tokens indexed: "spring", "springframework", "framework", and
> "spring-framework". My understanding is that the WordDelimiterFilter is
> perfect for this. However, when I introduce the filter to the analyzer, I
> don't seem to get any documents indexed correctly.
>
> Here is the analyzer:
>
> import java.io.Reader;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.Tokenizer;
> import org.apache.lucene.analysis.core.WhitespaceTokenizer;
> import org.apache.lucene.analysis.core.LowerCaseFilter;
> import org.apache.lucene.analysis.core.StopAnalyzer;
> import org.apache.lucene.analysis.core.StopFilter;
> import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
> import org.apache.lucene.util.Version;
>
> public class FieldAnalyzer extends Analyzer {
>
>     private Version version = null;
>
>     public FieldAnalyzer(Version version) {
>         this.version = version;
>     }
>
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
>
>         Tokenizer source = new WhitespaceTokenizer(version, reader);
>         TokenStream stream = source;
>
>         stream = new WordDelimiterFilter(stream,
>                 WordDelimiterFilter.CATENATE_WORDS
>                 & WordDelimiterFilter.GENERATE_WORD_PARTS
>                 & WordDelimiterFilter.PRESERVE_ORIGINAL
>                 & WordDelimiterFilter.SPLIT_ON_CASE_CHANGE
>                 & WordDelimiterFilter.STEM_ENGLISH_POSSESSIVE, null);
>
>         stream = new LowerCaseFilter(version, stream);
>         stream = new StopFilter(version, stream,
>                 StopAnalyzer.ENGLISH_STOP_WORDS_SET);
>
>         return new TokenStreamComponents(source, stream);
>     }
> }
>
> //-------------------------------------------------
>
> Performing a very simple test results in zero documents found:
>
>         Analyzer analyzer = new FieldAnalyzer(Version.LUCENE_40);
>         Directory index = new RAMDirectory();
>
>         String text = "spring-framework";
>         String field = "field";
>
>         IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
>         IndexWriter w = new IndexWriter(index, config);
>         Document doc = new Document();
>         doc.add(new TextField(field, text, Field.Store.YES));
>         w.addDocument(doc);
>         w.close();
>
>         String querystr = "spring-framework";
>         Query q = new AnalyzingQueryParser(Version.LUCENE_40, field,
>                 analyzer).parse(querystr);
>         int hitsPerPage = 10;
>
>         IndexReader reader = DirectoryReader.open(index);
>         IndexSearcher searcher = new IndexSearcher(reader);
>         TopScoreDocCollector collector =
>                 TopScoreDocCollector.create(hitsPerPage, true);
>         searcher.search(q, collector);
>         ScoreDoc[] hits = collector.topDocs().scoreDocs;
>
>         System.out.println("Found " + hits.length + " hits.");
>         for (int i = 0; i < hits.length; ++i) {
>             int docId = hits[i].doc;
>             Document d = searcher.doc(docId);
>             System.out.println((i + 1) + ". " + d.get(field));
>         }
>
>
> Any idea what I've done wrong? If I comment out the addition of the
> WordDelimiterFilter, the search works.
>
> Thanks in advance,
>
> Jeremy
>
>
