lucenenet-user mailing list archives

From Ryan <>
Subject Query over field with multiple forward slashes
Date Wed, 22 Feb 2017 20:27:16 GMT
My team has been using Lucene.NET for a number of years to search over text
extracted from files based on user input of search terms. However, we've
recently encountered an issue reported by a customer where searching for
terms with multiple forward slashes is not returning matches.

An example would be an indexed value of SB/ABC/1234-123 and the user inputs
SB/* to match all documents with that prefix. However, no results are
returned based on that query. The odd part is that searching for ABC/* does
return the document with value SB/ABC/1234-123, completely ignoring the SB/
component in the indexed value.

Initially the problem reported was with a combination of a forward slash
and wildcard (SB/* wouldn't return matches for SB/1234-123) but that was
addressed by adding a QueryParser with a KeywordAnalyzer in addition to the
previous QueryParser with only a StandardAnalyzer.

Here is the current code (simplified to the key elements that can reproduce
the issue) being used.

var reader = IndexReader.Open(FSDirectory.Open(new DirectoryInfo(indexPath)), true);
var searcher = new IndexSearcher(reader);
var mainQuery = new BooleanQuery();

// The analyzer and parser for searching the index fields with full
// stop-words and tokenizers
var fieldAnalyzer = new StandardAnalyzer(LuceneVersion);
var fieldParser = new

// The analyzer and parser for searching the index fields using no
// stop words or tokenizers
var fieldKeywordAnalyzer = new KeywordAnalyzer();
var fieldKeywordParser = new

// Build and append the Standard and Keyword query clauses together
// for the whole field value query to pick up all relevant results
var fieldQuery = fieldParser.Parse(textCriteria);
var fieldKeywordQuery =
var fieldBooleanQuery = new BooleanQuery
{
    {fieldQuery, Occur.SHOULD},
    {fieldKeywordQuery, Occur.SHOULD}
};

mainQuery.Add(fieldBooleanQuery, Occur.MUST);
var hits = searcher.Search(mainQuery, reader.NumDocs());

The actual parsed query within mainQuery at the time of the searcher.Search
call is +((Title:sb/abc/*) (Title:sb/abc/*)). In this case, both clauses of
the BooleanQuery happen to be the same. The Luke tool commonly used for
working with Lucene indexes seems to think this is invalid syntax when
using a KeywordAnalyzer (ignoring the tokenized aspect for the moment):

Cannot parse '+((Title:sb/abc/*) (Title:sb/abc/*))': '*' or '?' not allowed
as first character in WildcardQuery.
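
To see how each parser rewrites the input on its own, the two clauses can be parsed and printed separately. The following is only a sketch, assuming Lucene.NET 3.0.x and the Title field name taken from the parsed query above. The LUCENE_30 version constant and the parser constructor arguments are guesses for illustration, not the production settings.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;

static class ParsedQueryDump
{
    static void Main()
    {
        // Assumption: the real LuceneVersion value is not shown in this message.
        var matchVersion = Lucene.Net.Util.Version.LUCENE_30;

        // Sample user input from the scenario described earlier.
        const string input = "SB/*";

        // Assumption: both parsers target the "Title" field with default settings.
        var standardParser = new QueryParser(matchVersion, "Title",
            new StandardAnalyzer(matchVersion));
        var keywordParser = new QueryParser(matchVersion, "Title",
            new KeywordAnalyzer());

        // Print the string form of each parsed query for comparison.
        Console.WriteLine("standard: " + standardParser.Parse(input));
        Console.WriteLine("keyword:  " + keywordParser.Parse(input));
    }
}
```

Comparing the two printed forms against the combined query shows whether either parser alters the slashes or the casing before the search runs.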

My guess is that having the two forward slashes causes the parser to treat
the term somewhat like a regex. The question is how do we get it to match the
results correctly? Escaping the slashes in the search criteria didn't change
the parsed query seen above or the returned results.
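
Independently of the query parser, it may help to check which terms each analyzer emits for the sample value, since those are what a wildcard has to match against. This is a minimal sketch, assuming Lucene.NET 3.0.x (where the term attribute is ITermAttribute) and a LUCENE_30 version constant. The AnalyzerDump class and Dump helper are hypothetical names used only for illustration.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;

static class AnalyzerDump
{
    // Hypothetical helper: prints every term the analyzer emits for the text.
    static void Dump(string label, Analyzer analyzer, string text)
    {
        TokenStream stream = analyzer.TokenStream("Title", new StringReader(text));
        var term = stream.AddAttribute<ITermAttribute>();
        Console.WriteLine(label + ":");
        while (stream.IncrementToken())
        {
            Console.WriteLine("  [" + term.Term + "]");
        }
    }

    static void Main()
    {
        // Sample indexed value from the scenario described earlier.
        const string value = "SB/ABC/1234-123";

        // Assumption: the real LuceneVersion value is not shown in this message.
        var matchVersion = Lucene.Net.Util.Version.LUCENE_30;

        Dump("StandardAnalyzer", new StandardAnalyzer(matchVersion), value);
        Dump("KeywordAnalyzer", new KeywordAnalyzer(), value);
    }
}
```

Whether the field was indexed with the same analyzers is a separate question. The sketch only shows what each analyzer does with the raw string.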

Our current requirement is that the search must support tokenized/stop-word
searching (for text phrases, etc.) as well as exact matches (we store a lot
of invoice numbers, etc. that should not be tokenized) on the same fields
simultaneously, with both handling wildcards. The SB/* query is an example
of a wildcard search over an exact-match value.

Hopefully this makes sense. I can add additional clarification if needed.


