lucene-java-user mailing list archives

From: Michael Sokolov <msoko...@safaribooksonline.com>
Subject: Re: Why does this search fail?
Date: Wed, 27 Aug 2014 14:26:15 GMT
Tokenization is tricky.  You might consider using a whitespace tokenizer 
followed by a word delimiter filter (instead of the standard tokenizer); the 
filter does a kind of secondary tokenization pass that can preserve the 
original token in addition to its component parts.  There are some odd side 
effects to do with term frequencies and phrase-like queries, but I think it 
would make all of these wildcard queries work.
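
Something along these lines might do it (an untested sketch against the 
Lucene 4.7 APIs; the exact WordDelimiterFilter flag combination is a guess 
you would want to tune):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
    import org.apache.lucene.util.Version;

    Analyzer theAnalyzer = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            // Split only on whitespace first, then let WordDelimiterFilter do the
            // secondary pass; PRESERVE_ORIGINAL keeps "C0001.DevNm001" as a term
            // alongside the parts "C0001" and "DevNm001".
            Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_47, reader);
            int flags = WordDelimiterFilter.GENERATE_WORD_PARTS
                      | WordDelimiterFilter.GENERATE_NUMBER_PARTS
                      | WordDelimiterFilter.PRESERVE_ORIGINAL;
            TokenStream result = new WordDelimiterFilter(source, flags, null);
            result = new LowerCaseFilter(Version.LUCENE_47, result);
            return new TokenStreamComponents(source, result);
        }
    };

If you index and query with that analyzer, the field should contain 
c0001.devnm001, c0001 and devnm001, so both DevNm00* and C0001.DevNm00* 
have a prefix to match.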

-Mike

On 08/27/2014 09:54 AM, Milind wrote:
> I see.  This is going to be extremely difficult to explain to end users.
> It doesn't work as they would expect.  Some of the tokenizing rules are
> already somewhat confusing.  Their expectation is that it should work the
> way their searches work in Google.
>
> It's difficult enough to recognize that because the period is surrounded by
> a digit and a letter (as opposed to 2 digits or 2 letters), it gets
> tokenized.  So I'd have expected that C0001.DevNm00* would effectively
> become a search for C0001 OR DevNm00*.  But now, because of the presence of
> the wildcard, the whole thing is kept as 1 term and the period is not
> treated as a token separator.  That's actually good in itself, but the fact
> that the field is still indexed as 2 terms makes wildcard searches very
> unintuitive.  I don't suppose I can do anything about making a wildcard
> search span multiple terms that the tokenizer split apart.  But is there any
> way that I can force the query to go through the analyzer prior to doing the
> search?
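>
> For what it's worth, the index-time split is easy to confirm by dumping the
> analyzer output.  An untested sketch along these lines (Lucene 4.7;
> CharTermAttribute is in org.apache.lucene.analysis.tokenattributes) prints
> "c0001" and then "devnm001":
>
>     Analyzer theAnalyzer = new StandardAnalyzer(Version.LUCENE_47);
>     try (TokenStream theStream = theAnalyzer.tokenStream("Name", "C0001.DevNm001")) {
>         CharTermAttribute theTerm = theStream.addAttribute(CharTermAttribute.class);
>         theStream.reset();
>         while (theStream.incrementToken()) {
>             System.out.println(theTerm.toString());  // the terms that get indexed
>         }
>         theStream.end();
>     }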
>
>
>
>
> On Tue, Aug 26, 2014 at 4:21 PM, Jack Krupansky <jack@basetechnology.com>
> wrote:
>
>> Sorry, but you can only use a wildcard on a single term.  "C0001.DevNm001"
>> gets indexed as two terms, "c0001" and "devnm001", so your wildcard won't
>> match any term (at least in this case).
>>
>> Also, if your query term includes a wildcard, it will not be fully
>> analyzed.  Some filters, such as lower case, are defined as "multi-term"
>> and are still applied, but the standard tokenizer is not called, so the
>> dot remains and the whole string is treated as a single term, unlike the
>> index-time analysis.
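>>
>> To make that concrete: with the index you already have, a query built
>> directly against the two indexed terms should match the document, e.g.
>> (untested sketch; the classes are from org.apache.lucene.search and
>> org.apache.lucene.index):
>>
>>         BooleanQuery theQuery = new BooleanQuery();
>>         theQuery.add(new TermQuery(new Term("Name", "c0001")), BooleanClause.Occur.MUST);
>>         theQuery.add(new PrefixQuery(new Term("Name", "devnm00")), BooleanClause.Occur.MUST);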
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Milind
>> Sent: Tuesday, August 26, 2014 12:24 PM
>> To: java-user@lucene.apache.org
>> Subject: Why does this search fail?
>>
>>
>> I have a field with the value C0001.DevNm001.  If I search for
>>
>>     C0001.DevNm001 --> Get Hit
>>     DevNm00*       --> Get Hit
>>     C0001.DevNm00* --> Get No Hit
>>
>> The field gets tokenized on the period since it's surrounded by a letter
>> and a number.  The query gets evaluated as a prefix query.  I'd have
>> thought that this should have found the document.  Any clues on why this
>> doesn't work?
>>
>> The full code is below.
>>
>>         Directory theDirectory = new RAMDirectory();
>>         Version theVersion = Version.LUCENE_47;
>>         Analyzer theAnalyzer = new StandardAnalyzer(theVersion);
>>         IndexWriterConfig theConfig = new IndexWriterConfig(theVersion, theAnalyzer);
>>         IndexWriter theWriter = new IndexWriter(theDirectory, theConfig);
>>
>>         String theFieldName = "Name";
>>         String theFieldValue = "C0001.DevNm001";
>>         Document theDocument = new Document();
>>         theDocument.add(new TextField(theFieldName, theFieldValue, Field.Store.YES));
>>         theWriter.addDocument(theDocument);
>>         theWriter.close();
>>
>>         String theQueryStr = theFieldName + ":C0001.DevNm00*";
>>         Query theQuery = new QueryParser(theVersion, theFieldName, theAnalyzer).parse(theQueryStr);
>>         System.out.println(theQuery.getClass() + ", " + theQuery);
>>
>>         IndexReader theIndexReader = DirectoryReader.open(theDirectory);
>>         IndexSearcher theSearcher = new IndexSearcher(theIndexReader);
>>         TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
>>         theSearcher.search(theQuery, collector);
>>         ScoreDoc[] theHits = collector.topDocs().scoreDocs;
>>         System.out.println("Hits found: " + theHits.length);
>>
>> Output:
>>
>> class org.apache.lucene.search.PrefixQuery, Name:c0001.devnm00*
>> Hits found: 0
>>
>>
>> --
>> Regards
>> Milind
>>
>>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

