MIME-Version: 1.0
In-Reply-To: <EA1C5CB43E084664B10E7C3D0A8F7652@JackKrupansky14>
References: 
 <CAKEYXB-gZ-n6ZxVGxhrv+G6jWk77A2SAnTghJ6G-rf5sW4pjjQ@mail.gmail.com>
	<6E6C37B1A0EA40B38E85A23108C7B2B4@JackKrupansky14>
	<CAKEYXB8-bQV-JA8WehWvHqWkP--ysb8CevA6sp6SkJ-LtTt4=Q@mail.gmail.com>
	<53FDEA87.4070501@safaribooksonline.com>
	<EA1C5CB43E084664B10E7C3D0A8F7652@JackKrupansky14>
Date: Wed, 27 Aug 2014 10:55:56 -0400
Message-ID: 
 <CAKEYXB9YbnrHPHzkcWj4zgzZJycgDffjXKQKT3ukTr_DpAZBLw@mail.gmail.com>
Subject: Re: Why does this search fail?
From: Milind <milindr@gmail.com>
To: java-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=001a11c3556411b74905019d9e77

--001a11c3556411b74905019d9e77
Content-Type: text/plain; charset=UTF-8

Thanks Jack.  I'll try this out.  I'll have to see if that creates other
side effects :-(.  Tokenization is already causing a great deal of
confusion.  I want to make it as intuitive as possible.


On Wed, Aug 27, 2014 at 10:45 AM, Jack Krupansky <jack@basetechnology.com>
wrote:

> Yes, the white space tokenizer will preserve all punctuation, but... then
> the query for DevNm00* will fail. A "smarter" set of filters is probably
> needed here... start with white space tokenization, keep that overall
> token, then trim external punctuation and keep that token as well, and then
> use word delimiter filter to split out the embedded words, like DevNm00,
> and add them.
>
> The word delimiter filter will do most of that, but not the part of
> trimming out external punctuation. But depending on your use case, it may
> be close enough.
>
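
A rough sketch of the chain described above, in plain Java rather than real
Lucene filters. The names (LayeredTokensDemo, tokens, anyTermStartsWith) are
made up for illustration, the lowercasing is an assumption to mirror what the
index already does, and the splitting only handles punctuation. The real
WordDelimiterFilter has its own options and can also split on case changes
and letter/digit transitions, so this models the idea, not that filter:

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class LayeredTokensDemo {

    // Hypothetical model of the proposed chain, emitting up to three layers per word:
    //   1. the whole whitespace-delimited token,
    //   2. that token with leading/trailing punctuation trimmed,
    //   3. the embedded words obtained by splitting on inner punctuation.
    static Set<String> tokens(String text) {
        Set<String> out = new LinkedHashSet<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String whole = raw.toLowerCase(Locale.ROOT); // assumption: lowercase everything
            out.add(whole);
            String trimmed = whole.replaceAll("^[^\\p{Alnum}]+|[^\\p{Alnum}]+$", "");
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
            for (String part : trimmed.split("[^\\p{Alnum}]+")) {
                if (!part.isEmpty()) {
                    out.add(part);
                }
            }
        }
        return out;
    }

    // A prefix query has to match a single term; check whether any term qualifies.
    static boolean anyTermStartsWith(Set<String> terms, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        for (String term : terms) {
            if (term.startsWith(p)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> terms = tokens("(C0001.DevNm001),");
        for (String term : terms) {
            System.out.println(term);
        }
        System.out.println(anyTermStartsWith(terms, "DevNm00"));       // true
        System.out.println(anyTermStartsWith(terms, "C0001.DevNm00")); // true
    }
}
```

With all three layers in the index, both the DevNm00* and the C0001.DevNm00*
prefixes have a single term to match against, at the cost of extra terms per
value.
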
> See:
> http://lucene.apache.org/core/4_9_0/analyzers-common/org/
> apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html
>
> -- Jack Krupansky
>
> -----Original Message----- From: Michael Sokolov
> Sent: Wednesday, August 27, 2014 10:26 AM
> To: java-user@lucene.apache.org
> Subject: Re: Why does this search fail?
>
>
> Tokenization is tricky.  You might consider using whitespace tokenizer
> followed by word delimiter filter (instead of standard tokenizer); it
> does a kind of secondary tokenization pass that can preserve the
> original token in addition to its component parts. There are some weird
> side effects to do with term frequencies and phrase-like queries, but it
> would make all these wildcard queries work I think.
>
> -Mike
>
> On 08/27/2014 09:54 AM, Milind wrote:
>
>> I see.  This is going to be extremely difficult to explain to end users.
>> It doesn't work as they would expect.  Some of the tokenizing rules are
>> already somewhat confusing.  Their expectation is that it should work the
>> way their searches work in Google.
>>
>> It's difficult enough to recognize that the period splits the value
>> because it sits between a digit and a letter (as opposed to two digits or
>> two letters).  So I'd have expected that C0001.DevNm00* would effectively
>> become a search for C0001 OR DevNm00*.  But because of the wildcard, the
>> query is treated as one term and the period is not treated as a delimiter.
>> That's actually good, but the fact that the indexed value is still split
>> into two terms makes wildcard searches very unintuitive.  I don't suppose
>> I can make a wildcard search span multiple terms that are joined by a
>> delimiter.  But is there any way that I can force the query through an
>> analyzer prior to doing the search?
>>
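
One possible workaround for the question above, sketched in plain Java: split
the wildcard term on punctuation before it ever reaches the query parser, so
that only the last piece carries the wildcard. The class and method names
(WildcardPreSplitDemo, preSplit) are hypothetical, and the splitting rule here
(any run of non-alphanumeric characters) is an assumption; a real version
would have to mirror whatever the index-time analyzer actually does, which
this does not attempt:

```java
final class WildcardPreSplitDemo {

    // Hypothetical pre-processing step: break a term such as "C0001.DevNm00*"
    // into space-separated pieces, keeping a trailing '*' on the last piece only.
    static String preSplit(String term) {
        boolean wildcard = term.endsWith("*");
        String body = wildcard ? term.substring(0, term.length() - 1) : term;
        StringBuilder out = new StringBuilder();
        for (String part : body.split("[^\\p{Alnum}]+")) {
            if (part.isEmpty()) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(part);
        }
        if (wildcard && out.length() > 0) {
            out.append('*');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(preSplit("C0001.DevNm00*")); // C0001 DevNm00*
        System.out.println(preSplit("DevNm00*"));       // DevNm00*
    }
}
```

The resulting string would then be parsed as two clauses, a plain term and a
prefix, combined with whatever default operator the parser is configured to
use.
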
>>
>>
>>
>> On Tue, Aug 26, 2014 at 4:21 PM, Jack Krupansky <jack@basetechnology.com>
>> wrote:
>>
>>> Sorry, but you can only use a wildcard on a single term. "C0001.DevNm001"
>>> gets indexed as two terms, "c0001" and "devnm001", so your wildcard won't
>>> match any term (at least in this case).
>>>
>>> Also, if your query term includes a wildcard, it will not be fully
>>> analyzed. Some filters, such as lower case, are defined as "multi-term", so
>>> they will be performed, but the standard tokenizer is not called, so the
>>> dot remains and the whole term is treated as one term, unlike at index
>>> time.
>>>
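
The mismatch described above can be reproduced with a small plain-Java model.
This is a deliberately crude stand-in, not Lucene's StandardTokenizer: the
splitting rule (a period stays inside a token only between two letters or two
digits) is taken from the description in this thread, and the names
(PrefixMismatchDemo, indexTerms, prefixMatches) are invented for the sketch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PrefixMismatchDemo {

    // Crude model of index-time analysis: a '.' is kept only between two letters
    // or two digits; any other non-alphanumeric character ends the current token.
    // Tokens are lowercased.
    static List<String> indexTerms(String text) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean keep = Character.isLetterOrDigit(c);
            if (c == '.' && i > 0 && i + 1 < text.length()) {
                char prev = text.charAt(i - 1);
                char next = text.charAt(i + 1);
                keep = (Character.isLetter(prev) && Character.isLetter(next))
                        || (Character.isDigit(prev) && Character.isDigit(next));
            }
            if (keep) {
                current.append(c);
            } else if (current.length() > 0) {
                terms.add(current.toString().toLowerCase(Locale.ROOT));
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString().toLowerCase(Locale.ROOT));
        }
        return terms;
    }

    // Model of the query side: the wildcard term is lowercased but NOT tokenized,
    // and the resulting prefix must match one single indexed term.
    static boolean prefixMatches(List<String> terms, String wildcardQuery) {
        String prefix = wildcardQuery
                .substring(0, wildcardQuery.length() - 1) // drop the trailing '*'
                .toLowerCase(Locale.ROOT);
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> terms = indexTerms("C0001.DevNm001");
        System.out.println(terms);                                  // [c0001, devnm001]
        System.out.println(prefixMatches(terms, "DevNm00*"));       // true
        System.out.println(prefixMatches(terms, "C0001.DevNm00*")); // false
    }
}
```

The last call fails for the same reason as the real query: no single indexed
term starts with "c0001.devnm00".
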
>>> -- Jack Krupansky
>>>
>>> -----Original Message----- From: Milind
>>> Sent: Tuesday, August 26, 2014 12:24 PM
>>> To: java-user@lucene.apache.org
>>> Subject: Why does this search fail?
>>>
>>>
>>> I have a field with the value C0001.DevNm001.  If I search for
>>>
>>>     C0001.DevNm001 --> Get Hit
>>>     DevNm00*       --> Get Hit
>>>     C0001.DevNm00*  --> Get No Hit
>>>
>>> The field gets tokenized on the period since it sits between a number
>>> and a letter.  The query gets evaluated as a prefix query.  I'd have
>>> thought that this should have found the document.  Any clues on why this
>>> doesn't work?
>>>
>>> The full code is below.
>>>
>>>         // Index a single document with StandardAnalyzer.
>>>         Directory theDirectory = new RAMDirectory();
>>>         Version theVersion = Version.LUCENE_47;
>>>         Analyzer theAnalyzer = new StandardAnalyzer(theVersion);
>>>         IndexWriterConfig theConfig = new IndexWriterConfig(theVersion, theAnalyzer);
>>>         IndexWriter theWriter = new IndexWriter(theDirectory, theConfig);
>>>
>>>         String theFieldName = "Name";
>>>         String theFieldValue = "C0001.DevNm001";
>>>         Document theDocument = new Document();
>>>         theDocument.add(new TextField(theFieldName, theFieldValue, Field.Store.YES));
>>>         theWriter.addDocument(theDocument);
>>>         theWriter.close();
>>>
>>>         // Parse the wildcard query with the same analyzer and run it.
>>>         String theQueryStr = theFieldName + ":C0001.DevNm00*";
>>>         Query theQuery =
>>>             new QueryParser(theVersion, theFieldName, theAnalyzer).parse(theQueryStr);
>>>         System.out.println(theQuery.getClass() + ", " + theQuery);
>>>         IndexReader theIndexReader = DirectoryReader.open(theDirectory);
>>>         IndexSearcher theSearcher = new IndexSearcher(theIndexReader);
>>>         TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
>>>         theSearcher.search(theQuery, collector);
>>>         ScoreDoc[] theHits = collector.topDocs().scoreDocs;
>>>         System.out.println("Hits found: " + theHits.length);
>>>
>>> Output:
>>>
>>> class org.apache.lucene.search.PrefixQuery, Name:c0001.devnm00*
>>> Hits found: 0
>>>
>>>
>>> --
>>> Regards
>>> Milind
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>
>
>


-- 
Regards
Milind

--001a11c3556411b74905019d9e77--