lucene-java-user mailing list archives

From Elmer van Chastelet <evanchaste...@gmail.com>
Subject Re: PhoneticFilterFactory 's inject parameter
Date Wed, 25 Apr 2012 15:02:45 GMT
Thanks for your suggestion Ian, but I just found out that if I replace 
the KeywordTokenizer with a WhitespaceTokenizer, all seems to work fine.

Just to test what happens, I created another field 'orig', using this 
analyzer:
analyzer KeywordLowered{
     tokenizer = KeywordTokenizer
     tokenfilter = LowerCaseFilter
}

Guess what: exactly the same problem, also in Luke.
It finds no documents for the query:
orig:strange
even though the term 'strange' is in the index for the field 'orig'.

Does anybody have a clue why documents are not matched when using the 
KeywordTokenizer? Remember that none of the queries or terms contain 
whitespace.
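
For reference, a minimal plain-Java sketch of how the two tokenizers are documented to differ: KeywordTokenizer emits the whole input as one token, while WhitespaceTokenizer splits at whitespace. The class and method names below are hypothetical stand-ins, not Lucene's classes, and the regex split only approximates Lucene's whitespace test. For whitespace-free input such as 'strange', both chains should produce the same single term, which is why the difference in matching is unexpected.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-ins for the two analyzer chains; not Lucene code.
public class TokenizerSketch {

    // Mimics KeywordTokenizer + LowerCaseFilter: the whole input is one token.
    public static List<String> keywordLowered(String input) {
        return Collections.singletonList(input.toLowerCase(Locale.ROOT));
    }

    // Mimics WhitespaceTokenizer + LowerCaseFilter: split at runs of whitespace.
    public static List<String> whitespaceLowered(String input) {
        List<String> tokens = new ArrayList<>();
        for (String part : input.split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Whitespace-free input: both chains yield the same single term.
        System.out.println(keywordLowered("Strange"));
        System.out.println(whitespaceLowered("Strange"));
        // Input with a space: only here do the chains differ.
        System.out.println(keywordLowered("strange loop"));
        System.out.println(whitespaceLowered("strange loop"));
    }
}
```

If the real analyzers behave like this sketch, the tokenizer choice alone should not change which whitespace-free terms end up in the index.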


Thanks again.
-Elmer


On 04/25/2012 02:53 PM, Ian Lea wrote:
> You seem to be quietly going round in circles, by yourself!  I suggest
> a small self-contained program/test case with a RAM index created from
> scratch.  You can then experiment with inject on or off and if you
> still can't figure it out, post the code and hopefully someone will be
> able to help you make sense of it.
>
> Make sure you tell us what version of Lucene you are using.  If not
> the latest, wouldn't hurt to try with the latest.
>
>
> --
> Ian.
>
>
> On Wed, Apr 25, 2012 at 1:22 PM, Elmer van Chastelet
> <evanchastelet@gmail.com>  wrote:
>> I keep replying to myself, it all gets a bit confusing.
>> The problem still exists and I don't understand why, and why it worked once.
>>
>> I have the same behavior again as posted in my first mail:
>> - Inject parameter is set to true.
>> - The index has _no deleted documents_ and is optimized.
>> - The term 'compete' is in there.
>> - If I ask Luke to show all docs for term 'compete' it shows me the one and
>> only document that represents this word. But...
>> - If I perform the query 'value:compete' in Luke again, it says there are no
>> results.
>>
>> Here is the index I'm currently using. It contains various fields for the
>> available phonetic filter encoders:
>> https://www.box.com/s/34212e82227e102f6734
>>
>> Can somebody explain this behavior? What's the real use of the inject
>> parameter of the PhoneticFilterFactory?
>>
>> Thanks in advance.
>>
>> -Elmer
>>
>>
>> On 04/25/2012 12:25 PM, Elmer van Chastelet wrote:
>>> Problem solved. Long story short: for some reason I had deleted documents
>>> in the index and the non-deleted documents used the phonetic filter with
>>> inject set to false.
>>>
>>> Works fine now :)
>>>
>>> On 04/23/2012 09:27 PM, Elmer van Chastelet wrote:
>>>> Hi all,
>>>>
>>>> (scroll to bottom for question)
>>>>
>>>> I was setting up a simple web app to play around with phonetic filters.
>>>> The idea is simple, I just create a document for each word in the English
>>>> dictionary, each document containing a single search field holding the value
>>>> after it is preprocessed using the following analyzer def (in our own dsl
>>>> syntax, which gets transformed to java):
>>>>
>>>> analyzer soundslike{
>>>>     tokenizer = KeywordTokenizer
>>>>     tokenfilter = LowerCaseFilter
>>>>     tokenfilter = PhoneticFilter(encoder="DoubleMetaphone", inject="true")
>>>> }
>>>>
>>>> I can run the web app and I get results that indeed (in some way) sound
>>>> like the original query term.
>>>>
>>>> But what confuses me is the ranking of the results, knowing that I set
>>>> the inject param to true. If I search for the query term 'compete', the
>>>> parsed query becomes '(value:KMPT value:compete)', and therefore I expect
>>>> the word 'compete' to be ranked higher in the list than any other word...
>>>> but this wasn't the case.
>>>>
>>>> Looking further at the explanation of results, I saw that the term
>>>> 'compete' in the parsed query is totally absent, and only the phonetic
>>>> encoding seems to affect the ranking:
>>>>
>>>>   * COMPETITOR
>>>>       o 4.368826 = (MATCH) sum of:
>>>>           + 4.368826 = (MATCH) weight(value:KMPT in 3174), product of:
>>>>               # 0.52838135 = queryWeight(value:KMPT), product of:
>>>>                   * 8.26832 = idf(docFreq=150, maxDocs=216555)
>>>>                   * 0.063904315 = queryNorm
>>>>               # 8.26832 = (MATCH) fieldWeight(value:KMPT in 3174),
>>>>                 product of:
>>>>                   * 1.0 = tf(termFreq(value:KMPT)=1)
>>>>                   * 8.26832 = idf(docFreq=150, maxDocs=216555)
>>>>                   * 1.0 = fieldNorm(field=value, doc=3174)
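
The numbers in this explanation can be checked by hand. The sketch below assumes the classic DefaultSimilarity formulas of Lucene 3.x (idf = 1 + ln(maxDocs / (docFreq + 1)), tf = sqrt(termFreq)) and takes queryNorm as given from the output rather than deriving it; the class and method names are made up for illustration. It recomputes the score from the single value:KMPT clause, which is consistent with the observation that no value:compete clause contributes.

```java
// Hypothetical helper that recomputes the explain output above; not Lucene code.
public class ExplainCheck {

    // Assumed DefaultSimilarity idf: 1 + ln(maxDocs / (docFreq + 1)).
    public static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log(maxDocs / (double) (docFreq + 1));
    }

    // Assumed DefaultSimilarity tf: square root of the term frequency.
    public static double tf(int termFreq) {
        return Math.sqrt(termFreq);
    }

    // Score of one term clause: queryWeight * fieldWeight.
    public static double score(int termFreq, int docFreq, int maxDocs,
                               double queryNorm, double fieldNorm) {
        double idf = idf(docFreq, maxDocs);
        double queryWeight = idf * queryNorm;                 // reported as 0.52838135
        double fieldWeight = tf(termFreq) * idf * fieldNorm;  // reported as 8.26832
        return queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        // Values copied from the explain output: docFreq=150, maxDocs=216555,
        // queryNorm=0.063904315, termFreq=1, fieldNorm=1.0.
        System.out.printf("idf   = %.5f%n", idf(150, 216555));
        System.out.printf("score = %.6f%n", score(1, 150, 216555, 0.063904315, 1.0));
    }
}
```

With the values from the output, the result should agree with the reported 4.368826 up to rounding, since Lucene itself computes in single precision.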
>>>>
>>>> The next thing I did was running our friend Luke. In Luke, I opened the
>>>> documents tab, and started iterating over some terms for the field 'value'
>>>> until I found 'compete'. When I hit 'Show All Docs', the search tab opens
>>>> and it displays the one and only document holding this value (i.e. the
>>>> document representing the word 'compete'). It shows the query:
>>>> 'value:compete '. Then, when I hit the search button again (query is still
>>>> 'value:compete '), it says that there are no results !?
>>>>
>>>> Probably, the 'Show All Docs' button does something different than
>>>> performing a query using the search tab in Luke.
>>>>
>>>> Q: Can somebody explain why the injected original terms seem to get
>>>> ignored at query time? Or may it be related to the name of the search field
>>>> ('value'), or something else?
>>>>
>>>> We use Lucene 3.1 with SOLR analyzers (by Hibernate Search 3.4.2).
>>>>
>>>> -Elmer
>>>>
>>>>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

