lucene-dev mailing list archives

From Bernd Fehling <bernd.fehl...@uni-bielefeld.de>
Subject Re: accessing the query string from inside TokenFilter
Date Wed, 26 Oct 2011 12:49:21 GMT
Thanks, Robert, for pointing me to the issue. That's exactly my problem,
because I'm trying to implement "query time synonym expansion".
Therefore it is necessary to "clean up" the synonym result with the help
of the query string.
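
A minimal sketch of the ThreadLocal workaround Uwe suggests further down in
this thread, written without any Lucene classes so it stays self-contained.
The names QueryStringHolder, withQuery and SynonymCleanup are hypothetical,
and the cleanup rule (keep a synonym candidate only if it occurs literally in
the raw query) is just an assumed stand-in for the real comparison. It also
assumes the analyzer runs on the same thread that started the parse.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical holder for the raw query string; not part of Lucene or Solr. */
final class QueryStringHolder {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    private QueryStringHolder() {}

    /** Runs the parse step with the raw query visible on this thread, then always clears it. */
    static <T> T withQuery(String rawQuery, Supplier<T> parse) {
        CURRENT.set(rawQuery);
        try {
            return parse.get();
        } finally {
            CURRENT.remove(); // avoid leaking the value into pooled threads
        }
    }

    /** Returns the raw query string, or null when called outside withQuery. */
    static String get() {
        return CURRENT.get();
    }
}

/** Stand-in for the work a TokenFilter would do after the synonym step. */
final class SynonymCleanup {
    private SynonymCleanup() {}

    /** Assumed rule: keep only candidates that occur literally in the raw query. */
    static List<String> keepIfInQuery(List<String> candidates) {
        String raw = QueryStringHolder.get();
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (raw != null && raw.contains(candidate)) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

In a real setup the call to withQuery would wrap the query parser invocation,
and the filter would read QueryStringHolder.get() inside incrementToken().
Nested withQuery calls are not handled here: the inner call clears the value
for the outer one as well.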

Interestingly, my FAST system calls the synonym step twice during query parsing:
...
synonym
parse
synonym
...

I would be pleased to have this fixed so that QueryParser is not also
a tokenizer. But having looked into QueryParser (which scared me to
death): can it be fixed at all without causing other bad side
effects?

Using a phrase query works so far for passing the complete query string
to the analyzer at once.
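
To illustrate why the phrase query helps, here is a deliberately simplified
model of the behaviour described in this thread: the parser splits on
whitespace before analysis, so the analyzer is called once per chunk, while a
quoted phrase reaches it in one piece. This is a toy, not QueryParser's actual
grammar; the class and method names are hypothetical, and operators, fields
and escaping are ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a parser that splits on whitespace outside double quotes. */
final class WhitespaceSplitModel {
    private WhitespaceSplitModel() {}

    /** Returns the chunks that would each be handed to the analyzer separately. */
    static List<String> analyzerInputs(String query) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char ch : query.toCharArray()) {
            if (ch == '"') {
                // A quote ends whatever chunk was being built and toggles phrase mode.
                flush(chunks, current);
                inQuotes = !inQuotes;
            } else if (Character.isWhitespace(ch) && !inQuotes) {
                // Outside a phrase, whitespace separates chunks.
                flush(chunks, current);
            } else {
                current.append(ch);
            }
        }
        flush(chunks, current);
        return chunks;
    }

    /** Adds the pending chunk, if any, and resets the buffer. */
    private static void flush(List<String> chunks, StringBuilder current) {
        if (current.length() > 0) {
            chunks.add(current.toString());
            current.setLength(0);
        }
    }
}
```

Under this model, analyzerInputs("big apple pie") yields three separate
chunks, so a shingle or synonym filter never sees neighbouring words together,
whereas analyzerInputs("\"big apple pie\"") yields a single chunk containing
the whole string.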


Am 26.10.2011 14:09, schrieb Robert Muir:
> Use a queryparser that doesn't break on whitespace as a workaround?
> Or, we can start thinking about how to fix QueryParser
> (https://issues.apache.org/jira/browse/LUCENE-2605)
>
> The bug is that QueryParser tries to be a Tokenizer and breaks on whitespace.
> Allowing tokenizer access to the query string would just mean that
> your tokenizer hacks around this by trying to be a QueryParser, too,
> making matters even worse!
>
>
> On Wed, Oct 26, 2011 at 8:05 AM, Bernd Fehling
> <bernd.fehling@uni-bielefeld.de>  wrote:
>> OK, I think "query string" is a bit too specific, so more generally
>> what I need is access from inside of a filter to the complete string
>> (not only the token) being analyzed.
>>
>> A very dirty workaround would be a "collector filter" which collects all
>> tokens after WhitespaceTokenizer and makes it somehow available for
>> the following filters, or not?
>> So at least at the last run of incrementToken() I have the original string.
>>
>> Bernd
>>
>> Am 26.10.2011 10:26, schrieb Uwe Schindler:
>>>
>>> The input from StringReader does not help you:
>>> - in the case of QueryParser it is *not* the query string!!!
>>> - storing it in an attribute would blow up your heap for real documents
>>>
>>> Uwe
>>> -----
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: uwe@thetaphi.de
>>>
>>>
>>>> -----Original Message-----
>>>> From: Bernd Fehling [mailto:bernd.fehling@uni-bielefeld.de]
>>>> Sent: Wednesday, October 26, 2011 10:06 AM
>>>> To: dev@lucene.apache.org
>>>> Subject: Re: accessing the query string from inside TokenFilter
>>>>
>>>>   From what I can see in the debugger the analyzer chain is implemented as a
>>>> stack with last filter at the bottom and the first filter at the top.
>>>>
>>>> An analyzer query chain of:
>>>> charFilter: MappingCharFilterFactory
>>>> tokenizer : WhitespaceTokenizerFactory
>>>> filter    : PatternReplaceFilterFactory
>>>> filter    : LowerCaseFilterFactory
>>>> filter    : ShingleFilterFactory
>>>> filter    : SynonymFilterFactory
>>>>
>>>> has a chain of:
>>>> this.input(SynonymFilter) -->    input(ShingleFilter) -->
>>>> input(LowerCaseFilter) -->    input(PatternReplaceFilter) -->
>>>> input(WhitespaceTokenizer) -->    input(MappingCharFilter) -->
>>>> input(CharReader) -->    input(StringReader).str
>>>>
>>>> So I can always "see" the input of StringReader, but can I access it?
>>>>
>>>> Bernd
>>>>
>>>> Am 26.10.2011 09:37, schrieb Chris Male:
>>>>>
>>>>> We've also lost the full query string by the time the QP creates its
>>>>> TokenStream, right? Because the QP tokenizes on whitespace.
>>>>>
>>>>> On Wed, Oct 26, 2011 at 8:32 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
>>>>>
>>>>>> Hi Simon,
>>>>>>
>>>>>> The problem is the exchanged consumer/producer role. Once the
>>>>>> TokenStream calls clearAttributes() the attributes are gone, but the
>>>>>> query parser can only set the attribute *before* calling
>>>>>> incrementToken(), so you have no chance to get them, as the Tokenizer
>>>>>> cleared it before any filter can read it (unless we use an attribute
>>>>>> whose clear() is a no-op, which would fail lots of tests, as it's a hack).
>>>>>>
>>>>>> Uwe
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Simon Willnauer [mailto:simon.willnauer@googlemail.com]
>>>>>>> Sent: Wednesday, October 26, 2011 9:21 AM
>>>>>>> To: dev@lucene.apache.org
>>>>>>> Subject: Re: accessing the query string from inside TokenFilter
>>>>>>>
>>>>>>> What Uwe says is correct though. What we possibly could do is adding
>>>>>>> a queryattribute that is set in a query parser (you can do that
>>>>>>> yourself though).
>>>>>>>
>>>>>>> not sure if it is worth it and if we should do it.
>>>>>>>
>>>>>>> simon
>>>>>>>
>>>>>>> On Wed, Oct 26, 2011 at 8:58 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> QueryParser and TokenStreams are clearly separated, there is no way
>>>>>>>> to get the query string from inside a TokenStream (and there cannot
>>>>>>>> be, because QP is a consumer of the TS, which is used not only for
>>>>>>>> query parsing). The only chance you have is to use a ThreadLocal
>>>>>>>> that you set before the query is parsed and then use it in the
>>>>>>>> TokenFilter.
>>>>>>>>
>>>>>>>> Uwe
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Bernd Fehling [mailto:bernd.fehling@uni-bielefeld.de]
>>>>>>>>> Sent: Wednesday, October 26, 2011 8:33 AM
>>>>>>>>> To: dev@lucene.apache.org
>>>>>>>>> Subject: accessing the query string from inside TokenFilter
>>>>>>>>>
>>>>>>>>> Dear list,
>>>>>>>>> while writing some TokenFilter for my analyzer chain I need access to
>>>>>>>>> the query string from inside of my TokenFilter for some comparison,
>>>>>>>>> but the Filters are working with a TokenStream and get separate Tokens.
>>>>>>>>> Currently I couldn't get any access to the query string.
>>>>>>>>>
>>>>>>>>> It would be great to have such a functionality in lucene/solr.
>>>>>>>>>
>>>>>>>>> Should I write a jira issue for it or is there somewhere a wish list?
>>>>>>>>>
>>>>>>>>> Best regards
>>>>>>>>> Bernd
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>> --
>>>> *************************************************************
>>>> Bernd Fehling                Universitätsbibliothek Bielefeld
>>>> Dipl.-Inform. (FH)                        Universitätsstr. 25
>>>> Tel. +49 521 106-4060                   Fax. +49 521 106-4052
>>>> bernd.fehling@uni-bielefeld.de                33615 Bielefeld
>>>>
>>>> BASE - Bielefeld Academic Search Engine - www.base-search.net
>>>> *************************************************************
>>>>

-- 
*************************************************************
Bernd Fehling                Universitätsbibliothek Bielefeld
Dipl.-Inform. (FH)                        Universitätsstr. 25
Tel. +49 521 106-4060                   Fax. +49 521 106-4052
bernd.fehling@uni-bielefeld.de                33615 Bielefeld

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

