lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: accessing the query string from inside TokenFilter
Date Wed, 26 Oct 2011 12:09:49 GMT
Use a query parser that doesn't break on whitespace as a workaround?
Or, we can start thinking about how to fix QueryParser
(https://issues.apache.org/jira/browse/LUCENE-2605)

The bug is that QueryParser tries to be a Tokenizer and breaks on whitespace.
Allowing tokenizer access to the query string would just mean that
your tokenizer hacks around this by trying to be a QueryParser, too,
making matters even worse!
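
To make the whitespace problem concrete, here is a toy sketch in plain Java. It is not Lucene code: the class and method names are made up, and the "analyzer" is just a whitespace split followed by a bigram step standing in for something like ShingleFilter. It only shows why multi-word filters see nothing useful once the parser has already split the query.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these names are hypothetical and not part of Lucene.
final class WhitespaceSplitDemo {

    /** Toy analyzer: whitespace tokenizer followed by a bigram (shingle) step. */
    static List<String> analyze(String text) {
        String[] words = text.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            out.add(words[i]);
            if (i + 1 < words.length) {
                out.add(words[i] + " " + words[i + 1]); // bigram needs a neighbour
            }
        }
        return out;
    }

    /** Toy parser that splits on whitespace first, then analyzes each chunk alone. */
    static List<String> parseSplittingFirst(String query) {
        List<String> out = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            out.addAll(analyze(chunk)); // the analyzer only ever sees one word
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("new york city"));
        System.out.println(parseSplittingFirst("new york city"));
    }
}
```

Analyzing the whole string yields the bigrams "new york" and "york city"; splitting first yields only the three single words, because each call to the analyzer sees one word in isolation.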


On Wed, Oct 26, 2011 at 8:05 AM, Bernd Fehling
<bernd.fehling@uni-bielefeld.de> wrote:
> OK, I think "query string" is a bit too specific. More generally,
> what I need is access from inside a filter to the complete string
> (not only the token) being analyzed.
>
> A very dirty workaround would be a "collector filter" which collects all
> tokens after WhitespaceTokenizer and makes it somehow available for
> the following filters, or not?
> So at least at the last run of incrementToken() I have the original string.
>
> Bernd
>
> On 26.10.2011 10:26, Uwe Schindler wrote:
>>
>> The input from StringReader does not help you:
>> - in the case of QueryParser it is *not* the query string!!!
>> - storing it in an attribute would blow up your heap for real documents
>>
>> Uwe
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe@thetaphi.de
>>
>>
>>> -----Original Message-----
>>> From: Bernd Fehling [mailto:bernd.fehling@uni-bielefeld.de]
>>> Sent: Wednesday, October 26, 2011 10:06 AM
>>> To: dev@lucene.apache.org
>>> Subject: Re: accessing the query string from inside TokenFilter
>>>
>>> From what I can see in the debugger the analyzer chain is implemented as a
>>> stack with last filter at the bottom and the first filter at the top.
>>>
>>> An analyzer query chain of:
>>> charFilter: MappingCharFilterFactory
>>> tokenizer : WhitespaceTokenizerFactory
>>> filter    : PatternReplaceFilterFactory
>>> filter    : LowerCaseFilterFactory
>>> filter    : ShingleFilterFactory
>>> filter    : SynonymFilterFactory
>>>
>>> has a chain of:
>>> this.input(SynonymFilter) -->  input(ShingleFilter) -->
>>> input(LowerCaseFilter) -->  input(PatternReplaceFilter) -->
>>> input(WhitespaceTokenizer) -->  input(MappingCharFilter) -->
>>> input(CharReader) -->  input(StringReader).str
>>>
>>> So I can always "see" the input of StringReader, but can I access it?
>>>
>>> Bernd
>>>
>>> On 26.10.2011 09:37, Chris Male wrote:
>>>>
>>>> We've also lost the full query string by the time the QP creates its
>>>> TokenStream, right? Because the QP tokenizes on whitespace.
>>>>
>>>> On Wed, Oct 26, 2011 at 8:32 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
>>>>
>>>>> Hi Simon,
>>>>>
>>>>> The problem is the exchanged consumer/producer role. Once the
>>>>> TokenStream calls clearAttributes() the attributes are gone, but
>>>>> the query parser can only set the attribute *before* calling
>>>>> incrementToken(), so you have no chance to get them, as the Tokenizer
>>>>> cleared it before any filter can read it (unless we use an attribute
>>>>> whose clear() is a no-op, which would fail lots of tests, as it's a hack).
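
The ordering problem described above can be mimicked with a toy model in plain Java. This is only a sketch under assumptions: the shared map stands in for the attribute state of a TokenStream, and all names are hypothetical rather than Lucene API.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: a shared map stands in for the attribute state of a
// TokenStream. None of these names are Lucene API.
final class ClearAttributesDemo {

    static final Map<String, String> ATTRIBUTES = new HashMap<>();

    /** Stand-in for Tokenizer.incrementToken(): clears all state, then sets the term. */
    static void tokenizerIncrementToken(String term) {
        ATTRIBUTES.clear();           // plays the role of clearAttributes()
        ATTRIBUTES.put("term", term);
    }

    /** Stand-in for a downstream filter trying to read the query attribute. */
    static String filterReadsQuery() {
        return ATTRIBUTES.get("query");
    }

    /** The consumer sets the attribute first, then pulls a token. */
    static String run() {
        ATTRIBUTES.put("query", "new york city"); // set *before* incrementToken()
        tokenizerIncrementToken("new");           // wipes the value again
        return filterReadsQuery();                // the value is already gone
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this model the filter always reads null, because the tokenizer step clears everything the consumer set before any filter gets to run.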
>>>>>
>>>>> Uwe
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Simon Willnauer [mailto:simon.willnauer@googlemail.com]
>>>>>> Sent: Wednesday, October 26, 2011 9:21 AM
>>>>>> To: dev@lucene.apache.org
>>>>>> Subject: Re: accessing the query string from inside TokenFilter
>>>>>>
>>>>>> What Uwe says is correct though. What we possibly could do is add
>>>>>> a query attribute that is set in a query parser (you can do that
>>>>>> yourself though).
>>>>>>
>>>>>> Not sure if it is worth it and if we should do it.
>>>>>>
>>>>>> simon
>>>>>>
>>>>>> On Wed, Oct 26, 2011 at 8:58 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> QueryParser and TokenStreams are clearly separated; there is no way
>>>>>>> to get the query string from inside a TokenStream (and there cannot
>>>>>>> be, because QP is a consumer of the TS, which is used not only for
>>>>>>> query parsing). The only chance you have is to use a ThreadLocal
>>>>>>> that you set before the query is parsed and then use it in the
>>>>>>> TokenFilter.
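
A minimal sketch of the ThreadLocal idea, in plain Java so it runs on its own. Everything here is an assumption for illustration: CurrentQuery, QueryAwareFilter and the parse() wrapper are hypothetical names, and the filter is a stand-in that works on a list of strings where a real one would extend Lucene's TokenFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical holder, not part of Lucene or Solr.
final class CurrentQuery {
    private static final ThreadLocal<String> QUERY = new ThreadLocal<>();

    private CurrentQuery() {}

    static void set(String query) { QUERY.set(query); }
    static String get() { return QUERY.get(); }
    static void clear() { QUERY.remove(); }
}

// Stand-in for a TokenFilter: it sees single tokens but can look up the
// full query string through the thread-local holder.
final class QueryAwareFilter {
    /** Tags each token with its offset in the full query, if one is set. */
    static List<String> filter(List<String> tokens) {
        String query = CurrentQuery.get();
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(query == null ? token : token + "@" + query.indexOf(token));
        }
        return out;
    }
}

final class ThreadLocalQueryDemo {
    /** Hypothetical wrapper around the call into the query parser. */
    static List<String> parse(String query) {
        CurrentQuery.set(query); // set before parsing starts
        try {
            // Stand-in for the parser, which splits on whitespace.
            return QueryAwareFilter.filter(Arrays.asList(query.split("\\s+")));
        } finally {
            CurrentQuery.clear(); // do not leak the value on pooled threads
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("new york city"));
    }
}
```

The try/finally matters in a servlet container, where threads are reused across requests and a stale value would otherwise be visible to the next query. The sketch also assumes parsing and analysis run on the same thread that set the value.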
>>>>>>>
>>>>>>> Uwe
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Bernd Fehling [mailto:bernd.fehling@uni-bielefeld.de]
>>>>>>>> Sent: Wednesday, October 26, 2011 8:33 AM
>>>>>>>> To: dev@lucene.apache.org
>>>>>>>> Subject: accessing the query string from inside TokenFilter
>>>>>>>>
>>>>>>>> Dear list,
>>>>>>>> while writing a TokenFilter for my analyzer chain I need access to
>>>>>>>> the query string from inside of my TokenFilter for some comparison,
>>>>>>>> but the filters work with a TokenStream and get separate tokens.
>>>>>>>> Currently I couldn't get any access to the query string.
>>>>>>>>
>>>>>>>> It would be great to have such functionality in Lucene/Solr.
>>>>>>>>
>>>>>>>> Should I write a Jira issue for it, or is there a wish list somewhere?
>>>>>>>>
>>>>>>>> Best regards
>>>>>>>> Bernd
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>> --
>>> *************************************************************
>>> Bernd Fehling                Universitätsbibliothek Bielefeld
>>> Dipl.-Inform. (FH)                        Universitätsstr. 25
>>> Tel. +49 521 106-4060                   Fax. +49 521 106-4052
>>> bernd.fehling@uni-bielefeld.de                33615 Bielefeld
>>>
>>> BASE - Bielefeld Academic Search Engine - www.base-search.net
>>> *************************************************************
>



-- 
lucidimagination.com
