lucene-java-user mailing list archives

From Simon Willnauer <simon.willna...@googlemail.com>
Subject Re: custom attributs in tokens
Date Thu, 25 Nov 2010 15:56:29 GMT
On Thu, Nov 25, 2010 at 3:25 PM, Jan Kurella <jan.kurella@nokia.com> wrote:
> Hi Simon,
>
> On 25.11.2010 10:40, ext Simon Willnauer wrote:
>>
>> Hi Jan,
>>
>> On Wed, Nov 24, 2010 at 9:12 AM,<jan.kurella@nokia.com>  wrote:
>>>
>>> Of course:
>>>
>>> We are trying to search in documents that contain text in several
>>> languages. We are also investigating other approaches*, so this is not about
>>> finding other variants.
>>> The goal is to match only tokens from one or more given languages, and not
>>> to match a token that happens to be spelled the same in another language.
>>>
>>> For the payloads my plan is to add the correct language to each and every
>>> token during indexing (I'm not sure how to solve this best, but I'm sure
>>> this can be solved at least with lucene directly).
>>> On the search side my current idea is to wrap a TermPositions and skip
>>> all docs where the current payload is not one of the requested languages.
>>> I probably need to use my own Query/Weight for this?
>>
>> You don't need to start from scratch here. I suggest you look at
>> SpanTermQuery and TermSpans, which uses DocsAndPositionsEnum (or rather
>> TermPositions in non-trunk versions). TermSpans lets you
>> override #next() and #skipTo(), which, from what I understand, is what
>> you are looking for, right?
>
> Just to get it right: I only subclass SpanTermQuery to override the
> getSpans(Reader) method so that it returns MyTermSpans.
> MyTermSpans is a subclass of TermSpans where I just extend #next() and
> #skipTo() to keep advancing until my desired payload is found.

That sounds about right...
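
For illustration, here is a minimal sketch of that skipping logic. It
deliberately does not use Lucene's classes: PositionCursor, ArrayCursor and
LanguageFilterCursor are hypothetical stand-ins, and storing the language as a
one-byte id per token is an assumption. In a real TermSpans subclass the same
loops would presumably live in the overridden next() and skipTo().

```java
/** Hypothetical stand-in for a positions enumeration; NOT a Lucene interface. */
interface PositionCursor {
    boolean next();             // advance to the next position; false when exhausted
    boolean skipTo(int target); // advance to the first position with doc() >= target
    int doc();
    int position();
    byte payload();             // assumed: one-byte language id stored with the token
}

/** In-memory cursor over {doc, position, languageId} rows sorted by doc, position. */
final class ArrayCursor implements PositionCursor {
    private final int[][] entries;
    private int idx = -1;

    ArrayCursor(int[][] entries) { this.entries = entries; }

    public boolean next() {
        if (idx < entries.length) idx++;
        return idx < entries.length;
    }

    public boolean skipTo(int target) {
        // Always advances at least once, then continues until doc() >= target.
        do {
            if (!next()) return false;
        } while (doc() < target);
        return true;
    }

    public int doc() { return entries[idx][0]; }
    public int position() { return entries[idx][1]; }
    public byte payload() { return (byte) entries[idx][2]; }
}

/** Wraps a cursor and only stops on positions whose payload is a wanted language. */
final class LanguageFilterCursor implements PositionCursor {
    private final PositionCursor in;
    private final java.util.Set<Byte> wanted = new java.util.HashSet<>();

    private LanguageFilterCursor(PositionCursor in) { this.in = in; }

    static LanguageFilterCursor forLanguages(PositionCursor in, byte... languages) {
        LanguageFilterCursor c = new LanguageFilterCursor(in);
        for (byte l : languages) c.wanted.add(l);
        return c;
    }

    public boolean next() {
        // Keep advancing the wrapped cursor until a wanted language shows up.
        while (in.next()) {
            if (wanted.contains(in.payload())) return true;
        }
        return false;
    }

    public boolean skipTo(int target) {
        if (!in.skipTo(target)) return false;
        // The landing position may carry the wrong language; move on if so.
        return wanted.contains(in.payload()) || next();
    }

    public int doc() { return in.doc(); }
    public int position() { return in.position(); }
    public byte payload() { return in.payload(); }
}

final class LanguageFilterDemo {
    static final byte EN = 1, DE = 2, FR = 3; // illustrative language ids

    // {doc, position, languageId}
    static final int[][] SAMPLE = {
        {0, 0, EN}, {0, 3, DE}, {1, 2, DE}, {2, 1, FR}, {2, 5, EN}, {4, 0, DE}
    };

    public static void main(String[] args) {
        PositionCursor c = LanguageFilterCursor.forLanguages(new ArrayCursor(SAMPLE), EN);
        while (c.next()) {
            System.out.println("doc=" + c.doc() + " pos=" + c.position());
        }
    }
}
```

Note that this filters matches out entirely, rather than only lowering their
score, which is the difference from the Similarity-based idea discussed below.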
>
> Sounds pretty easy and straightforward.
>>>
>>> Another approach would be to just override the Similarity, but this will
>>> only influence scoring and, depending on the underlying query, not completely
>>> skip the token - I have to test the difference in the final score between
>>> these approaches.
>>
>> Well, as you correctly figured, this is really more for scoring.
>
> So if I'm also going to use the scoring stuff, I'd rather subclass
> PayloadTermQuery then.

Hmm, I am not a span expert, but I guess that would make it easier.
>>>
>>> This one blog made me curious if there is already something similar, that
>>> skips TermPositions based on given attributes? I could imagine something
>>> similar to the current Tokenattribute concept during index time, but also
>>> available during search and controlled by a similarity...
>>
>> Actually, in Lucene 4.0 each Flex-Enum has an AttributeSource that
>> allows you to add custom attributes to your enumerations. Yet there is
>> no logic that skips based on that.
>>
>> Simon
>
> Lucene 4.0 is a little far away today? If the above approach performs well
> (and it sounds like it will), it should be good enough for now.

I was just saying that this is on the way... and yeah, you might need
to wait a bit until 4.0 :)

simon

>
> Jan
>
>
>
