lucene-java-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: Which token filter can combine 2 terms into 1?
Date Fri, 21 Dec 2012 22:44:28 GMT
You still have the query parser's parsing before analysis to deal with, no 
matter what magic you code in your analyzer.
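
To illustrate the point with a toy model (this is not Lucene's QueryParser; the class and method names below are made up for the example): if a parser splits the query on whitespace and hands each term to the analyzer separately, a combining step inside the analyzer never sees two adjacent terms.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical names, no Lucene classes involved.
final class PerTermAnalysisDemo {
    /** Stand-in for an analyzer chain that merges the adjacent pair "t2", "t2a". */
    static List<String> analyze(String text) {
        String[] tokens = text.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            boolean pair = tokens[i].equals("t2")
                    && i + 1 < tokens.length
                    && tokens[i + 1].equals("t2a");
            if (pair) {
                out.add("t2t2a");
                i++; // skip the second half of the pair
            } else {
                out.add(tokens[i]);
            }
        }
        return out;
    }

    /** Mimics a parser that splits on whitespace first, then analyzes each term alone. */
    static List<String> parseThenAnalyze(String query) {
        List<String> out = new ArrayList<>();
        for (String term : query.trim().split("\\s+")) {
            out.addAll(analyze(term)); // the analyzer only ever sees one term
        }
        return out;
    }
}
```

Analyzing the whole string "t1 t2 t2a t3" at once merges the pair, while parseThenAnalyze on the same string leaves t2 and t2a separate.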

-- Jack Krupansky

-----Original Message----- 
From: Tom
Sent: Friday, December 21, 2012 2:24 PM
To: java-user@lucene.apache.org
Subject: Re: Which token filter can combine 2 terms into 1?

On Fri, Dec 21, 2012 at 9:16 AM, Jack Krupansky
<jack@basetechnology.com> wrote:

> And to be more specific, most query parsers will have already separated
> the terms and will call the analyzer with only one term at a time, so no
> term recombination is possible for those parsed terms, at query time.
>
Most analyzers will do that, yes. But if Xi writes his own analyzer with
his own combiner filter, then he should also use this for query generation
and thus get the desired combinations / snippets there as well.

Xi, here is the recipe:
- SnippetFilter extends TokenFilter.
- SnippetFilter needs access to your lexicon: a data structure that stores
your snippets. In the general case this is a tree; walking along a branch
tells you whether a valid snippet has been built and whether the snippet
could still grow longer. (Example: "internal revenue" can be one snippet but,
depending on the next token, the longer snippet "internal revenue service"
could be built.)
- The logic of SnippetFilter.incrementToken() goes something like this: you
need a loop which retrieves tokens from the input variable until the input
is exhausted. You store each retrieved token in one or more variables x in
SnippetFilter. As long as you have a potential match against your lexicon,
you can continue in this loop. Once you realize that there is something
within x which cannot possibly become a (longer) snippet, break out of the
loop and allow the consumer to retrieve it.
- Make sure your analyzer inserts SnippetFilter at the correct spot in the
filter chain.
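
The lexicon and matching loop above can be sketched in plain Java, independent of Lucene. This is only a rough illustration under some assumptions: the class names SnippetLexicon and SnippetCombiner are made up, the matching is greedy longest-match over an in-memory token list, and the joiner string is a parameter. In a real TokenFilter the same loop would live in incrementToken() and would have to buffer the looked-ahead tokens (for example with captureState()/restoreState()) instead of indexing into a list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical lexicon: a tree keyed by token, one branch per snippet. */
final class SnippetLexicon {
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        boolean terminal; // true if the path from the root to here is a complete snippet
    }

    private final Node root = new Node();

    /** Registers a snippet, e.g. add("internal", "revenue", "service"). */
    void add(String... tokens) {
        Node node = root;
        for (String token : tokens) {
            node = node.children.computeIfAbsent(token, k -> new Node());
        }
        node.terminal = true;
    }

    /** Number of tokens spanned by the longest snippet starting at start, or 0 if none. */
    int longestMatch(List<String> tokens, int start) {
        Node node = root;
        int best = 0;
        for (int i = start; i < tokens.size(); i++) {
            node = node.children.get(tokens.get(i));
            if (node == null) {
                break; // nothing on this branch can become a (longer) snippet
            }
            if (node.terminal) {
                best = i - start + 1; // remember it, but keep looking for a longer one
            }
        }
        return best;
    }
}

final class SnippetCombiner {
    /** Rewrites the token list, joining each longest lexicon match into one token. */
    static List<String> combine(SnippetLexicon lexicon, List<String> tokens, String joiner) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int length = lexicon.longestMatch(tokens, i);
            if (length > 0) {
                out.add(String.join(joiner, tokens.subList(i, i + length)));
                i += length;
            } else {
                out.add(tokens.get(i)); // not part of any snippet: pass through unchanged
                i++;
            }
        }
        return out;
    }
}
```

With a lexicon containing only the pair (t2, t2a) and an empty joiner, the stream t1 t2 t2a t3 becomes t1 t2t2a t3, while t1 t2 t3 t2a passes through unchanged.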

Cheers
FiveMileTom





>
> -- Jack Krupansky
> -----Original Message----- From: Erick Erickson
> Sent: Friday, December 21, 2012 8:27 AM
> To: java-user
> Subject: Re: Which token filter can combine 2 terms into 1?
>
>
> If it's a fixed list and not excessively long, would synonyms work?
>
> But if there's some kind of logic you need to apply, I don't think you're
> going to find anything OOB.
> The problem is that by the time a token filter gets called, the tokens are
> already split up, so you'll probably
> have to write a custom filter that manages that logic.
>
> Best
> Erick
>
>
> On Fri, Dec 21, 2012 at 4:16 AM, Xi Shen <davidshen84@gmail.com> wrote:
>
>> Unfortunately, no...I am not combining every two terms into one. I am
>> combining a specific pair.
>>
>> E.g. the Token Stream: t1 t2 t2a t3
>> should be rewritten into t1 t2t2a t3
>>
>> But the TS: t1 t2 t3 t2a
>> should not be rewritten, as it is already correct
>>
>>
>> On Fri, Dec 21, 2012 at 5:00 PM, Alan Woodward
>> <alan.woodward@romseysoftware.co.uk> wrote:
>>
>> > Have a look at ShingleFilter:
>> > http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/shingle/ShingleFilter.html
>> >
>> > On 21 Dec 2012, at 08:42, Xi Shen wrote:
>> >
>> > > I have to use the white space and word delimiter to process the input
>> > > first. I tried many combinations, and it seems to me that it is
>> > > inevitable the term will be split into two :(
>> > >
>> > > I think developing my own filter is the only solution...but I just
>> > > cannot find a guide to help me understand what I need to do to
>> > > implement a TokenFilter.
>> > >
>> > >
>> > > On Fri, Dec 21, 2012 at 4:03 PM, Danil ŢORIN <torindan@gmail.com> wrote:
>> > >
>> > >> Easiest way would be to pre-process your input and join those 2 tokens
>> > >> before splitting them by white space.
>> > >>
>> > >> But from given context I might miss some details...still worth a shot.
>> > >>
>> > >> On Fri, Dec 21, 2012 at 9:50 AM, Xi Shen <davidshen84@gmail.com> wrote:
>> > >>
>> > >>> Hi,
>> > >>>
>> > >>> I am looking for a token filter that can combine 2 terms into 1?
>> > >>> E.g.
>> > >>>
>> > >>> the input has been tokenized by white space:
>> > >>>
>> > >>> t1 t2 t2a t3
>> > >>>
>> > >>> I want a filter that outputs:
>> > >>>
>> > >>> t1 t2t2a t3
>> > >>>
>> > >>> I know it is a very special case, and I am thinking about developing
>> > >>> a filter of my own. But I cannot figure out which API I should use to
>> > >>> look for terms in a Token Stream.
>> > >>>
>> > >>> --
>> > >>> Regards,
>> > >>> David Shen
>> > >>>
>> > >>> http://about.me/davidshen
>> > >>> https://twitter.com/#!/davidshen84
>> > >>>
>> > >>
>> > >
>> > >
>> > >
>> > > --
>> > > Regards,
>> > > David Shen
>> > >
>> > > http://about.me/davidshen
>> > > https://twitter.com/#!/davidshen84
>> >
>> >
>>
>>
>> --
>> Regards,
>> David Shen
>>
>> http://about.me/davidshen
>> https://twitter.com/#!/davidshen84
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
> 



