lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Which token filter can combine 2 terms into 1?
Date Fri, 21 Dec 2012 13:27:57 GMT
If it's a fixed list and not excessively long, would synonyms work?

But if there's some kind of logic you need to apply, I don't think you're
going to find anything out of the box. The problem is that by the time a
token filter gets called, the terms are already split up, so you'll
probably have to write a custom filter that manages that logic.
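
A minimal sketch of the one-token lookahead such a filter would need, written in plain Java rather than against the Lucene API so it runs on its own. The names PairMergeSketch, PairMergingIterator, and mergePair are hypothetical and exist only for illustration. In a real TokenFilter the same logic would presumably live in incrementToken(), with captureState()/restoreState() used to hold back the read-ahead token, and the offsets and position increments would also need adjusting.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Illustration only: hypothetical names, not part of Lucene. */
final class PairMergeSketch {

    /** Wraps a token source and merges an adjacent "first second" into "firstsecond". */
    static final class PairMergingIterator implements Iterator<String> {
        private final Iterator<String> input;
        private final String first;
        private final String second;
        private String pending; // token read ahead but not yet emitted

        PairMergingIterator(Iterator<String> input, String first, String second) {
            this.input = input;
            this.first = first;
            this.second = second;
        }

        @Override
        public boolean hasNext() {
            return pending != null || input.hasNext();
        }

        @Override
        public String next() {
            // Emit the held-back token first, if there is one.
            String current;
            if (pending != null) {
                current = pending;
                pending = null;
            } else {
                current = input.next();
            }
            // Only look ahead when the current token could start the pair.
            if (current.equals(first) && input.hasNext()) {
                String lookahead = input.next();
                if (lookahead.equals(second)) {
                    return first + second; // adjacent pair: emit one merged term
                }
                pending = lookahead; // not the pair: replay it on the next call
            }
            return current;
        }
    }

    /** Convenience wrapper that runs the iterator over a whole list. */
    static List<String> mergePair(List<String> tokens, String first, String second) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = new PairMergingIterator(tokens.iterator(), first, second);
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }
}
```

With first = "t2" and second = "t2a", the stream t1 t2 t2a t3 comes out as t1 t2t2a t3, while t1 t2 t3 t2a passes through unchanged because the two terms are not adjacent.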

Best
Erick


On Fri, Dec 21, 2012 at 4:16 AM, Xi Shen <davidshen84@gmail.com> wrote:

> Unfortunately, no... I am not combining every two terms into one. I am
> combining a specific pair.
>
> E.g. the Token Stream: t1 t2 t2a t3
> should be rewritten into t1 t2t2a t3
>
> But the Token Stream: t1 t2 t3 t2a
> should not be rewritten; it is already correct.
>
>
> On Fri, Dec 21, 2012 at 5:00 PM, Alan Woodward <alan.woodward@romseysoftware.co.uk> wrote:
>
> > Have a look at ShingleFilter:
> >
> > http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/shingle/ShingleFilter.html
> >
> > On 21 Dec 2012, at 08:42, Xi Shen wrote:
> >
> > > I have to use the white space and word delimiter to process the input
> > > first. I tried many combinations, and it seems to me that it is
> > > inevitable that the term will be split into two :(
> > >
> > > I think developing my own filter is the only solution... but I just
> > > cannot find a guide to help me understand what I need to do to
> > > implement a TokenFilter.
> > >
> > >
> > > On Fri, Dec 21, 2012 at 4:03 PM, Danil ŢORIN <torindan@gmail.com> wrote:
> > >
> > >> Easiest way would be to pre-process your input and join those 2 tokens
> > >> before splitting them by white space.
> > >>
> > >> But from the given context I might be missing some details... still worth a shot.
> > >>
> > >> On Fri, Dec 21, 2012 at 9:50 AM, Xi Shen <davidshen84@gmail.com> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> I am looking for a token filter that can combine 2 terms into 1. E.g.
> > >>>
> > >>> the input has been tokenized by white space:
> > >>>
> > >>> t1 t2 t2a t3
> > >>>
> > >>> I want a filter that outputs:
> > >>>
> > >>> t1 t2t2a t3
> > >>>
> > >>> I know it is a very special case, and I am thinking about developing a
> > >>> filter of my own. But I cannot figure out which API I should use to
> > >>> look for terms in a Token Stream.
> > >>>
> > >>> --
> > >>> Regards,
> > >>> David Shen
> > >>>
> > >>> http://about.me/davidshen
> > >>> https://twitter.com/#!/davidshen84
> > >>>
> > >>
> > >
> > >
> > >
> >
> >
>
>
>
