lucene-java-user mailing list archives

From Andrew Zhang <rooseve6...@gmail.com>
Subject Re: Phase Extraction, mainly for English
Date Tue, 06 Oct 2009 23:42:23 GMT
Hi Erick,

To run a query you already need to know the phrase, right? What I want is
to discover the phrases: find which words naturally occur together so often
that they should be treated as a single phrase.
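
A minimal sketch of that kind of discovery, assuming plain regex
tokenization and a pointwise mutual information (PMI) score over adjacent
word pairs. None of this is a Lucene API: the function names, the tiny stop
list, and the thresholds are hypothetical and only illustrate the idea of
counting which words occur together more often than chance.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; a real one would be much larger.
STOPWORDS = {"the", "a", "an", "its", "of", "to", "and", "in", "for", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def discover_phrases(docs, min_count=2, min_pmi=1.0):
    """Return (phrase, count, pmi) for adjacent word pairs that co-occur often.

    PMI = log2(P(w1, w2) / (P(w1) * P(w2))). A high value means the pair
    appears together more often than its individual frequencies predict.
    """
    unigrams = Counter()
    bigrams = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        unigrams.update(tokens)
        # Adjacent pairs only; pairs never span two documents.
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    scored = []
    for (w1, w2), count in bigrams.items():
        # Skip rare pairs and pairs that contain a stop word.
        if count < min_count or w1 in STOPWORDS or w2 in STOPWORDS:
            continue
        p_pair = count / total_bigrams
        p_w1 = unigrams[w1] / total_unigrams
        p_w2 = unigrams[w2] / total_unigrams
        pmi = math.log2(p_pair / (p_w1 * p_w2))
        if pmi >= min_pmi:
            scored.append((w1 + " " + w2, count, pmi))

    # Highest PMI first, then highest count, then alphabetical for stability.
    scored.sort(key=lambda item: (-item[2], -item[1], item[0]))
    return scored
```

Calling discover_phrases() on a list of document strings returns the
candidate phrases ranked by score. On a 2 million document corpus the
min_count threshold would need to be much higher, and the counts would
probably have to come from the index rather than from memory.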



On Tue, Oct 6, 2009 at 8:12 PM, Erick Erickson <erickerickson@gmail.com> wrote:

> Maybe I'm missing the problem entirely, but can you use phrase queries, or
> one of the Span* queries with a slop of 0, when searching?
>
> Best
> Erick
>
> On Tue, Oct 6, 2009 at 7:42 AM, Andrew Zhang <rooseve6891@gmail.com>
> wrote:
>
> > Hi guys,
> >
> > The requirement is very simple. For example, in the sentence 'The NBA
> > formally announced its new *social media* guidelines Wednesday', I want
> > to treat '*social media*' as a single phrase term. The default English
> > analyzers that come with Lucene all deal with single words, so if you
> > want the most frequent terms, *social* and *media* are separated, and
> > neither of them carries the meaning of *social media*, right?
> >
> > I know there is an approach based on a phrase dictionary, which matches
> > phrases already listed there, much like what is done for Chinese. But is
> > there an open source solution for English? I don't want to build a
> > phrase dictionary myself, and I also want a smart way that can
> > "discover" phrases automatically. I have 2 million docs analyzed the
> > normal way, all single terms, which I can use as a base source. It should
> > be possible to find that *social media* occurs together frequently, but I
> > really don't know how to work backwards from the single terms.
> >
> > I tried to find some phrase analyzers, but had no luck. Any advice?
> >
> > Regards,
> > Andrew
> > --
> > Simple is best
> >
>



-- 
Simple is best
