lucene-java-user mailing list archives

From Andrew Zhang <>
Subject Phrase Extraction, mainly for English
Date Tue, 06 Oct 2009 11:42:23 GMT
Hi guys,

The requirement is very simple. For example, in the sentence 'The NBA
formally announced its new *social media* guidelines Wednesday', I want to
treat '*social media*' as a single phrase term. The default English analyzers
that come with Lucene all work on single words, so if you want the most
frequent terms, *social* and *media* are counted separately, and neither of
them carries the meaning of *social media* on its own, right?

I know one approach is to build a phrase dictionary and match the text
against the phrases already in it, much like what is done for Chinese. But is
there an open source solution for English? I don't want to build a phrase
dictionary myself, and I'd like a smart approach that can "discover" phrases
automatically. I have 2 million docs analyzed the normal way, all single
terms, which I can use as a base source. It should be possible to find that
*social* and *media* frequently occur together, but I really don't know how
to go the other way, from single terms back to phrases.
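
To make the co-occurrence idea concrete, here is a minimal sketch in plain
Java. It does not use any Lucene API. It counts adjacent word pairs in raw
text and keeps the pairs that occur often and score well on pointwise mutual
information (PMI). PMI is just one possible score, and the class name, the
method name, the tokenization and both thresholds are hypothetical choices for
illustration. It also assumes the original text (or term positions) is
available, since adjacency cannot be recovered from bare term counts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: finds two-word phrase candidates.
class PhraseCandidates {

    // Returns adjacent word pairs seen at least minPairCount times whose
    // PMI, ln(count(a b) * N / (count(a) * count(b))), is at least minPmi.
    // N is the total number of terms; this is a rough estimate, not a tuned one.
    public static List<String> findPhrases(List<String> sentences,
                                           int minPairCount, double minPmi) {
        Map<String, Integer> termCounts = new HashMap<>();
        Map<String, Integer> pairCounts = new HashMap<>();
        long totalTerms = 0;

        for (String sentence : sentences) {
            String prev = null; // pairs never cross a sentence boundary
            // Naive tokenization: lowercase, split on anything but a-z.
            for (String term : sentence.toLowerCase().split("[^a-z]+")) {
                if (term.isEmpty()) {
                    continue;
                }
                termCounts.merge(term, 1, Integer::sum);
                totalTerms++;
                if (prev != null) {
                    pairCounts.merge(prev + " " + term, 1, Integer::sum);
                }
                prev = term;
            }
        }

        List<String> phrases = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : pairCounts.entrySet()) {
            int pairCount = entry.getValue();
            if (pairCount < minPairCount) {
                continue; // too rare to trust
            }
            String[] terms = entry.getKey().split(" ");
            double pmi = Math.log((double) pairCount * totalTerms
                    / ((double) termCounts.get(terms[0]) * termCounts.get(terms[1])));
            if (pmi >= minPmi) {
                phrases.add(entry.getKey());
            }
        }
        return phrases;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "The NBA announced new social media guidelines",
                "Social media is popular",
                "The social media team met the NBA",
                "The team read the guidelines");
        // Thresholds chosen by hand for this tiny sample.
        System.out.println(findPhrases(sentences, 2, 1.8));
    }
}
```

With 2 million docs the counts would not fit comfortably in in-memory maps
like this, so the same counting would need to be done in batches or on disk.
The scoring step stays the same.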

I tried to find some phrase analyzers, but had no luck. Any advice?

Simple is best
