lucene-java-user mailing list archives

From Andrew Zhang <>
Subject Re: Phase Extraction, mainly for English
Date Tue, 06 Oct 2009 23:39:47 GMT
Hi Karl,

I think shingle is designed to make phrase search faster. It generates a lot
of phrase-like terms from word position alone and completely disregards
meaning, which isn't good enough.
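
To illustrate the difference, here is a rough Python sketch (not Lucene code). It assumes two-word shingles behave like plain adjacent word pairs, and it uses pointwise mutual information (PMI) as one possible statistic for picking out pairs that co-occur more often than chance. The function names and the min_count threshold are made up for this example.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return every adjacent word pair, roughly what a 2-word shingle emits."""
    return list(zip(tokens, tokens[1:]))


def score_phrases(docs, min_count=2):
    """Rank bigrams by PMI, keeping only pairs seen at least min_count times.

    docs is an iterable of token lists. PMI is one of several possible
    association measures; it is used here only as an assumption.
    """
    unigram_counts = Counter()
    bigram_counts = Counter()
    for tokens in docs:
        unigram_counts.update(tokens)
        bigram_counts.update(bigrams(tokens))

    total_unigrams = sum(unigram_counts.values())
    total_bigrams = sum(bigram_counts.values())

    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / total_bigrams
        p_w1 = unigram_counts[w1] / total_unigrams
        p_w2 = unigram_counts[w2] / total_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    # Highest-scoring pairs first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


docs = [
    "the nba announced its new social media guidelines".split(),
    "social media guidelines apply to the players".split(),
    "the league said social media is new to the players".split(),
]
ranked = score_phrases(docs, min_count=2)
```

Shingling alone would index every pair returned by bigrams(), including ones like ("its", "new"). The scoring step is what tries to separate real phrases from accidental neighbours, and on a real corpus the threshold and the statistic would both need tuning.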


On Tue, Oct 6, 2009 at 11:51 PM, Karl Wettin <> wrote:

> Hi Andrew,
> I think you are looking for the shingle package in contrib/analyzers.
>      karl
> On 6 Oct 2009, at 13:42, Andrew Zhang wrote:
>> Hi guys,
>> The requirement is very simple here. For example, in the sentence 'The NBA
>> formally announced its new *social media* guidelines Wednesday', I want to
>> treat '*social media*' as a single phrase term. The default English
>> analyzers that come with Lucene all deal with single words, so if you want
>> to get the most frequent terms, *social* and *media* are separated, and
>> neither of them carries the meaning of *social media*, right?
>> I know there's an approach built on a phrase dictionary, which tries to
>> match phrases already in it, much like what is done for Chinese. But is
>> there an open source solution for English? I don't want to build a phrase
>> dictionary myself, and I'd like a smart approach that can "discover"
>> phrases automatically. I have 2 million docs analyzed the normal way, all
>> single terms, which I can use as a base source. It's possible to find that
>> *social media* occurs together frequently, but I really don't know how to
>> do the reverse.
>> I tried to find some phrase analyzers, but no luck. Any advice?
>> Regards,
>> Andrew
>> --
>> Simple is best

Simple is best
