lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrew Zhang <rooseve6...@gmail.com>
Subject Re: Phase Extraction, mainly for English
Date Tue, 06 Oct 2009 23:39:47 GMT
Hi Karl,

I think shingle is designed to make the phase search faster, it'll generate
a lot of "seemed like" phase by pos only and completely disregard the
meaning, that's not good enough.

Regards,
Andrew

On Tue, Oct 6, 2009 at 11:51 PM, Karl Wettin <karl.wettin@gmail.com> wrote:

> Hi Andrew,
>
> I think you are looking for the shingle package in contrib/analyzers.
>
>
>      karl
>
> 6 okt 2009 kl. 13.42 skrev Andrew Zhang:
>
>
>  Hi guys,
>>
>> The requirement is very simple here, e.g. for this sentence, 'The NBA
>> formally announced its new *social media* guidelines Wednesday',  I want
>> to
>> treat '*social media*' as a whole phase term. The default english
>> analyzers
>> came with lucene all deal with single word, so it you want to get the most
>> frequent terms, *social *and *media* are separated, and each of them can't
>> represent a good meaning as *social media*, right?
>>
>> I know there's a way built on some phase dictionary, and try to match the
>> phase already there, very like the way to do with chinese language, but is
>> there an open source solution for english, I mean I don't want to build a
>> phase dictionary myself, and I also want a smart way, which can "discover"
>> the phase automatically. I got 2 millions docs analyzered the norma way,
>> all
>> single terms, which I can use as a base source, and it's possible to find
>> that *social media *came together frequently, but I really don't know
>> what's
>> the reverse way.
>>
>> I tried to find some phase analyzers, but no luck. so any advices?
>>
>> Regards,
>> Andrew
>> --
>> Simple is best
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


-- 
Simple is best

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message