lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wettin <>
Subject Re: Phase Extraction, mainly for English
Date Tue, 06 Oct 2009 15:51:35 GMT
Hi Andrew,

I think you are looking for the shingle package in contrib/analyzers.


6 okt 2009 kl. 13.42 skrev Andrew Zhang:

> Hi guys,
> The requirement is very simple here, e.g. for this sentence, 'The NBA
> formally announced its new *social media* guidelines Wednesday',  I  
> want to
> treat '*social media*' as a whole phase term. The default english  
> analyzers
> came with lucene all deal with single word, so it you want to get  
> the most
> frequent terms, *social *and *media* are separated, and each of them  
> can't
> represent a good meaning as *social media*, right?
> I know there's a way built on some phase dictionary, and try to  
> match the
> phase already there, very like the way to do with chinese language,  
> but is
> there an open source solution for english, I mean I don't want to  
> build a
> phase dictionary myself, and I also want a smart way, which can  
> "discover"
> the phase automatically. I got 2 millions docs analyzered the norma  
> way, all
> single terms, which I can use as a base source, and it's possible to  
> find
> that *social media *came together frequently, but I really don't  
> know what's
> the reverse way.
> I tried to find some phase analyzers, but no luck. so any advices?
> Regards,
> Andrew
> -- 
> Simple is best

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message