lucene-java-user mailing list archives

From Karl Wettin <karl.wet...@gmail.com>
Subject Re: Phase Extraction, mainly for English
Date Wed, 07 Oct 2009 00:24:11 GMT
There are many uses for shingles.

I've used them to find common phrases in text, which is my
understanding of what you are trying to achieve. It works rather well, is a
very simple solution, and is easy on resources compared to real semantic
analysis.

You'll get a lot of shingles such as "there is" and "we are",
but using a stop word list to filter out any shingle containing one or
more stop words should do the trick (I did that in post-processing,
keeping all shingles in my index). It will probably
require a bit of manual work, depending on your corpus, to get a really
clean list of common phrases that make sense. Just create a list,
inspect it by eye, and try to find patterns in the phrases you
want to get rid of. You might also want to look for punctuation in
your text to avoid creating shingles that span different
sentences. There is a pretty good sentence extraction tool in GATE you
can use.
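
To make the steps above concrete, here is a minimal sketch in plain Java. It
does not use the Lucene shingle package; the class name, method names, and the
tiny stop list are made up for illustration, and the sentence splitting is a
naive punctuation split (it will break on abbreviations), standing in for a
real sentence extraction tool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
class ShinglePhrases {
    // Tiny illustrative stop list; a real one would be much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "the", "a", "an", "is", "are", "we", "there", "its", "of", "to", "and", "in"));

    /** Counts word shingles of the given size without crossing sentence punctuation. */
    static Map<String, Integer> countShingles(String text, int size) {
        Map<String, Integer> counts = new HashMap<>();
        // Naive sentence split on punctuation (assumption: good enough for a sketch).
        for (String sentence : text.split("[.!?;:]+")) {
            List<String> words = new ArrayList<>();
            for (String w : sentence.toLowerCase().split("[^\\p{L}\\p{Nd}']+")) {
                if (!w.isEmpty()) {
                    words.add(w);
                }
            }
            // Slide a window of `size` words over the sentence.
            for (int i = 0; i + size <= words.size(); i++) {
                String shingle = String.join(" ", words.subList(i, i + size));
                counts.merge(shingle, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Post-processing: drop shingles containing a stop word, keep frequent ones. */
    static List<String> commonPhrases(Map<String, Integer> counts, int minCount) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() < minCount) {
                continue;
            }
            boolean hasStopWord = false;
            for (String w : e.getKey().split(" ")) {
                if (STOP_WORDS.contains(w)) {
                    hasStopWord = true;
                    break;
                }
            }
            if (!hasStopWord) {
                result.add(e.getKey());
            }
        }
        // Most frequent first, ties broken alphabetically.
        result.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return result;
    }
}
```

As in my setup, all shingles are kept in the counts and the stop word filter
is only applied afterwards, so you can tune the stop list and the minimum
count without recounting anything.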


      karl

On 7 Oct 2009, at 01:39, Andrew Zhang wrote:

> Hi Karl,
>
> I think shingles are designed to make phrase search faster. They'll
> generate a lot of "seemed like" phrases by position only and completely
> disregard the meaning, which is not good enough.
>
> Regards,
> Andrew
>
> On Tue, Oct 6, 2009 at 11:51 PM, Karl Wettin <karl.wettin@gmail.com> wrote:
>
>> Hi Andrew,
>>
>> I think you are looking for the shingle package in contrib/analyzers.
>>
>>
>>     karl
>>
>> On 6 Oct 2009, at 13:42, Andrew Zhang wrote:
>>
>>> Hi guys,
>>>
>>> The requirement is very simple here. For example, in the sentence 'The NBA
>>> formally announced its new *social media* guidelines Wednesday', I want to
>>> treat '*social media*' as a single phrase term. The default English
>>> analyzers that come with Lucene all deal with single words, so if you want
>>> to get the most frequent terms, *social* and *media* are separated, and
>>> neither of them carries the meaning of *social media*, right?
>>>
>>> I know there's a way built on a phrase dictionary that tries to match
>>> phrases already in it, much like what is done for Chinese, but is there
>>> an open source solution for English? I don't want to build a phrase
>>> dictionary myself, and I also want a smart way that can "discover"
>>> phrases automatically. I have 2 million docs analyzed the normal way,
>>> all single terms, which I can use as a base source, and it's possible to
>>> find that *social media* occurs together frequently, but I really don't
>>> know how to go the reverse way.
>>>
>>> I tried to find some phrase analyzers, but no luck. Any advice?
>>>
>>> Regards,
>>> Andrew
>>> --
>>> Simple is best
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
> -- 
> Simple is best



