lucene-java-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: How to get word importance in lucene index
Date Fri, 23 Jul 2010 11:43:52 GMT
Couple of thoughts inline...

On Jul 22, 2010, at 10:44 PM, Xaida wrote:

> 
> Hi all!
> 
> I need to find out how important a word is across the entire document
> collection indexed in Lucene. I want to extract some "representative
> words", say concepts that are common and can represent the whole
> collection, i.e. collection "keywords". I did full-text indexing, and the
> only field I am using is the text content, because the document titles are
> mostly not representative (numbers, codes, etc.).
> 
> So, if I calculate tf-idf, it gives me the importance of a single term with
> respect to a single document.

TF gives you the importance of a term within a single document.
IDF is the inverse of the term's document frequency across the collection: the more
documents a term appears in, the lower its IDF.

> But if that word is repeated across documents, how can
> I calculate its total importance within the index?


Lucene can also normalize by document length, which is often part of these calculations too.

This information can be retrieved from TermDocs (per-document term frequency), TermEnum
(document frequency per term), etc.
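
As a rough sketch of how those numbers could be combined into one collection-wide score per
term: the plain Java below does not touch Lucene at all. A hypothetical in-memory list of
term -> frequency maps stands in for what TermDocs and TermEnum would report, and the class
and method names are made up for illustration. The scoring choice (sum of sqrt(tf) over all
documents, multiplied by an idf of 1 + ln(numDocs / (docFreq + 1))) is an assumption modeled
on the tf and idf formulas of Lucene's DefaultSimilarity, not the only reasonable way to
aggregate.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CollectionTermScores {

    // idf in the style of Lucene's DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // docs: one map per document, term -> raw frequency in that document.
    // Returns term -> (sum over documents of sqrt(tf)) * idf.
    static Map<String, Double> score(List<Map<String, Integer>> docs) {
        Map<String, Double> tfSum = new HashMap<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (Map<String, Integer> doc : docs) {
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                // Dampened term frequency, accumulated across documents.
                tfSum.merge(e.getKey(), Math.sqrt(e.getValue()), Double::sum);
                // Each document containing the term counts once.
                docFreq.merge(e.getKey(), 1, Integer::sum);
            }
        }
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Double> e : tfSum.entrySet()) {
            int df = docFreq.get(e.getKey());
            scores.put(e.getKey(), e.getValue() * idf(df, docs.size()));
        }
        return scores;
    }

    // Hypothetical toy collection: "the" is in every document, "lucene" in one.
    static List<Map<String, Integer>> sampleDocs() {
        List<Map<String, Integer>> docs = new ArrayList<>();
        Map<String, Integer> d1 = new HashMap<>();
        d1.put("lucene", 4);
        d1.put("the", 1);
        Map<String, Integer> d2 = new HashMap<>();
        d2.put("the", 1);
        Map<String, Integer> d3 = new HashMap<>();
        d3.put("the", 1);
        docs.add(d1);
        docs.add(d2);
        docs.add(d3);
        return docs;
    }

    public static void main(String[] args) {
        // Print every term with its aggregated score.
        for (Map.Entry<String, Double> e : score(sampleDocs()).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Sorting the resulting map by value and taking the top entries gives a candidate keyword list.
Summing favors terms that are frequent in a few documents; averaging over the documents that
contain the term, or dividing by document length first, are alternative assumptions worth
trying.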

Also, as a related item, you may be interested in important phrases, which can often be more
helpful.  Check out https://cwiki.apache.org/confluence/display/MAHOUT/Collocations for one
way of doing that.

-Grant

---------------------
Grant Ingersoll
http://www.lucidimagination.com