mahout-commits mailing list archives

From conflue...@apache.org
Subject [CONF] Apache Lucene Mahout > TF-IDF - Term Frequency-Inverse Document Frequency
Date Sat, 09 Jan 2010 14:55:00 GMT
Space: Apache Lucene Mahout (http://cwiki.apache.org/confluence/display/MAHOUT)
Page: TF-IDF - Term Frequency-Inverse Document Frequency (http://cwiki.apache.org/confluence/display/MAHOUT/TF-IDF+-+Term+Frequency-Inverse+Document+Frequency)

Added by David Stuart:
---------------------------------------------------------------------
{excerpt}TF-IDF is a weight often used in information retrieval and text mining. It is a
statistical measure of how important a word is to a document in a collection or corpus.
The importance increases proportionally to the number of times a word appears in the
document but is offset by the frequency of the word in the corpus.{excerpt} In other words,
if a term appears often in a document but also appears often in the corpus as a whole, it
will get a lower score. Examples are "the", "and" and "it", but depending on your source
material there may be other words that are very common to the subject matter.
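
The weighting described above can be illustrated with a minimal sketch. It assumes one common variant, the raw term count multiplied by ln(N / df), where N is the number of documents and df is the number of documents containing the term. This is not Mahout's implementation; the function name tf_idf and the three sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: weight = raw term count * ln(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


# Hypothetical toy corpus, already lowercased and split on whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
scores = tf_idf(docs)
print(sorted(scores[0].items(), key=lambda item: item[1], reverse=True))
```

In this toy corpus "the" occurs in every document, so its weight is zero even though it is the most frequent term in the first document, while "mat", which occurs in only one document, gets the highest weight there.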


 See Also:
 * http://en.wikipedia.org/wiki/Tf%E2%80%93idf
 * http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html

