mahout-user mailing list archives

From Koert Kuipers <ko...@tresata.com>
Subject vector encoding of text documents
Date Thu, 10 Jan 2013 17:21:20 GMT
i noticed that the mahout encoding of text documents in a k-means example
uses TF-IDF and then converts the documents into vectors (one vector per
document), where every word gets mapped to an int that serves as the vector
index and the word's TF-IDF score becomes the vector value at that index. it
also creates a dictionary so you can get back from the index to the word. did
i get that right?
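
to make the description above concrete, here is a minimal python sketch of a dictionary-based TF-IDF encoding. it is not mahout's code: the function names, the whitespace tokenizer and the plain tf * log(N/df) weighting are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def build_dictionary(docs):
    """Map every distinct word to an int index, in order of first appearance."""
    dictionary = {}
    for doc in docs:
        for word in doc.split():
            if word not in dictionary:
                dictionary[word] = len(dictionary)
    return dictionary


def tfidf_vectors(docs, dictionary):
    """Return one sparse vector (dict of index -> weight) per document."""
    n_docs = len(docs)
    # document frequency: in how many documents each word occurs
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))
    vectors = []
    for doc in docs:
        tf = Counter(doc.split())
        # assumed weighting: raw term count times log(N / df)
        vectors.append(
            {dictionary[w]: count * math.log(n_docs / df[w]) for w, count in tf.items()}
        )
    return vectors


docs = ["the cat sat", "the dog sat down", "the cat ran"]
dictionary = build_dictionary(docs)
vectors = tfidf_vectors(docs, dictionary)

# the reverse mapping is what lets you go from a vector index back to a word
index_to_word = {i: w for w, i in dictionary.items()}
```

the vector length equals the dictionary size, so the whole corpus has to be scanned once to build the dictionary before any document can be encoded.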

why this approach as opposed to the feature hashing that i saw in other
places in mahout?
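
for comparison, a minimal python sketch of feature hashing in general. again this is not mahout's encoder code: the md5-based hash, the function names and the n_features default are assumptions made for illustration.

```python
import hashlib
from collections import Counter


def hashed_index(word, n_features):
    """Deterministically map a word to an index in [0, n_features)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_features


def hashed_vector(doc, n_features=16):
    """Encode a document as a fixed-size vector of hashed term counts."""
    vector = [0.0] * n_features
    for word, count in Counter(doc.split()).items():
        # words that collide simply add into the same slot
        vector[hashed_index(word, n_features)] += count
    return vector


vec = hashed_vector("the cat sat on the mat")
```

here the vector size is fixed up front and no dictionary is built, so documents can be encoded in a single pass. the trade-off is that there is no way to get back from an index to a word, and different words can collide on the same index.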

thanks! koert
