flink-dev mailing list archives

From "Till Rohrmann (JIRA)" <j...@apache.org>
Subject [jira] [Created] (FLINK-1735) Add FeatureHasher to machine learning library
Date Wed, 18 Mar 2015 14:54:38 GMT
Till Rohrmann created FLINK-1735:
------------------------------------

             Summary: Add FeatureHasher to machine learning library
                 Key: FLINK-1735
                 URL: https://issues.apache.org/jira/browse/FLINK-1735
             Project: Flink
          Issue Type: Improvement
          Components: Machine Learning Library
            Reporter: Till Rohrmann


The hashing trick [1,2] is a common way to vectorize arbitrary feature values. The hash
of a feature value determines the index of the vector entry it updates. To mitigate
collisions, a second hash function determines the sign of the update value that is added
to that entry. This way, colliding updates are likely to cancel out rather than accumulate.
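
The scheme described above can be sketched as follows. This is only an illustration in
plain Python, not a proposed Flink API: the names `hash_features` and `_hash`, the salted
MD5 hashes, and the default of 16 dimensions are all assumptions made for the example.

```python
import hashlib


def _hash(token: str, salt: str) -> int:
    """Deterministic 64-bit hash; the salt makes the index and sign hashes differ.

    Hypothetical helper: MD5 is used only because it is available in the
    standard library, not because the issue prescribes it.
    """
    digest = hashlib.md5((salt + token).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_features(tokens, num_features=16):
    """Map an iterable of string features to a fixed-size signed count vector."""
    vec = [0.0] * num_features
    for tok in tokens:
        # First hash: choose which vector entry this feature updates.
        index = _hash(tok, "index:") % num_features
        # Second hash: choose the sign of the update, so that two features
        # colliding on the same index may cancel instead of piling up.
        sign = 1.0 if _hash(tok, "sign:") % 2 == 0 else -1.0
        vec[index] += sign
    return vec


if __name__ == "__main__":
    print(hash_features(["apache", "flink", "machine", "learning"], num_features=8))
```

The output dimension is fixed up front, so no dictionary of feature names has to be built
or shipped between workers; the price is that distinct features may share an entry.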

A feature hasher would also be helpful for NLP problems, where it could be used to vectorize
bag-of-words or n-gram features.

Resources:
[1] https://en.wikipedia.org/wiki/Feature_hashing
[2] http://scikit-learn.org/stable/modules/feature_extraction.html#feature-extraction



