flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1735) Add FeatureHasher to machine learning library
Date Sun, 10 May 2015 20:06:59 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537333#comment-14537333 ]

ASF GitHub Bot commented on FLINK-1735:

Github user FelixNeutatz commented on the pull request:

    This will be further implemented here: https://github.com/apache/flink/pull/665

> Add FeatureHasher to machine learning library
> ---------------------------------------------
>                 Key: FLINK-1735
>                 URL: https://issues.apache.org/jira/browse/FLINK-1735
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Felix Neutatz
>              Labels: ML
> Using the hashing trick [1,2] is a common way to vectorize arbitrary feature values.
> The hash of a feature value determines the index of its vector entry. To mitigate
> possible collisions, a second hash function determines the sign of the update value
> that is added to the vector entry. This way, collisions are likely to cancel out.
> A feature hasher would also be helpful for NLP problems, where it could be used to
> vectorize bag-of-words or n-gram features.
> Resources:
> [1] [https://en.wikipedia.org/wiki/Feature_hashing]
> [2] [http://scikit-learn.org/stable/modules/feature_extraction.html#feature-extraction]
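
For illustration, the signed hashing scheme described in the quoted issue can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the implementation from the pull request: the names `hash_features` and `_hash`, the use of salted MD5 as the two hash functions, and the default vector size are all hypothetical choices made for the example.

```python
import hashlib


def _hash(token: str, salt: str) -> int:
    """Deterministic 64-bit hash of `token`; `salt` selects the hash function.

    Hypothetical helper: salted MD5 stands in for two independent hashes.
    """
    digest = hashlib.md5((salt + token).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_features(tokens, num_features=16):
    """Map an iterable of string features to a fixed-size vector of signed counts."""
    vec = [0.0] * num_features
    for tok in tokens:
        # First hash: pick the vector entry for this feature value.
        index = _hash(tok, "index") % num_features
        # Second hash: pick the sign of the update added to that entry.
        sign = 1.0 if _hash(tok, "sign") % 2 == 0 else -1.0
        vec[index] += sign
    return vec


# Example: vectorize a small bag of words into 8 dimensions.
vector = hash_features(["flink", "ml", "flink"], num_features=8)
```

The output length depends only on `num_features`, not on the vocabulary, so no dictionary of feature names has to be built or stored. When two different features land on the same index with opposite signs, their contributions offset each other instead of accumulating.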

This message was sent by Atlassian JIRA