flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] (FLINK-2094) Implement Word2Vec
Date Tue, 31 Jan 2017 08:12:44 GMT
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml"> 
    <head> 
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 
        <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0"
/> <base href="https://issues.apache.org/jira" /> 
        <title>Message Title</title> 
    </head> 
    <body class="jira" style="color: #333; font-family: Arial, sans-serif; font-size: 14px;
line-height: 1.429"> 
        <table id="background-table" cellpadding="0" cellspacing="0" width="100%" style="border-collapse:
collapse; mso-table-lspace: 0pt; mso-table-rspace: 0pt; background-color: #f5f5f5; border-collapse:
collapse; mso-table-lspace: 0pt; mso-table-rspace: 0pt"> 
            <!-- header here --> 
            <tr> 
                <td id="header-pattern-container" style="padding: 0px; border-collapse:
collapse; padding: 10px 20px"> 
                    <table id="header-pattern" cellspacing="0" cellpadding="0" border="0"
style="border-collapse: collapse; mso-table-lspace: 0pt; mso-table-rspace: 0pt"> 
                        <tr> 
                            <td id="header-avatar-image-container" valign="top" style="padding:
0px; border-collapse: collapse; vertical-align: top; width: 32px; padding-right: 8px">
<img id="header-avatar-image" class="image_fix" src="cid:jira-generated-image-avatar-githubbot-7ff9f1a8-d2f3-446a-a178-06346f838d6a"
height="32" width="32" border="0" style="border-radius: 3px; vertical-align: top" /> 
                            </td> 
                            <td id="header-text-container" valign="middle" style="padding:
0px; border-collapse: collapse; vertical-align: middle; font-family: Arial, sans-serif; font-size:
14px; line-height: 20px; mso-line-height-rule: exactly; mso-text-raise: 1px"> <a class="user-hover"
rel="githubbot" id="email_githubbot" href="https://issues.apache.org/jira/secure/ViewProfile.jspa?name=githubbot"
style="color:#3b73af;; color: #3b73af; text-decoration: none">ASF GitHub Bot</a>
<strong>commented</strong> on <a href="https://issues.apache.org/jira/browse/FLINK-2094"
style="color: #3b73af; text-decoration: none"><img src="cid:jira-generated-image-static-improvement-b693f1b0-b9d7-4b52-882d-0733cd2ae731"
height="16" width="16" border="0" align="absmiddle" alt="Improvement" /> FLINK-2094</a>

                            </td> 
                        </tr> 
                    </table> 
                </td> 
            </tr> 
            <tr> 
                <td id="email-content-container" style="padding: 0px; border-collapse:
collapse; padding: 0 20px"> 
                    <table id="email-content-table" cellspacing="0" cellpadding="0" border="0"
width="100%" style="border-collapse: collapse; mso-table-lspace: 0pt; mso-table-rspace: 0pt;
border-spacing: 0; border-collapse: separate"> 
                        <tr> 
                            <!-- there needs to be content in the cell for it to render
in some clients --> 
                            <td class="email-content-rounded-top mobile-expand" style="padding:
0px; border-collapse: collapse; color: #fff; padding: 0 15px 0 16px; height: 15px; background-color:
#fff; border-left: 1px solid #ccc; border-top: 1px solid #ccc; border-right: 1px solid #ccc;
border-bottom: 0; border-top-right-radius: 5px; border-top-left-radius: 5px; height: 10px;
line-height: 10px; padding: 0 15px 0 16px; mso-line-height-rule: exactly">
                                &nbsp;
                            </td> 
                        </tr> 
                        <tr> 
                            <td class="email-content-main mobile-expand " style="padding:
0px; border-collapse: collapse; border-left: 1px solid #ccc; border-right: 1px solid #ccc;
border-top: 0; border-bottom: 0; padding: 0 15px 0 16px; background-color: #fff"> 
                                <table class="page-title-pattern" cellspacing="0" cellpadding="0"
border="0" width="100%" style="border-collapse: collapse; mso-table-lspace: 0pt; mso-table-rspace:
0pt"> 
                                    <tr> 
                                        <td style="vertical-align: top;; padding: 0px;
border-collapse: collapse; padding-right: 5px; font-size: 20px; line-height: 30px; mso-line-height-rule:
exactly" class="page-title-pattern-header-container"> <span class="page-title-pattern-header"
style="font-family: Arial, sans-serif; padding: 0; font-size: 20px; line-height: 30px; mso-text-raise:
2px; mso-line-height-rule: exactly; vertical-align: middle"> <a href="https://issues.apache.org/jira/browse/FLINK-2094"
style="color: #3b73af; text-decoration: none">Re: Implement Word2Vec</a> </span>

                                        </td> 
                                    </tr> 
                                </table> 
                            </td> 
                        </tr> 
                        <tr> 
                            <td id="text-paragraph-pattern-top" class="email-content-main
mobile-expand  comment-top-pattern" style="padding: 0px; border-collapse: collapse; border-left:
1px solid #ccc; border-right: 1px solid #ccc; border-top: 0; border-bottom: 0; padding: 0
15px 0 16px; background-color: #fff; border-bottom: none; padding-bottom: 0"> 
                                <table class="text-paragraph-pattern" cellspacing="0" cellpadding="0"
border="0" width="100%" style="border-collapse: collapse; mso-table-lspace: 0pt; mso-table-rspace:
0pt; font-family: Arial, sans-serif; font-size: 14px; line-height: 20px; mso-line-height-rule:
exactly; mso-text-raise: 2px"> 
                                    <tr> 
                                        <td class="text-paragraph-pattern-container mobile-resize-text
" style="padding: 0px; border-collapse: collapse; padding: 0 0 10px 0"> 
                                            <p style="margin: 10px 0 0 0">Github user
kateri1 commented on a diff in the pull request:</p> 
                                            <p style="margin: 10px 0 0 0"> <a href="https://github.com/apache/flink/pull/2735#discussion_r98613624"
class="external-link" rel="nofollow" style="color: #3b73af; text-decoration: none">https://github.com/apache/flink/pull/2735#discussion_r98613624</a></p>

                                            <p style="margin: 10px 0 0 0"> — Diff:
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nlp/Word2Vec.scala —<br />
@@ -0,0 +1,243 @@<br /> +/*<br /> + * Licensed to the Apache Software Foundation
(ASF) under one<br /> + * or more contributor license agreements. See the NOTICE file<br
/> + * distributed with this work for additional information<br /> + * regarding
copyright ownership. The ASF licenses this file<br /> + * to you under the Apache License,
Version 2.0 (the<br /> + * &quot;License&quot;); you may not use this file except
in compliance<br /> + * with the License. You may obtain a copy of the License at<br
/> + *<br /> + * <a href="http://www.apache.org/licenses/LICENSE-2.0" class="external-link"
rel="nofollow" style="color: #3b73af; text-decoration: none">http://www.apache.org/licenses/LICENSE-2.0</a><br
/> + *<br /> + * Unless required by applicable law or agreed to in writing, software<br
/> + * distributed under the License is distributed on an &quot;AS IS&quot; BASIS,<br
/> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.<br />
+ * See the License for the specific language governing permissions and<br /> + * limitations
under the License.<br /> + */<br /> +<br /> +package org.apache.flink.ml.nlp<br
/> +<br /> +import org.apache.flink.api.scala._<br /> +import org.apache.flink.ml.common.</p>
{Parameter, ParameterMap} 
                                            <p style="margin: 10px 0 0 0"> +import org.apache.flink.ml.optimization.</p>
{Context, ContextEmbedder, HSMWeightMatrix} 
                                            <p style="margin: 10px 0 0 0"> +import org.apache.flink.ml.pipeline.</p>
{FitOperation, TransformDataSetOperation, Transformer} 
                                            <p style="margin: 10px 0 0 0"> +<br />
+/**<br /> + * Implements Word2Vec as a transformer on a DataSet[Iterable<span class="error">[String]</span>]<br
/> + *<br /> + * Calculates valuable vectorizations of individual words given<br
/> + * the context in which they appear<br /> + *<br /> + * @example<br
/> + * {{</p> { + * //constructed of 'sentences' - where each string in the iterable
is a word + * val stringsDS = DataSet[Iterable[String]] = ... + * val stringsDS2 = DataSet[Iterable[String]]
= ... + * + * val w2V = Word2Vec() + * .setIterations(5) + * .setTargetCount(10) + * .setSeed(500)
+ * + * //internalizes an initial weightSet + * w2V.fit(stringsDS) + * + * //note that the
same DS can be used to fit and optimize + * //the number of learned vectors is limited to the
vocab built in fit + * val wordVectors : DataSet[(String, Vector[Double])] = w2V.optimize(stringsDS2)
+ * } 
                                            <p style="margin: 10px 0 0 0">}}<br />
+ *<br /> + * =Parameters=<br /> + *<br /> + * - [<span class="error">[org.apache.flink.ml.nlp.Word2Vec.WindowSize]</span>]<br
/> + * sets the size of window for skipGram formation: how far on either side of<br
/> + * a given word will we sample the context? (Default value: '''10''')<br /> +
*<br /> + * - [<span class="error">[org.apache.flink.ml.nlp.Word2Vec.Iterations]</span>]<br
/> + * sets the number of global iterations the training set is passed through - essentially
looping on<br /> + * whole set, leveraging flink's iteration operator (Default value:
'''10''')<br /> + *<br /> + * - [<span class="error">[org.apache.flink.ml.nlp.Word2Vec.TargetCount]</span>]<br
/> + * sets the minimum number of occurrences of a given target value before that value
is<br /> + * excluded from vocabulary (e.g. if this parameter is set to '5', and a target<br
/> + * appears in the training set less than 5 times, it is not included in vocabulary)<br
/> + * (Default value: '''5''')<br /> + *<br /> + * - [<span class="error">[org.apache.flink.ml.nlp.Word2Vec.VectorSize]</span>]<br
/> + * sets the length of each learned vector (Default value: '''100''')<br /> +
*<br /> + * - [<span class="error">[org.apache.flink.ml.nlp.Word2Vec.LearningRate]</span>]<br
/> + * sets the rate of descent during backpropagation - this value decays linearly with<br
/> + * individual training sets, determined by BatchSize (Default value: '''0.015''')<br
/> + *<br /> + * - [<span class="error">[org.apache.flink.ml.nlp.Word2Vec.BatchSize]</span>]<br
/> + * sets the batch size of training sets - the input DataSet will be batched into<br
/> + * groups of this size for learning (Default value: '''1000''')<br /> + *<br
/> + * - [<span class="error">[org.apache.flink.ml.nlp.Word2Vec.Seed]</span>]<br
/> + * sets the seed for generating random vectors at initial weighting DataSet creation<br
/> + * (Default value: '''Some(scala.util.Random.nextLong)''')<br /> + */<br />
+class Word2Vec extends Transformer<span class="error">[Word2Vec]</span> {<br
/> + import Word2Vec._<br /> +<br /> + private <span class="error">[nlp]</span>
var wordVectors:<br /> + Option[DataSet[HSMWeightMatrix<span class="error">[String]</span>]]
= None<br /> +<br /> + def setIterations(iterations: Int): this.type = </p>
{ + parameters.add(Iterations, iterations) + this + } 
                                            <p style="margin: 10px 0 0 0"> +<br />
+ def setTargetCount(targetCount: Int): this.type = </p> { + parameters.add(TargetCount,
targetCount) + this + } 
                                            <p style="margin: 10px 0 0 0"> +<br />
+ def setVectorSize(vectorSize: Int): this.type = </p> { + parameters.add(VectorSize,
vectorSize) + this + } 
                                            <p style="margin: 10px 0 0 0"> +<br />
+ def setLearningRate(learningRate: Double): this.type = </p> { + parameters.add(LearningRate,
learningRate) + this + } 
                                            <p style="margin: 10px 0 0 0"> +<br />
+ def setWindowSize(windowSize: Int): this.type = </p> { + parameters.add(WindowSize,
windowSize) + this + } 
                                            <p style="margin: 10px 0 0 0"> +<br />
+ def setBatchSize(batchSize: Int): this.type = </p> { + parameters.add(BatchSize, batchSize)
+ this + } 
                                            <p style="margin: 10px 0 0 0"> +<br />
+ def setSeed(seed: Long): this.type = </p> { + parameters.add(Seed, seed) + this +
} 
                                            <p style="margin: 10px 0 0 0"> +<br />
+}<br /> +<br /> +object Word2Vec {<br /> + case object Iterations extends
Parameter<span class="error">[Int]</span> </p> { + val defaultValue = Some(10)
+ }<br /> +<br /> + case object TargetCount extends Parameter<span class="error">[Int]</span>
{ + val defaultValue = Some(5) + }<br /> +<br /> + case object VectorSize extends
Parameter<span class="error">[Int]</span> { + val defaultValue = Some(100) + }<br
/> +<br /> + case object LearningRate extends Parameter<span class="error">[Double]</span>
{ + val defaultValue = Some(0.015) + }<br /> +<br /> + case object WindowSize
extends Parameter<span class="error">[Int]</span> { + val defaultValue = Some(10)
+ } 
                                            <p style="margin: 10px 0 0 0"> +<br />
+ case object BatchSize extends Parameter<span class="error">[Int]</span> </p>
{ + val defaultValue = Some(1000) + } 
                                            <p style="margin: 10px 0 0 0"> +<br />
+ case object Seed extends Parameter<span class="error">[Long]</span> </p>
{ + val defaultValue = Some(scala.util.Random.nextLong) + } 
                                            <p style="margin: 10px 0 0 0"> +<br />
+ def apply(): Word2Vec = </p> { + new Word2Vec() + } 
                                            <p style="margin: 10px 0 0 0"> +<br />
+ /** [<span class="error">[FitOperation]</span>] which builds initial vocabulary
for Word2Vec context embedding<br /> + *<br /> + * @tparam T Subtype of Iterable<span
class="error">[String]</span><br /> + * @return<br /> + */<br />
+ implicit def learnWordVectors[T &lt;: Iterable<span class="error">[String]</span>]
= {<br /> + new FitOperation<span class="error">[Word2Vec, T]</span> {<br
/> + override def fit(<br /> + instance: Word2Vec,<br /> + fitParameters: ParameterMap,<br
/> + input: DataSet<span class="error">[T]</span>)<br /> + : Unit = {<br
/> + val resultingParameters = instance.parameters ++ fitParameters<br /> + <br
/> + val skipGrams = input<br /> + .flatMap(x =&gt;<br /> — End diff
–</p> 
                                            <p style="margin: 10px 0 0 0"> The same copy-pasted
code appears in both learnWordVectors and words2Vecs. Consider extracting it into a shared function
to simplify potential refactoring.</p> 
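                                            <p style="margin: 10px 0 0 0"> A minimal sketch of
what such an extraction could look like, under stated assumptions. The duplicated body is cut off
in the diff above, so this does not reproduce it. It uses plain Scala collections as a stand-in for
DataSet, and every name in it (SkipGramSketch, WordContext, formSkipGrams, contextsForFit,
contextsForTransform) is hypothetical rather than taken from the pull request. The point is only
that both call sites delegate to one helper:</p> 
<pre>
```scala
// Hypothetical sketch: all names are illustrative and are not taken from the pull request.
// Plain Scala collections stand in for Flink's DataSet so the example runs on its own.
object SkipGramSketch {
  // A target word paired with the words that surround it.
  final case class WordContext(target: String, context: Seq[String])

  // Shared helper: builds one WordContext per word, taking up to
  // windowSize neighbours on either side of the target.
  // Words with no neighbours at all are dropped.
  def formSkipGrams(sentence: Iterable[String], windowSize: Int): Seq[WordContext] = {
    val words = sentence.toIndexedSeq
    words.indices.map { i =>
      val left = words.slice(math.max(0, i - windowSize), i)
      val right = words.slice(i + 1, math.min(words.length, i + 1 + windowSize))
      WordContext(words(i), left ++ right)
    }.filter(_.context.nonEmpty)
  }

  // Stand-in for the fit-side call site: delegates to the helper.
  def contextsForFit(input: Seq[Iterable[String]], windowSize: Int): Seq[WordContext] =
    input.flatMap(formSkipGrams(_, windowSize))

  // Stand-in for the transform-side call site: same helper, no repeated body.
  def contextsForTransform(input: Seq[Iterable[String]], windowSize: Int): Seq[WordContext] =
    input.flatMap(formSkipGrams(_, windowSize))
}
```
</pre>
                                            <p style="margin: 10px 0 0 0"> With a helper like
this, a later change to how the window is sampled would only need to be made in one place.</p> 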
                                        </td> 
                                    </tr> 
                                </table> 
                            </td> 
                        </tr> 
                        <!-- there needs to be content in the cell for it to render in
some clients --> 
                        <tr> 
                            <td class="email-content-rounded-bottom mobile-expand" style="padding:
0px; border-collapse: collapse; color: #fff; padding: 0 15px 0 16px; height: 5px; line-height:
5px; background-color: #fff; border-top: 0; border-left: 1px solid #ccc; border-bottom: 1px
solid #ccc; border-right: 1px solid #ccc; border-bottom-right-radius: 5px; border-bottom-left-radius:
5px; mso-line-height-rule: exactly">
                                &nbsp;
                            </td> 
                        </tr> 
                    </table> 
                </td> 
            </tr> 
            <tr> 
                <td id="footer-pattern" style="padding: 0px; border-collapse: collapse;
padding: 12px 20px"> 
                    <table id="footer-pattern-container" cellspacing="0" cellpadding="0"
border="0" style="border-collapse: collapse; mso-table-lspace: 0pt; mso-table-rspace: 0pt">

                        <tr> 
                            <td id="footer-pattern-text" class="mobile-resize-text" width="100%"
style="padding: 0px; border-collapse: collapse; color: #999; font-size: 12px; line-height:
18px; font-family: Arial, sans-serif; mso-line-height-rule: exactly; mso-text-raise: 2px">
                                 This message was sent by Atlassian JIRA <span id="footer-build-information">(v6.3.15#6346-<span
title="dbc023dd75cecacf443c4b235f66124b15f5c5fe" data-commit-id="dbc023dd75cecacf443c4b235f66124b15f5c5fe}">sha1:dbc023d</span>)</span>

                            </td> 
                            <td id="footer-pattern-logo-desktop-container" valign="top"
style="padding: 0px; border-collapse: collapse; padding-left: 20px; vertical-align: top">

                                <table style="border-collapse: collapse; mso-table-lspace:
0pt; mso-table-rspace: 0pt"> 
                                    <tr> 
                                        <td id="footer-pattern-logo-desktop-padding" style="padding:
0px; border-collapse: collapse; padding-top: 3px"> <img id="footer-pattern-logo-desktop"
src="cid:jira-generated-image-static-footer-desktop-logo-9f1170d6-7670-40e7-90bc-02c5b087f40f"
alt="Atlassian logo" title="Atlassian logo" width="169" height="36" class="image_fix" />

                                        </td> 
                                    </tr> 
                                </table> 
                            </td> 
                        </tr> 
                    </table> 
                </td> 
            </tr> 
        </table>   
    </body>
</html>