mahout-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-738) Collocation driver has long being statically cast to an int
Date Wed, 22 Jun 2011 19:30:47 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053413#comment-13053413 ]

Hudson commented on MAHOUT-738:
-------------------------------

Integrated in Mahout-Quality #897 (See [https://builds.apache.org/job/Mahout-Quality/897/])
    MAHOUT-738 treat input to LLR as long to avoid possible overflow

srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1138553
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedUncenteredZeroAssumingCosineVectorSimilarity.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedPearsonCorrelationVectorSimilarity.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedLoglikelihoodVectorSimilarity.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/collocations/llr/LLRReducer.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedEuclideanDistanceVectorSimilarity.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/similarity/LogLikelihoodSimilarity.java
* /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/vectors/lucene/ClusterLabels.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedTanimotoCoefficientVectorSimilarity.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/stats/LogLikelihood.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedUncenteredCosineVectorSimilarity.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedCityBlockVectorSimilarity.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedVectorSimilarity.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/collocations/llr/LLRReducerTest.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/AbstractDistributedVectorSimilarity.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/vector/DistributedCooccurrenceVectorSimilarity.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/similarity/TestRowSimilarityJob.java
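
As a rough illustration of what "treat input to LLR as long" could look like, here is a minimal sketch. It is not the committed code: the class name `LongLlrSketch` and the helpers `xLogX` and `entropy` are assumptions written for this example. Only the body of `logLikelihoodRatio` follows the function quoted in the issue description below, with the parameter types widened from int to long.

```java
// Illustrative sketch only; not the code committed for MAHOUT-738.
// LongLlrSketch, xLogX and entropy are hypothetical names for this example.
class LongLlrSketch {

  // x * ln(x), with the usual convention that 0 * ln(0) = 0.
  static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // Unnormalized Shannon entropy of a set of counts: N*ln(N) - sum(x*ln(x)).
  static double entropy(long... counts) {
    long sum = 0;
    double result = 0.0;
    for (long x : counts) {
      if (x < 0) {
        throw new IllegalArgumentException("negative count: " + x);
      }
      sum += x;
      result += xLogX(x);
    }
    return xLogX(sum) - result;
  }

  // Same structure as the quoted function, but every count is a long,
  // so no narrowing cast is needed before the call.
  static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
    double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    if (rowEntropy + columnEntropy > matrixEntropy) {
      return 0.0; // round-off error
    }
    return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
  }
}
```

With long parameters, a caller can pass a count such as 3_000_000_000L directly. Negative counts are still rejected, so an upstream overflow would surface as an exception rather than as silent bad output.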


> Collocation driver has long being statically cast to an int
> -----------------------------------------------------------
>
>                 Key: MAHOUT-738
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-738
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: peter andrews
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: MAHOUT-739.patch
>
>
> org.apache.mahout.vectorizer.collocations.llr.LLRReducer, which is part of the collocation driver, statically casts a long to an int.
> private long ngramTotal;
> ...
> int k11 = ngram.getFrequency(); /* a&b */
> int k12 = gramFreq[0] - ngram.getFrequency(); /* a&!b */
> int k21 = gramFreq[1] - ngram.getFrequency(); /* !a&b */
> int k22 = (int) (ngramTotal - (gramFreq[0] + gramFreq[1] - ngram.getFrequency())); /* !a&!b */
> These numbers are then fed into org.apache.mahout.math.stats.LogLikelihood, specifically the function below.
> public static double logLikelihoodRatio(int k11, int k12, int k21, int k22) {
>   // note that we have counts here, not probabilities, and that the entropy is not normalized.
>   double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
>   double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
>   double matrixEntropy = entropy(k11, k12, k21, k22);
>   if (rowEntropy + columnEntropy > matrixEntropy) {
>     // round off error
>     return 0.0;
>   }
>   return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
> }
> In short, if ngramTotal is larger than Integer.MAX_VALUE (which will happen with large datasets), the driver will either crash or, if the cast yields a negative int, continue as usual but produce no output because of error checking.
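
The wrap-around described above can be reproduced in isolation. The following is a minimal sketch, not Mahout code: the class name `NarrowingCastDemo`, its method names, and the sample counts are made up for illustration. It mirrors the shape of the quoted k22 computation to show that a narrowing `(int)` cast silently wraps a large long, that the same arithmetic kept in long does not, and that `Math.toIntExact` (Java 8 and later) throws instead of wrapping.

```java
// Illustrative only: class, method names, and sample counts are hypothetical, not Mahout code.
class NarrowingCastDemo {

  // Mirrors the shape of the quoted k22 computation, including the narrowing cast.
  static int k22AsInt(long ngramTotal, int freqA, int freqB, int freqAB) {
    return (int) (ngramTotal - (freqA + freqB - freqAB));
  }

  // Same computation kept in long arithmetic, so large totals do not wrap.
  static long k22AsLong(long ngramTotal, long freqA, long freqB, long freqAB) {
    return ngramTotal - (freqA + freqB - freqAB);
  }

  public static void main(String[] args) {
    long ngramTotal = 3_000_000_000L; // larger than Integer.MAX_VALUE (2_147_483_647)

    System.out.println(k22AsInt(ngramTotal, 1000, 2000, 500));  // negative after wrap-around
    System.out.println(k22AsLong(ngramTotal, 1000, 2000, 500)); // 2999997500

    try {
      Math.toIntExact(ngramTotal); // checked narrowing: throws instead of wrapping
    } catch (ArithmeticException e) {
      System.out.println("overflow detected: " + e.getMessage());
    }
  }
}
```

A negative k22 is the case the report says is caught by error checking. Totals large enough to wrap back into the positive int range would not be caught that way, which is one more reason to keep the counts as long end to end.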

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
