mahout-dev mailing list archives

From "Hudson (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-973) SparseVectorsFromSequenceFiles will not create a proper TFIDF (bug in TFIDFPartialVectorReducer)
Date Fri, 06 Apr 2012 17:11:23 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248501#comment-13248501 ]

Hudson commented on MAHOUT-973:
-------------------------------

Integrated in Mahout-Quality #1427 (See [https://builds.apache.org/job/Mahout-Quality/1427/])
    MAHOUT-973 one more file needed for fix to compute maxDF as a percent of total count (Revision 1310357)

     Result = SUCCESS
srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1310357
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java

                
> SparseVectorsFromSequenceFiles will not create a proper TFIDF (bug in TFIDFPartialVectorReducer)
> ------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-973
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-973
>             Project: Mahout
>          Issue Type: Bug
>    Affects Versions: 0.6
>            Reporter: Viktor Gal
>            Assignee: Sean Owen
>             Fix For: 0.7
>
>         Attachments: fix-TFIDFPartialVectorReducer.patch
>
>
> Although I use TFIDFConverter in a slightly different way, the same problem occurs with
> SparseVectorsFromSequenceFiles when somebody wants to create TFIDF vectors for their
> documents.
> Basically, if maxDFSigma is not set, then because of SparseVectorsFromSequenceFiles.java:281
> long maxDF = maxDFPercent;
> maxDF will be 99. This value is then passed to the TFIDFConverter.processTfIdf function as an
> argument, where it is interpreted as "The max percentage of vectors for the DF." Partial
> vectors are created with TFIDFPartialVectorReducer.class, and because of
> TFIDFPartialVectorReducer.java:81, with maxDF = 99, the term is ignored if (df > maxDF).
> The problem here is that two different quantities are compared. The df value is the number
> of documents that contain the given term, and it is not normalized by the number of
> documents, i.e. it is not a percentage (see TermDocumentCountReducer.java for details),
> while maxDF is interpreted as a percentage, as noted above. Thus, as soon as the df count
> gets higher than 99, or in the best case 100, meaning the given term occurs in more than 99
> or 100 different documents, the term is ignored, and this is not what we would like it to do.
> I.e. there is a bug in TFIDFPartialVectorReducer.java at line 81.
> I've attached a possible fix for this problem.
> The bug was introduced in commit a61e5ff8 (git), or rev 1210994 in svn:
> @@ -78,7 +78,7 @@ public class TFIDFPartialVectorReducer extends
>          continue;
>        }
>        long df = dictionary.get(e.index());
> -      if (df * 100.0 / vectorCount > maxDfPercent) {
> +      if (maxDf > -1 && df > maxDf) {
>          continue;
>        }
>        if (df < minDf) {

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
