mahout-dev mailing list archives

From "Grant Ingersoll (Resolved) (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (MAHOUT-957) term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering
Date Sun, 29 Jan 2012 00:58:10 GMT

     [ https://issues.apache.org/jira/browse/MAHOUT-957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved MAHOUT-957.
------------------------------------

    Resolution: Fixed
    
> term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering
> ------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-957
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-957
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: John Conwell
>            Assignee: Grant Ingersoll
>             Fix For: 0.6
>
>         Attachments: MAHOUT-957.patch
>
>
> SparseVectorsFromSequenceFiles throws an exception when you request term frequency vector output together with the maxDFSigma filtering option.
> The if / else if section shown below skips the call to DictionaryVectorizer.createTermFrequencyVectors when you have that combination. The condition creates vectors when you want tf vectors without maxDFSigma filtering, or tfidf vectors (with or without maxDFSigma filtering). If you want tf vectors with maxDFSigma filtering, neither branch runs, createTermFrequencyVectors is never called, and a later step throws an exception because the vector input path doesn't exist.
> For example, the following command line reproduces this situation:
> bin/mahout seq2sparse -i /Users/me/Documents/workspace/mahoutStuff/seq -o /Users/me/Documents/workspace/mahoutStuff/termvecs -wt tf --minSupport 2 --minDF 2 --maxDFSigma 3 -seq
> // the suspect code at line ~267, around the calls to DictionaryVectorizer.createTermFrequencyVectors
> if (!processIdf && !shouldPrune) {
>         DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, outputDir, tfDirName, conf, minSupport, maxNGramSize,
>           minLLRValue, norm, logNormalize, reduceTasks, chunkSize, sequentialAccessOutput, namedVectors);
> } else if (processIdf) {
>         DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, outputDir, tfDirName, conf, minSupport, maxNGramSize,
>           minLLRValue, -1.0f, false, reduceTasks, chunkSize, sequentialAccessOutput, namedVectors);
> }
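
The gap in the quoted condition can be seen by enumerating the four flag combinations. The following is a minimal standalone sketch, not Mahout code: the class BranchCoverage and the method branchTaken are hypothetical names introduced here, and the method only mirrors the shape of the if / else if quoted above, reporting which branch (if any) would run.

```java
public class BranchCoverage {

    // Hypothetical helper: mirrors the structure of the quoted condition and
    // reports which branch would call createTermFrequencyVectors.
    public static String branchTaken(boolean processIdf, boolean shouldPrune) {
        if (!processIdf && !shouldPrune) {
            return "first";   // tf output, no maxDFSigma pruning
        } else if (processIdf) {
            return "second";  // tfidf output, with or without pruning
        }
        return "none";        // tf output with pruning: no branch runs
    }

    public static void main(String[] args) {
        boolean[] flags = {false, true};
        for (boolean processIdf : flags) {
            for (boolean shouldPrune : flags) {
                System.out.printf("processIdf=%b shouldPrune=%b -> %s%n",
                        processIdf, shouldPrune, branchTaken(processIdf, shouldPrune));
            }
        }
    }
}
```

Under these assumptions, the only combination that reaches "none" is processIdf=false with shouldPrune=true, which corresponds to -wt tf combined with --maxDFSigma in the reproduction command.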

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
