Date: Sat, 28 Jan 2012 15:25:10 +0000 (UTC)
From: "Grant Ingersoll (Commented) (JIRA)"
To: dev@mahout.apache.org
Reply-To: dev@mahout.apache.org
Message-ID: <364812882.4556.1327764310171.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <2003209837.74746.1327450480162.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (MAHOUT-957) term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering

[
https://issues.apache.org/jira/browse/MAHOUT-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195565#comment-13195565 ]

Grant Ingersoll commented on MAHOUT-957:
----------------------------------------

I committed my patch. John, does that fix things for you?

> term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering
> ------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-957
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-957
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: John Conwell
>            Assignee: Grant Ingersoll
>             Fix For: 0.6
>
>         Attachments: MAHOUT-957.patch
>
>
> SparseVectorsFromSequenceFiles throws an exception when you request term frequency (tf) vectors as output together with the maxDFSigma filtering option.
> The if / else if section shown below skips the call to DictionaryVectorizer.createTermFrequencyVectors when you have that combination. The condition creates vectors when you want tf vectors without maxDFSigma filtering, or tfidf vectors with maxDFSigma filtering. If you want tf vectors with maxDFSigma filtering, however, neither branch matches, so createTermFrequencyVectors is never called, and a later step throws an exception because the vector input path doesn't exist.
> For example, the following command line reproduces the problem:
> bin/mahout seq2sparse -i /Users/me/Documents/workspace/mahoutStuff/seq -o /Users/me/Documents/workspace/mahoutStuff/termvecs -wt tf --minSupport 2 --minDF 2 --maxDFSigma 3 -seq
> // the suspect code at line ~267 of SparseVectorsFromSequenceFiles, around the calls to DictionaryVectorizer.createTermFrequencyVectors
> if (!processIdf && !shouldPrune) {
>   DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, outputDir, tfDirName, conf, minSupport, maxNGramSize,
>     minLLRValue, norm, logNormalize, reduceTasks, chunkSize, sequentialAccessOutput, namedVectors);
> } else if (processIdf) {
>   DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, outputDir, tfDirName, conf, minSupport, maxNGramSize,
>     minLLRValue, -1.0f, false, reduceTasks, chunkSize, sequentialAccessOutput, namedVectors);
> }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
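
The gap in the quoted if / else if is easiest to see by enumerating the four flag combinations. The sketch below is a standalone model of that branch, not Mahout code: the class TfBranchSketch, the enum TfCall, and the methods reported and covered are hypothetical names introduced here for illustration. It assumes, based on the description, that -wt tf corresponds to processIdf == false and that --maxDFSigma corresponds to shouldPrune == true. The covered method shows one possible rearrangement that leaves no combination unhandled; it is not the committed MAHOUT-957.patch, whose contents do not appear in this message.

```java
/** Illustrative model of the branch quoted in the issue; not Mahout code. */
class TfBranchSketch {

  /** Which createTermFrequencyVectors call, if any, a flag combination reaches. */
  enum TfCall { NORMALIZED, UNNORMALIZED, SKIPPED }

  /** Mirrors the quoted if / else if exactly as reported. */
  static TfCall reported(boolean processIdf, boolean shouldPrune) {
    if (!processIdf && !shouldPrune) {
      return TfCall.NORMALIZED;    // call that passes norm and logNormalize
    } else if (processIdf) {
      return TfCall.UNNORMALIZED;  // call that passes -1.0f and false
    }
    return TfCall.SKIPPED;         // tf weighting with pruning: no call is made
  }

  /**
   * Hypothetical rearrangement in which every combination makes a call.
   * Assumption: when pruning is requested, unnormalized tf vectors are an
   * acceptable input for the later stages, as they are in the tfidf branch.
   */
  static TfCall covered(boolean processIdf, boolean shouldPrune) {
    if (processIdf || shouldPrune) {
      return TfCall.UNNORMALIZED;
    }
    return TfCall.NORMALIZED;
  }

  public static void main(String[] args) {
    boolean[] flags = {false, true};
    for (boolean processIdf : flags) {
      for (boolean shouldPrune : flags) {
        System.out.printf("processIdf=%b shouldPrune=%b reported=%s covered=%s%n",
            processIdf, shouldPrune,
            reported(processIdf, shouldPrune), covered(processIdf, shouldPrune));
      }
    }
  }
}
```

Only the combination processIdf == false with shouldPrune == true reaches SKIPPED in the reported logic, which matches the seq2sparse command line in the description.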