From: "Szymon Chojnacki (JIRA)"
To: dev@mahout.apache.org
Reply-To: dev@mahout.apache.org
Date: Mon, 7 Feb 2011 14:47:31 +0000 (UTC)
Message-ID: <196637458.3910.1297090051027.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <33525608.124741295699624118.JavaMail.jira@thor>
Subject: [jira] Commented: (MAHOUT-588) Benchmark Mahout's clustering performance on EC2 and publish the results
Mailing-List: contact dev-help@mahout.apache.org; run by ezmlm

    [ 
https://issues.apache.org/jira/browse/MAHOUT-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991398#comment-12991398 ]

Szymon Chojnacki commented on MAHOUT-588:
-----------------------------------------

Thank you, Ted, for your support. The vectors in our 3-gram set have 935 173 coordinates, of which on average 372 452 are non-empty (39.8%). For now we have limited the dimensionality to around 500K by keeping only 1- and 2-grams. Once we succeed with the smaller dimensionality, we will come back to the issue of hashed feature encoding with 3-grams.

Regards

> Benchmark Mahout's clustering performance on EC2 and publish the results
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-588
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-588
>             Project: Mahout
>          Issue Type: Task
>            Reporter: Grant Ingersoll
>         Attachments: SequenceFilesFromMailArchives.java, SequenceFilesFromMailArchives2.java, TamingAnalyzer.java, TamingAnalyzer.java, TamingAnalyzerTest.java, TamingCollocDriver.java, TamingCollocMapper.java, TamingDictVect.java, TamingDictionaryVectorizer.java, TamingGramKeyGroupComparator.java, TamingTFIDF.java, TamingTokenizer.java, Top1000Tokens_maybe_stopWords, Uncompress.java, clusters1.txt, clusters_kMeans.txt, distcp_large_to_s3_failed.log, ec2_setup_notes.txt, seq2sparse_small_failed.log, seq2sparse_xlarge_ok.log
>
>
> For Taming Text, I've commissioned some benchmarking work on Mahout's clustering algorithms. I've asked the two doing the project to do all the work in the open here. The goal is to use a publicly reusable dataset (for now, the ASF mail archives, assuming it is big enough) and run on EC2 and make all resources available so others can reproduce/improve.
> I'd like to add the setup code to utils (although it could possibly be done as a Vectorizer) and the publication of the results will be put up on the Wiki as well as in the book. This issue is to track the patches, etc.
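
For readers unfamiliar with the hashed feature encoding mentioned in the comment above, here is a minimal, self-contained sketch of the general idea: each n-gram is hashed into one of a fixed number of buckets, so the vector size no longer grows with the vocabulary. This is not Mahout's implementation and not the code used in this benchmark. The class name HashedNgramEncoder, the whitespace tokenization, the FNV-1a hash, and the bucket count are all assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of the hashing trick for n-gram features.
public class HashedNgramEncoder {

    // FNV-1a 32-bit hash; chosen here only for simplicity, any well-mixed hash would do.
    static int fnv1a(String s) {
        int h = 0x811C9DC5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 16777619;
        }
        return h;
    }

    // Counts all 1..maxN-grams of the text into a vector with numBuckets coordinates.
    // Distinct n-grams may collide in the same bucket; that is the accepted trade-off.
    static double[] encode(String text, int maxN, int numBuckets) {
        String trimmed = text.trim();
        String[] tokens = trimmed.isEmpty()
                ? new String[0]
                : trimmed.toLowerCase().split("\\s+");
        double[] vector = new double[numBuckets];
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= tokens.length; i++) {
                String gram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
                // floorMod keeps the bucket index non-negative for negative hashes.
                int bucket = Math.floorMod(fnv1a(gram), numBuckets);
                vector[bucket] += 1.0;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        double[] v = encode("the quick brown fox the quick", 3, 1 << 10);
        int nonZero = 0;
        for (double x : v) {
            if (x != 0.0) {
                nonZero++;
            }
        }
        System.out.println("dimensions=" + v.length + " nonZero=" + nonZero);
    }
}
```

With this kind of encoding the dimensionality is set by numBuckets rather than by the number of distinct 3-grams, at the cost of occasional collisions between unrelated n-grams.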