mahout-dev mailing list archives

From "Grant Ingersoll (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-588) Benchmark Mahout's clustering performance on EC2 and publish the results
Date Fri, 25 Mar 2011 13:51:05 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011193#comment-13011193 ]

Grant Ingersoll commented on MAHOUT-588:
----------------------------------------

Hi Tim,

For the shell script, can we parameterize it a bit more?  For example, pass in the prep dir
and output dir as arguments?

Also, s3cmd is GPL, so we can't include it, but we should at least document that it is required
for this script to work.  Perhaps we could use s3-curl which is BSD and does the same thing
and could be bundled in?

Thanks,
Grant

> Benchmark Mahout's clustering performance on EC2 and publish the results
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-588
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-588
>             Project: Mahout
>          Issue Type: Task
>            Reporter: Grant Ingersoll
>         Attachments: 60_clusters_kmeans_10_iterations_100K_coordinates.txt, MAHOUT-588.patch,
> MailArchivesClusteringAnalyzer.java, MailArchivesClusteringAnalyzerTest.java, SequenceFilesFromMailArchives.java,
> SequenceFilesFromMailArchives.java, SequenceFilesFromMailArchives2.java, SequenceFilesFromMailArchivesTest.java,
> TamingAnalyzer.java, TamingAnalyzer.java, TamingAnalyzerTest.java, TamingCollocDriver.java,
> TamingCollocMapper.java, TamingDictVect.java, TamingDictionaryVectorizer.java, TamingGramKeyGroupComparator.java,
> TamingSubset.java, TamingSubsetMapper.java, TamingTFIDF.java, TamingTokenizer.java, Top1000Tokens_maybe_stopWords,
> Uncompress.java, clusters1.txt, clusters_kMeans.txt, distcp_large_to_s3_failed.log, ec2_setup_notes.txt,
> ec2_setup_notes_v2.txt, ec2_setup_notes_v2.txt, mahout-588_canopy.pdf, mahout-588_distribution.pdf,
> prep_asf_mail_archives.sh, prep_asf_mail_archives.sh, seq2sparse_small_failed.log, seq2sparse_xlarge_ok.log
>
>
> For Taming Text, I've commissioned some benchmarking work on Mahout's clustering algorithms.
> I've asked the two doing the project to do all the work in the open here.  The goal is to
> use a publicly reusable dataset (for now, the ASF mail archives, assuming it is big enough)
> and run on EC2 and make all resources available so others can reproduce/improve.
> I'd like to add the setup code to utils (although it could possibly be done as a Vectorizer)
> and the publication of the results will be put up on the Wiki as well as in the book.  This
> issue is to track the patches, etc.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
