mahout-dev mailing list archives

From "Timothy Potter (JIRA)" <>
Subject [jira] Commented: (MAHOUT-588) Benchmark Mahout's clustering performance on EC2 and publish the results
Date Sun, 30 Jan 2011 17:19:49 GMT


Timothy Potter commented on MAHOUT-588:

Hi Sean,

Will definitely look into updating the Wiki as I work through the process.

Thanks for the heads-up on the 5TB limit. The failure might come from how distcp was accessing
S3, as I definitely got back a "max file size exceeded" error from S3 when trying to upload
files larger than 5GB. I will do some more research to find the exact cause.

> Benchmark Mahout's clustering performance on EC2 and publish the results
> ------------------------------------------------------------------------
>                 Key: MAHOUT-588
>                 URL:
>             Project: Mahout
>          Issue Type: Task
>            Reporter: Grant Ingersoll
>         Attachments: distcp_large_to_s3_failed.log, seq2sparse_small_failed.log, seq2sparse_xlarge_ok.log,,,
> For Taming Text, I've commissioned some benchmarking work on Mahout's clustering algorithms.
> I've asked the two doing the project to do all the work in the open here.  The goal is to
> use a publicly reusable dataset (for now, the ASF mail archives, assuming it is big enough)
> and run on EC2 and make all resources available so others can reproduce/improve.
> I'd like to add the setup code to utils (although it could possibly be done as a Vectorizer)
> and the publication of the results will be put up on the Wiki as well as in the book.  This
> issue is to track the patches, etc.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
