mahout-user mailing list archives

From Eric <>
Subject processing compressed files
Date Thu, 12 Sep 2013 16:41:06 GMT

I'd like to use Mahout for clustering and classification on tens of 
terabytes of data stored on Amazon's S3 storage service.  Each file in my 
data yields one data point, and I need to decompress and process the file 
before applying machine learning.  Do all the files have to be pre-processed 
before running Mahout, or is there a straightforward way to combine the 
pre-processing with Mahout?  For example, could I somehow tell Mahout to run 
my existing pre-processing script on each file?

Pre-processing the files prior to running Mahout is simple, but Amazon 
charges for the extra storage space the pre-processed files would use.
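
To make the on-the-fly idea concrete, here is a rough sketch of the per-file 
step I have in mind, where each compressed file is decompressed in memory and 
turned into one data point without writing an intermediate file.  It assumes 
gzip compression, and the names preprocess and data_points are hypothetical 
placeholders, as are the features themselves; how the resulting vectors would 
be handed to Mahout is exactly the part I don't know and is not shown.

```python
import gzip
from typing import Iterable, Iterator, List, Tuple


def preprocess(raw: bytes) -> List[float]:
    """Hypothetical stand-in for the real preprocessing script.

    Maps the decompressed contents of one file to one feature vector.
    The two features here (length, mean byte value) are placeholders.
    """
    if not raw:
        return [0.0, 0.0]
    return [float(len(raw)), sum(raw) / len(raw)]


def data_points(
    files: Iterable[Tuple[str, bytes]]
) -> Iterator[Tuple[str, List[float]]]:
    """Yield one (name, vector) pair per compressed file.

    Decompression happens in memory, so no pre-processed copy is stored.
    Assumes gzip; a different codec would replace gzip.decompress.
    """
    for name, compressed in files:
        raw = gzip.decompress(compressed)
        yield name, preprocess(raw)


if __name__ == "__main__":
    # Tiny in-memory sample standing in for objects fetched from storage.
    sample = [("a.gz", gzip.compress(b"hello")), ("b.gz", gzip.compress(b""))]
    for name, vector in data_points(sample):
        print(name, vector)
```

Because data_points is a generator, only one decompressed file is held in 
memory at a time.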


