mahout-user mailing list archives

From Ted Dunning
Subject Re: Input on PTD dataset results
Date Mon, 26 Apr 2010 19:14:35 GMT
On Mon, Apr 26, 2010 at 10:49 AM, Ken Krugler wrote:

> Hi all,
> I'm looking for input on two questions about the raw data files from the
> Public Terabyte Dataset project:
> 1. Target file size. What's the biggest file size that people would want to
> handle?
> E.g. we could generate 1000 chunks of 1GB each, or 100 chunks of 10GB, etc.

I like chunks < 1GB if only because moving them over a network involves less
wasted effort for failed transfers.

> 2. Any value to specific grouping of data in files?
> E.g. we could try to ensure that all data from the same domain goes into
> the same file.
> But that might result in individual data files having more skew, and thus
> make it harder to get useful results from processing a subset of the data.

Exactly.  I would find skewed data a pain in the butt for statistical analysis.
