mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Input on PTD dataset results
Date Mon, 26 Apr 2010 19:14:35 GMT
On Mon, Apr 26, 2010 at 10:49 AM, Ken Krugler
<kkrugler_lists@transpac.com> wrote:

> Hi all,
>
> I'm looking for input on two questions about the raw data files from the
> Public Terabyte Dataset project:
>
> 1. Target file size. What's the biggest file size that people would want to
> handle?
>
> E.g. we could generate 1000 chunks of 1GB each, or 100 chunks of 10GB, etc.
>

I like chunks < 1GB if only because moving them over a network involves less
wasted effort for failed transfers.


> 2. Any value to specific grouping of data in files?
>
> E.g. we could try to ensure that all data from the same domain goes into
> the same file.
>
> But that might result in individual data files having more skew, and thus
> make it harder to get useful results from processing a subset of the data.
>

Exactly.  I would find skewed data a pain in the butt for statistical analysis.
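
For illustration only (this is not from the thread): one way to avoid per-file domain skew is to assign each record to a chunk by hashing its full URL instead of grouping by domain. The function name `chunk_for`, the choice of MD5, and the example.com URLs are all assumptions made for this sketch, not anything the PTD project is known to use.

```python
import hashlib
from collections import Counter

def chunk_for(url: str, num_chunks: int) -> int:
    """Map a URL to a chunk index using a stable hash of the whole URL.

    Hypothetical helper: hashing the full URL (not just the domain)
    scatters pages from one domain across many chunks.
    """
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_chunks

# Made-up URLs that all share a single domain.
urls = [f"http://example.com/page{i}" for i in range(1000)]

# Count how many of those pages land in each of 10 chunks.
counts = Counter(chunk_for(u, 10) for u in urls)
print(sorted(counts.items()))
```

With this kind of assignment, any single chunk holds a roughly uniform sample of pages, so processing a subset of chunks should be less affected by one large domain.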
