mahout-dev mailing list archives

From Ted Dunning <>
Subject Re: Pig for preprocessing in Mahout?
Date Fri, 04 Apr 2008 20:01:28 GMT

On 4/4/08 12:39 PM, "Karl Wettin" <> wrote:

> Ted Dunning wrote:
>> I would say that it would be easier to use a system that has a full
>> extension language such as grool or JAQL than pig.  Resampling and
>> discretization are really pretty straightforward applications of map reduce
>> and should normally be collected as components into a larger composite
>> mapper.
> I was thinking we would use Pig as that larger composite mapper. If we
> wanted to add discretization to Mahout we would then add it to Pig. They
> seem to have a framework to do a lot of the things I want in a pre
> processing module.

I think that Pig would be moderately difficult to integrate as a component
in a full-scale ML framework and it would not be suitable as a map

You can write functions that operate on bags of records, but the integration
is likely to be pretty clunky and the resilience to errors in the components
would be nil (AFAIK).  Given that the new ML components are likely to be
less than robust, that makes the overall process painful.
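
To make the composite mapper idea concrete, here is a minimal sketch in plain Python, not the Pig or Mahout API. Every name in it (make_resampler, make_discretizer, composite_mapper) is hypothetical and chosen only for illustration. It assumes records are dicts and that a failing record should be skipped rather than allowed to kill the whole pass, which is one way to get the resilience discussed above.

```python
import random
from typing import Callable, Dict, Iterable, Iterator, List, Optional

# Hypothetical types for illustration: a record is a dict, a stage maps a
# record to a new record, or to None to drop it.
Record = Dict[str, object]
Stage = Callable[[Record], Optional[Record]]


def make_resampler(rate: float, seed: int = 0) -> Stage:
    """Keep each record independently with probability `rate`."""
    rng = random.Random(seed)

    def resample(record: Record) -> Optional[Record]:
        return record if rng.random() < rate else None

    return resample


def make_discretizer(field: str, edges: List[float]) -> Stage:
    """Replace a numeric field with the index of the bin it falls into."""

    def discretize(record: Record) -> Optional[Record]:
        value = record[field]
        # Bin index = number of edges at or below the value.
        bin_index = sum(1 for edge in edges if value >= edge)
        out = dict(record)
        out[field] = bin_index
        return out

    return discretize


def composite_mapper(stages: List[Stage],
                     records: Iterable[Record]) -> Iterator[Record]:
    """Run every stage over each record in a single pass."""
    for record in records:
        try:
            for stage in stages:
                record = stage(record)
                if record is None:
                    break  # a stage dropped this record
        except Exception:
            # Assumption: a bad record is skipped, not fatal to the job.
            continue
        if record is not None:
            yield record
```

Calling composite_mapper([make_resampler(0.1), make_discretizer("x", [1.0, 5.0])], records) would then sample about a tenth of the input and bin the field "x" in the same pass, and a record with a missing or non-numeric "x" is dropped instead of aborting the run.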

> But I don't know Pig enough to say if that could work for all the things
> we might want to do at pre processing time with Mahout.

To the extent that it does, I don't see a problem with using Pig to build
datasets and then running ML on those datasets.  Since Pig is all about
batch processing, this isn't a big deal.

>> I should also have said that Pig is progressing very quickly.
> When do you think Pig might be "stable"?

No clue.  They have a large amount of invested effort so far and have had a
pretty long time in the current state, but I can't say when they will cross
a magical threshold that makes Pig easy enough to use for most users.  I get
the impression that it is beginning to reach that threshold inside Yahoo
where there is a strong evangelism network, but for outside users without a
strong interest in the internals, I think it is a ways away.
