mahout-user mailing list archives

From prasenjit mukherjee <prasen....@gmail.com>
Subject Re: plsi in pig
Date Wed, 11 Feb 2009 15:09:59 GMT
So I created a JIRA issue:
https://issues.apache.org/jira/browse/MAHOUT-106 and also submitted a
patch along with README instructions. Please feel free to try it out with
different input samples. The default behaviour is to run Pig in local
mode. I'd appreciate any suggestions or reviews.

-Prasen

On Wed, Feb 11, 2009 at 5:32 PM, Grant Ingersoll <gsingers@apache.org> wrote:
> This is excellent, Prasen.
>
> I see no reason not to include them.  We are about ML first,
> distributed/scalable ML second and Hadoop-based third, IMO.  Java would be a
> distant fourth in my mind.  In other words, I don't feel particularly strong
> about us being Java only or even Hadoop only.  To me there is a significant
> need for community-developed machine learning capabilities with a commercial
> friendly license.  Add in the ability to scale/run efficiently and you have
> a home run.  In fact, those are the very reasons we founded Mahout.
>
>
> On Feb 11, 2009, at 6:40 AM, prasenjit mukherjee wrote:
>
>> Pig is a higher-level language (much like Sawzall for Google's
>> MapReduce) on top of Hadoop which makes Hadoop easier to use.
>>
>> It has an SQL-like syntax and can break a command into separate
>> MapReduce tasks and chain them. From an execution point of view, Pig
>> scripts are as simple as running a shell script with very few
>> operators/commands.
>>
>> Some of its commands are join, group, cogroup, load etc.
>>
>> For example, the following Pig script takes a logfile in the format
>> <txid>,<txt>,<user> and outputs a user-term-frequency file in the
>> following format: <user>\t<term>\t<cnt>
>>
>> raw = load 'tx_log.csv' using PigStorage(',') AS
>> (transactionid:chararray, txt:chararray, user:chararray);
>> tokenized = FOREACH raw GENERATE user, flatten(TOKENIZE(txt)) as
>> attribute;
>> grouped = group tokenized by (user,attribute);
>> user_term_freq = foreach grouped generate flatten(group),COUNT(tokenized);
>> store user_term_freq into 'user_term_freq.txt';
>>
>> At runtime, Pig takes the input and breaks the work into several map and
>> reduce tasks. It picks up hadoop-site.xml from its classpath.
>>
>> -Prasen
>>
>> On Wed, Feb 11, 2009 at 4:54 PM, Sean Owen <srowen@gmail.com> wrote:
>>>
>>> Needs to go somewhere like trunk/core/src/pig/main right, versus /java/ ?
>>>
>>> I also see no harm in adding it, other than that it would remain
>>> pretty isolated, right? It isn't part of the build, can't be integrated
>>> with the other code, etc. Does it add value to package it with the
>>> project then?
>>>
>>> Perhaps I misunderstand what Pig can do or how it can relate to Java?
>>>
>>> On Wed, Feb 11, 2009 at 11:13 AM, Grant Ingersoll <gsingers@apache.org>
>>> wrote:
>>>>
>>>> Hmm, hadn't really thought about it, but I see no reason why we wouldn't
>>>> accept it and add it.  I think our source tree can definitely handle it.
>>>>
>>>> I'd propose it go somewhere under:
>>>> trunk/core/src/main/pig/plsi
>>>>
>>>> I'm not familiar with Pig, but I can learn, and I know others are. Is it
>>>> a single file?
>>>>
>>>> See http://cwiki.apache.org/MAHOUT/howtocontribute.html for instructions
>>>> on contributing.  Basically, just attach the file(s) to a JIRA issue.
>>>>
>>>> On Feb 11, 2009, at 2:18 AM, prasenjit mukherjee wrote:
>>>>
>>>>> Hi,
>>>>> I have implemented Hofmann's PLSI/EM algorithm in Pig, which I would
>>>>> like to contribute back to the community for further
>>>>> scrutiny/improvement.  Let me know if Mahout is the appropriate
>>>>> forum or whether it should go to the Pig project.
>>>>>
>>>>> I haven't seen any non-Java contributions to Mahout yet, which raises
>>>>> the question: is Mahout Java only?
>>>>>
>>>>> -Thanks,
>>>>> Prasen
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://www.lucidimagination.com/
>>>>
>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>>>> Solr/Lucene:
>>>> http://www.lucidimagination.com/search
>>>>
>>>>
>>>

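For readers unfamiliar with Pig, the user-term-frequency script quoted in this thread can be sketched in plain Python. This is only an illustration of what the script computes, not part of the MAHOUT-106 patch. The function names and the sample log lines are made up for the example, and a plain whitespace split stands in for Pig's TOKENIZE, which may split on additional delimiters.

```python
from collections import Counter
import csv


def user_term_freq(lines):
    """Count how often each (user, term) pair occurs in a transaction log.

    Each line is assumed to look like '<txid>,<txt>,<user>', with no commas
    inside the text field, as in the quoted example.
    """
    counts = Counter()
    for txid, txt, user in csv.reader(lines):
        # Assumption: whitespace splitting approximates Pig's TOKENIZE.
        for term in txt.split():
            counts[(user, term)] += 1
    return counts


def format_rows(counts):
    """Render counts as tab-separated user, term, count lines (sorted)."""
    return [f"{user}\t{term}\t{cnt}"
            for (user, term), cnt in sorted(counts.items())]


# Hypothetical sample input, invented for illustration.
sample_log = [
    "1,buy milk milk,alice",
    "2,buy bread,bob",
    "3,milk,alice",
]
rows = format_rows(user_term_freq(sample_log))
for row in rows:
    print(row)
```

The Counter keyed on (user, term) plays the role of the group-by followed by COUNT. Pig performs the same grouping, but as map and reduce tasks spread across the cluster rather than in a single process.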