mahout-dev mailing list archives

From "Matteo Riondato (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAHOUT-980) Patch to make PFPGrowth run on Amazon MapReduce (also shows possible pattern to make other algorithms work in Amazon MapReduce)
Date Wed, 29 Feb 2012 22:19:57 GMT

     [ https://issues.apache.org/jira/browse/MAHOUT-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matteo Riondato updated MAHOUT-980:
-----------------------------------

    Attachment: PFPGrowth.java.diff

The patch.
                
> Patch to make PFPGrowth run on Amazon MapReduce (also shows possible pattern to make other algorithms work in Amazon MapReduce)
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-980
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-980
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.5, 0.6, 0.7
>         Environment: Amazon MapReduce
>            Reporter: Matteo Riondato
>              Labels: hadoop, patch
>             Fix For: 0.7
>
>         Attachments: PFPGrowth.java.diff
>
>
> The patch at http://www.cs.brown.edu/~matteo/PFPGrowth.java.diff (against trunk as of Wed Feb 22 00:07:35 EST 2012, revision 1292127) makes it possible to run PFPGrowth on Elastic MapReduce.
> The problem was in the way the fList stored in the DistributedCache was accessed. According to the Hadoop API documentation, DistributedCache.getCacheFiles(conf) should be reserved for internal use. The suggested way to access files in the DistributedCache is to call DistributedCache.getLocalCacheFiles(conf) and then read them through a LocalFileSystem. This is what the patch does. Note that there is a fallback case for when PFPGrowth is run with "-method mapreduce" but locally (e.g. when HADOOP_HOME is not set or MAHOUT_LOCAL is set). In this case, we use DistributedCache.getCacheFiles(), as the unpatched version does.
> A quick grep of the source tree shows other places where DistributedCache.getCacheFiles(conf) is used. It may be worth checking whether the corresponding algorithms can be made to work in Amazon MapReduce by fixing them in a similar fashion.
> The patch was also tested outside Amazon MapReduce and does not change any other functionality.
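
The lookup order described above (node-local cache files first, original cache URIs as a fallback for local runs) can be sketched without any Hadoop dependency. This is only an illustration of the pattern, not the code in PFPGrowth.java.diff: the class CacheFileResolver and its members are hypothetical, the two array parameters stand in for the results of DistributedCache.getLocalCacheFiles(conf) and DistributedCache.getCacheFiles(conf), and treating a null or empty local array as the sign of a local run is an assumption.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Mahout or Hadoop. It only models the
// "local cache first, original URI as fallback" decision from the issue.
public class CacheFileResolver {

    /** Which of the two lookups produced the result. */
    public enum Source { LOCAL_CACHE, ORIGINAL_URI }

    /** The chosen location of the cached file and where it came from. */
    public static final class Resolved {
        public final Source source;
        public final String location;

        Resolved(Source source, String location) {
            this.source = source;
            this.location = location;
        }
    }

    /**
     * localCacheFiles stands in for DistributedCache.getLocalCacheFiles(conf),
     * cacheFiles for DistributedCache.getCacheFiles(conf).
     * Assumption: a null or empty local array means the job runs locally.
     */
    public static Resolved resolve(Path[] localCacheFiles, URI[] cacheFiles) {
        if (localCacheFiles != null && localCacheFiles.length > 0) {
            // Cluster case: read the node-local copy through the local file system.
            return new Resolved(Source.LOCAL_CACHE, localCacheFiles[0].toString());
        }
        if (cacheFiles != null && cacheFiles.length > 0) {
            // Fallback case: behave like the unpatched code and use the original URI.
            return new Resolved(Source.ORIGINAL_URI, cacheFiles[0].toString());
        }
        throw new IllegalStateException("No file found in the distributed cache");
    }

    public static void main(String[] args) {
        // Example paths and URIs are made up for the demonstration.
        Resolved onCluster = resolve(
                new Path[] { Paths.get("taskcache", "fList") },
                new URI[] { URI.create("hdfs://namenode/tmp/fList") });
        System.out.println(onCluster.source + " " + onCluster.location);

        Resolved localRun = resolve(null, new URI[] { URI.create("file:///tmp/fList") });
        System.out.println(localRun.source + " " + localRun.location);
    }
}
```

Keeping the decision in one small method would make it easier to apply the same change to the other callers of DistributedCache.getCacheFiles(conf) that the grep turned up.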


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
