crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Crunch, Mahout, and HCatalog
Date Fri, 22 Mar 2013 16:37:19 GMT
Hey all,

I'm working on some tools for doing data integration and building machine
learning models w/Crunch, Mahout, and (soon!) HCatalog, and I wrote about
what I'm up to here:

http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/

and the code is here: https://github.com/cloudera/ml

I wanted to answer a couple of questions preemptively, if you don't mind:

Q: Why?
A: I started planning out the next version of my data science course, and I
was concerned that my students were going to spend too much time on data
integration tasks (e.g., converting CSVs to Vectors) that really should be
automated. I obviously enjoy writing my Java MR stuff in Crunch, and I
thought it would be a good idea to open source the tools to showcase how
awesome Crunch can be.

Q: Why not do this as part of the Crunch or Mahout projects?
A: Dependency management. Crunch doesn't depend on Mahout, and Mahout
doesn't depend on Crunch, and I think that for the sanity of the developers
of both projects, it should stay that way. Dependency management is already
enough of a nightmare for Hadoop projects that I didn't want to do anything
to make it worse. I will contribute anything from the toolkit back to
Crunch that is deemed useful by the community (e.g., the reservoir sampling
stuff in CRUNCH-178) and doesn't introduce any new dependencies.

Q: Where is this going?
A: I'm going to be co-developing the tools and the coursework for the
class, so I have a reasonably good idea of what features I need to add,
with HCatalog integration and ensemble models being the two major items on
the TODO list. I'm not looking to build a tool for every ML algorithm ever
invented, just a small set of core models that are easy to use, easy
to tune, and thus easy for new data scientists to get started with.

If there's anything else folks are curious about, please just let me know
and I'd be happy to answer.

Josh

-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
