mahout-user mailing list archives

From Grant Ingersoll
Subject Re: Mahout interest at Berlin universities
Date Mon, 06 Jul 2009 16:54:40 GMT

On Jul 5, 2009, at 3:37 PM, Isabel Drost wrote:

> People seem to slowly become aware that there is something named Hadoop that
> implements a framework for parallel programming once developed at Google.
> However, the basic assumptions and implications (e.g. data locality) are known
> only by a few groups/people, at least in the IR and data mining domains.

This is always the case with new things.  It is impossible to keep up  
with all the things happening.  It's why it is important to keep  
trying to raise visibility like we are doing.

FWIW, I see the same here, although Hadoop has a lot of buzz right now.

> Anytime I asked people using Apache software whether they are subscribed
> to the corresponding user mailing list, the answer was a questioning face and
> a no. I tried to make clear why participation is important - I
> guess we will see in the near future whether I was successful ;)

Participation takes a whole other level of commitment.   People need  
to be able to quickly see the benefit or be willing to be on the  
cutting edge.  It's hard to join a project in the early stages because  
it may very well be the case that the project doesn't make it.  I
think the ASF raises the chances of success, but it doesn't guarantee it.

> I was surprised to see people only vaguely aware of the GSoC program. They
> knew that it exists, but the general setup was not as widely known as I
> would have expected. After all, in our GSoC proposals there seemed to
> be quite a few students co-supervised by their university.

GSoC is relatively small, so I don't find it that surprising.  And
they cut back this year, too.

> Concerning Mahout I got varying feedback: a few who had a look at
> it last autumn found it difficult to find the source code and
> documentation. Some students had a look shortly after ApacheCon EU this year
> and found it hard to set up a demo application. I think having some JavaDoc,
> tutorial, and setup documentation for each release version on our website
> might help people get started more easily?

I've been working on this a lot lately and agree it is important for  
us for 0.2.  Some rework of the landing web page to include quicker  
links to source, etc. would be helpful.

Having some sites in production will also be useful, once we get  
there.  All in good time.  The key right now is for us committers to  
make sure we are reviewing patches, improving the code and helping new  
contributors feel welcome and help them become committers when  

> Other than that, general feedback seemed to be that we are doing "surprisingly
> well" both in terms of emerging community and in terms of implementation
> progress over the first year.


> Last but not least: from DIMA at TU Berlin I received the offer to do
> a "Mahout seminar". It would consist of two parts: a theoretical one where
> students read scientific publications, prepare a survey and give a talk by
> the end of the semester. The other part would be a project where they could
> work, for instance, on some algorithm implementation or integrate already
> existing implementations in a project. The goal would be to strengthen their
> programming and project management skills and along the way have them
> contribute back to the community.


> My first thought was to prepare a task with the goal of building a new
> blog "search engine". They could build a system that identifies clusters of
> blogs on a common topic, work on the link graph in the blogosphere, detect
> newly emerging topics and the like. Before preparing the final seminar
> proposal, I would like to ask you whether there is anything you might want
> those students to work on during their winter term.

That sounds pretty involved to get done in a semester, but maybe it
depends on the level of the students.  I could also see things like
benchmarking, setting up clusters and running/tuning them, creating demos,
etc.  In other words, let them do a couple of projects.
