mahout-dev mailing list archives

From Otis Gospodnetic <>
Subject Re: GSoC 2009-Discussion
Date Mon, 23 Mar 2009 21:40:27 GMT

Mmmm.... :)  This would definitely be very useful to anyone dealing with web page parsing
and indexing.

Sematext -- Lucene - Solr - Nutch

----- Original Message ----
> From: Samuel Louvan <>
> To:
> Sent: Sunday, March 22, 2009 7:17:11 PM
> Subject: GSoC 2009-Discussion
> Hi,
> I just browsed through the GSoC 2009 idea list and I'm interested
> in working on Apache Mahout.
> Currently, I'm doing my master's project at my university, related to
> machine learning + information retrieval. More specifically,
> it's about how to discover informative content in a web page using
> a machine learning approach.
> Overall, there are two stages to this task: web page segmentation
> and locating the informative content.
> The web page segmentation process takes a DOM tree representation of
> an HTML document and groups the DOM nodes at a certain granularity.
> Next, a binary classification is performed on the DOM nodes to decide
> whether each one is informative or non-informative content. The
> features used for the classification are, for example, inner HTML
> length, inner text length, stop word ratio, offsetHeight, the
> coordinates of the HTML element in the browser, etc.
> The dataset is generated by a labeling program that I wrote (for
> supervised learning). Basically, a user can select and annotate a
> particular segment of the web page and then mark its class label as
> informative content or non-informative content.
> I did some small experiments with this last semester. I played with
> WEKA and tried several algorithms, namely random forest, decision
> tree, SVM, and neural network. In these experiments, random forest
> and decision tree yielded the most satisfying results.
> Currently, I'm working on my master's project and will implement a
> machine learning algorithm, either decision tree or random forest,
> for the classifier. For this reason, I'm very interested in working
> on Apache Mahout in this year's GSoC to implement one of those
> algorithms.
> My questions:
> - I just noticed in the mailing list archive that another student is
>   also pretty serious about implementing the random forest algorithm.
>   Should I select decision tree instead (for my future GSoC proposal)?
> - I think it would be interesting to combine Apache Nutch and Mahout,
>   so the idea is to implement web page segmentation + classifier
>   inside a web crawler. By doing this, a crawler can, for instance,
>   use the output of the classification to follow only those links
>   that lie in informative content parts.
>   Does this sound interesting and make sense to you guys?
> For more details, you can download my presentation slides and
> master's project description at
> A little bit of background about me: I'm a 2nd-year master's student
> at TU Eindhoven, Netherlands.
> Last year I also participated in GSoC with OpenNMS
> (
> Looking forward to your feedback and input.
> Regards,
> Samuel L.
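
For readers of the archive, a few of the text-based features listed in the quoted message (inner HTML length, inner text length, stop word ratio) could be computed roughly as follows. This is only a minimal sketch and not the code from the project described above: the function names, the tiny stop-word list, and the regex-based tag stripping are all assumptions made for illustration. The layout features (offsetHeight, element coordinates) need a rendering browser and are left out.

```python
import re

# Hypothetical, deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"}


def strip_tags(inner_html):
    """Crude tag removal; a real system would read the text from the DOM."""
    return re.sub(r"<[^>]+>", " ", inner_html)


def segment_features(inner_html):
    """Return a few text-based features for one page segment."""
    text = strip_tags(inner_html)
    words = re.findall(r"[A-Za-z']+", text.lower())
    stop_count = sum(1 for w in words if w in STOP_WORDS)
    return {
        "inner_html_length": len(inner_html),
        # Collapse whitespace runs so markup spacing does not inflate the length.
        "inner_text_length": len(" ".join(text.split())),
        # Guard against empty segments to avoid dividing by zero.
        "stop_word_ratio": stop_count / len(words) if words else 0.0,
    }
```

A feature dictionary like this, one per labeled segment, is the kind of row a decision tree or random forest classifier would be trained on.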
