incubator-any23-dev mailing list archives

From Michele Mostarda <>
Subject Re: [DISCUSS] Questions on Basic-Crawler Module
Date Sat, 14 Jan 2012 15:37:28 GMT
On 13 January 2012 14:16, Lewis John Mcgibbney <>wrote:

> Hi Guys,
> OK further to my ridiculous question regarding where the module actually
> is, I would like to pose some more relevant thoughts.
> A while ago I opened NUTCH-1129 [1], based entirely on the suggestion which
> was included within the Incubator proposal for a Nutch Any23 plugin. As you
> know, currently the crawling in the basic-crawler plugin is done via
> crawler4j. At Apache we are great believers in eating our own dog food;
> therefore my proposal would be to remove the dependencies on crawler4j if I
> was building the Nutch implementation using instead Nutch interfaces and
> functionality. This kind of leads on to my question as to
> 1) Should the basic-crawler plugin be kept within Any23? My own thoughts
> are that it provides a real nice and easy way to test out Any23
> functionality, however should 'crawling' functionality be part of a project
> which describes itself as "a library, a web service and a command line tool
> that extracts structured data in RDF format from a variety of Web
> documents."?

> 2) The knock-on effect of removing this module and porting it directly to
> Nutch would be that to test out Any23 libraries within a crawler you would
> need a working knowledge of Nutch... this could be putting up barriers to
> adoption...
> 3) I'm assuming that a Nutch plugin would simply use Ivy to pull the
> any23-core library from the Apache repo and use this, I'm thinking of
> deduplicating as much code as possible between projects... Any ideas
Trust me Lewis, the ability to crawl the semantic content of a site with a
single command is priceless; a lot of users asked us to add a crawler to
Any23.

However the crawling functionality requires specific (and immature)
that's why it has been implemented as a plugin.

I didn't like crawler4j: it required some dirty workarounds to be used in
the plugin, but it was the only library providing exactly what we needed
for the purpose of the Crawl CLI.

I completely agree with the idea of replacing crawler4j with some ASF
alternative, on the condition that it remains easy to use as a CLI.

The best.


> Thanks
> [1]
> --
> *Lewis*

Michele Mostarda
Senior Software Engineer
skype: michele.mostarda
twitter: micmos
site :
