incubator-any23-dev mailing list archives

From Michele Mostarda <michele.mosta...@gmail.com>
Subject Re: [DISCUSS] Questions on Basic-Crawler Module
Date Sat, 14 Jan 2012 15:39:02 GMT
On 13 January 2012 14:21, Lewis John Mcgibbney <lewis.mcgibbney@gmail.com> wrote:

> Further to this, the Basic crawler plugin took some 4 mins to download
> dependencies, install and test...
>
> Seems a lot of overhead for a plugin which is not even mentioned in the
> project description. Considering the overall build took some 8 mins
> locally.
>

The Crawler plugin was added in milestone 0.7.0; the documentation
has not yet been written.

Mic


>
> ...
>
> On Fri, Jan 13, 2012 at 1:16 PM, Lewis John Mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
>
> > Hi Guys,
> >
> > OK further to my ridiculous question regarding where the module actually
> > is, I would like to pose some more relevant thoughts.
> >
> > A while ago I opened NUTCH-1129 [1], based entirely on the suggestion
> > included in the Incubator proposal for a Nutch Any23 plugin. As you
> > know, crawling in the basic-crawler plugin is currently done via
> > crawler4j. At Apache we are great believers in eating our own dog
> > food, so if I were building the Nutch implementation my proposal would
> > be to remove the dependency on crawler4j and use Nutch interfaces and
> > functionality instead. This leads on to my questions:
> >
> > 1) Should the basic-crawler plugin be kept within Any23? My own thoughts
> > are that it provides a real nice and easy way to test out Any23
> > functionality, however should 'crawling' functionality be part of a
> > project which describes itself as "a library, a web service and a
> > command line tool that extracts structured data in RDF format from a
> > variety of Web documents."?
> > 2) The knock-on effect of removing this module and porting it directly to
> > Nutch would be that to test out Any23 libraries within a crawler you
> > would need a working knowledge of Nutch... this could be putting up
> > barriers to adoption...
> > 3) I'm assuming that a Nutch plugin would simply use Ivy to pull the
> > any23-core library from the Apache repo and use this. I'm thinking of
> > deduplicating as much code as possible between projects... Any ideas?
> >
> > Thanks
> >
> > [1] https://issues.apache.org/jira/browse/NUTCH-1129
> >
> > --
> > *Lewis*
> >
> >
>
>
> --
> *Lewis*
>



-- 
Michele Mostarda
Senior Software Engineer
skype: michele.mostarda
twitter: micmos
mail: me@michelemostarda.com
site : http://www.michelemostarda.com
