incubator-any23-dev mailing list archives

From Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
Subject Re: [DISCUSS] Questions on Basic-Crawler Module
Date Sat, 14 Jan 2012 16:35:04 GMT
Hi Michele,

I was thinking about replying to my original thread with some of the points
you make, as I completely agree with your logic. Simone also mentioned the
importance of keeping the basic-crawler as a plugin and I agree with this
as well.

Once we get the Any23 packages changed to o.a.any23 rather than
a.deri.any23, which will allow us to push it to Apache Nexus, I'll begin
work on the Nutch-Any23 plugin. We'll take it from there.

Thanks for getting back to me with your thoughts.

On Sat, Jan 14, 2012 at 3:39 PM, Michele Mostarda <
michele.mostarda@gmail.com> wrote:

> On 13 January 2012 14:21, Lewis John Mcgibbney <lewis.mcgibbney@gmail.com>
> wrote:
>
> > Further to this, the Basic crawler plugin took some 4 mins to download
> > dependencies, install and test...
> >
> > Seems a lot of overhead for a plugin which is not even mentioned in the
> > project description. Considering the overall build took some 8 mins
> > locally.
> >
>
> The Crawler plugin was added in milestone 0.7.0; the documentation
> has not yet been written.
>
> Mic
>
>
> >
> > ...
> >
> > On Fri, Jan 13, 2012 at 1:16 PM, Lewis John Mcgibbney <
> > lewis.mcgibbney@gmail.com> wrote:
> >
> > > Hi Guys,
> > >
> > > OK further to my ridiculous question regarding where the module actually
> > > is, I would like to pose some more relevant thoughts.
> > >
> > > A while ago I opened NUTCH-1129 [1], based entirely on the suggestion
> > > which was included within the Incubator proposal for a Nutch Any23 plugin.
> > > As you know, currently the crawling in the basic-crawler plugin is done via
> > > crawler4j. At Apache we are great believers in eating our own dog food,
> > > therefore my proposal would be to remove the dependencies on crawler4j if I
> > > were building the Nutch implementation, using Nutch interfaces and
> > > functionality instead. This kind of leads on to my questions:
> > >
> > > 1) Should the basic-crawler plugin be kept within Any23? My own thoughts
> > > are that it provides a real nice and easy way to test out Any23
> > > functionality, however should 'crawling' functionality be part of a project
> > > which describes itself as "a library, a web service and a command line tool
> > > that extracts structured data in RDF format from a variety of Web
> > > documents."?
> > > 2) The knock-on effect of removing this module and porting it directly to
> > > Nutch would be that to test out Any23 libraries within a crawler you would
> > > need a working knowledge of Nutch... this could be putting up barriers to
> > > adoption...
> > > 3) I'm assuming that a Nutch plugin would simply use Ivy to pull the
> > > any23-core library from the Apache repo and use this, I'm thinking of
> > > deduplicating as much code as possible between projects... Any ideas?
> > >
> > > Thanks
> > >
> > > [1] https://issues.apache.org/jira/browse/NUTCH-1129
> > >
> > > --
> > > *Lewis*
> > >
> > >
> >
> >
> > --
> > *Lewis*
> >
>
>
>
> --
> Michele Mostarda
> Senior Software Engineer
> skype: michele.mostarda
> twitter: micmos
> mail: me@michelemostarda.com
> site : http://www.michelemostarda.com
>



-- 
*Lewis*
