lucene-dev mailing list archives

From Thorsten Scherler <thors...@apache.org>
Subject Re: [jira] Lius into apache incubator
Date Thu, 01 Mar 2007 23:00:22 GMT
Renaud forwarded me the thread and I have only just subscribed, 
so apologies for not responding properly.

Thanks, Renaud, for the heads-up.

> Hi,
> 
> On 3/1/07, Grant Ingersoll <gsingers@apache.org> wrote:
> > Is the Droids lab at all related to that parsing project in Nutch?
> 
> Partly, yes. I've been looking at Droids and so far I think its main
> focus has been on the crawling part rather than on the analysis of
> retrieved content. 

Yes, Droids should be a generic crawler framework. I took Nutch, 
ripped out the plugin/extension point framework, and wrote some PoC plugins. 
I changed many things to make the code simpler, so not much of the original 
Nutch code is left. Furthermore, I am using Ivy for dependency management 
for the core and the plugins.

The first crawler is not close to the one from Nutch, but via plugins one
could implement the same functionality (but there is ATM no interest on
Nutch). The implemented crawler, x-m02y07, is more wget style (very basic
for now): request a URL, extract the links, and save the page to disk.
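
To make the wget-style flow concrete, here is a minimal sketch in Java of the
three steps (request a URL, extract the links, save the page to disk). It is
not the Droids x-m02y07 code: the class name WgetStyleCrawl, both method
names, the regex-based href matching, and the file naming scheme are all
assumptions made for illustration only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a wget-style crawl step; not taken from Droids.
class WgetStyleCrawl {

    // Naive href matcher (assumption): captures the value up to the closing
    // quote or a '#' fragment. A real crawler would use an HTML parser.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    /** Extracts href values from the HTML and resolves them against the page URL. */
    static List<URI> extractLinks(URI base, String html) {
        List<URI> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                links.add(base.resolve(new URI(m.group(1).trim())));
            } catch (URISyntaxException e) {
                // Skip href values that are not valid URIs.
            }
        }
        return links;
    }

    /** Requests the URL, saves the body under dir, and returns the extracted links. */
    static List<URI> fetchAndSave(URI url, Path dir) throws IOException {
        byte[] body;
        try (InputStream in = url.toURL().openStream()) {
            body = in.readAllBytes();
        }
        Files.createDirectories(dir);
        // File name derived from the URL; purely illustrative.
        String name = url.toString().replaceAll("[^A-Za-z0-9.-]", "_");
        Files.write(dir.resolve(name), body);
        // Assumes the page is UTF-8; a real crawler would honor the declared charset.
        return extractLinks(url, new String(body, StandardCharsets.UTF_8));
    }
}
```

A crawl loop would call fetchAndSave for a start URL and feed the returned
links back into a queue, which is roughly the part that plugins could replace
with something smarter.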

> A generic content analysis toolkit would likely be
> a great companion for Droids.

Yes indeed. I am ATM playing with 
http://simile.mit.edu/repository/crowbar/trunk/ 

Stefano pointed me to it, and it is very interesting: the idea is 
to use a Gecko-based browser as a server that browses a page and lets the 
browser analyze it. This enables a crawler to index Web 2.0 components 
such as Ajax.

http://mail-archives.apache.org/mod_mbox/labs-labs/200702.mbox/browser

The core of any crawler is link recognition, and there we can take
different routes. In the short term we can enhance the parse-html Droids
plugin with NekoHTML (a similar route to the one Nutch is taking), but in
the long run we should try to incorporate a virtual browser, as Stefano
pointed out on the labs ml.

>  In fact I was earlier contemplating
> about starting a related effort in Apache Labs (see
> http://issues.apache.org/jira/browse/JCR-728),

That seems to aim more at closing the MIME type gap that we have ATM, and I 
think labs would be the right place for this.

>  but there seems to be
> enough demand for such functionality that a more full-fledged project
> might be better.

Maybe you are interested in starting some plugins in Droids; as soon as 
we have some community around the code, we can request incubation. 

Some Forrest folks also expressed their interest in Droids. 
Actually, Forrest/Cocoon was one of the main reasons I started it.
The other was Solr.


> 
> > There seems to be several efforts that are related here that could
> > probably make for a nice new project under Lucene, IMO.  They all
> > seem to have to do with getting and preparing text for processing by
> > some type of consumer of text.
> 
> Exactly. It would be great to see some consolidation of efforts.
> 

The great advantage of labs is that all Apache committers have write
access, meaning cross-project efforts like this one are perfect to get
started in labs. If enough people are attracted, the lab gets promoted. 
When a lab is promoted, the files are moved over to the incubation area.
http://labs.apache.org/bylaws.html

Looking forward to seeing you on
labs@labs.apache.org

salu2
-- 
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java & XML                consulting, training and solutions



