nutch-dev mailing list archives

From "Chris Mattmann" <chris.mattm...@jpl.nasa.gov>
Subject RE: [Proposal] New Lucene sub-project
Date Mon, 24 Apr 2006 18:26:38 GMT
Hi Otis,

> This thread seems to have gotten very little attention.
> Jérôme - I'm all for extracting sub-libraries that can really live on their
> own and are substantial enough to warrant "their own identity".
> 
> Personally, I'm the most interested in Language Identifier plugin becoming
> a standalone, Nutch-independent piece.  Doug had suggested we move it to
> Lucene's contrib section.  If you think it makes sense to have some of
> these things lumped together, that's fine, too.  It looks like Language
> Identifier and Charset Detector may go well together.
> 
> Is this something you want/will push for and make happen?

Just to add to this, it's something that I would push for wholeheartedly.
In addition to Jérôme, I would be happy to dedicate time to this
sub-project, and I feel it's quite worthy of being its own standalone
library.

Just my two cents, thanks!

Cheers,
  Chris


> 
> Otis
> 
> ----- Original Message ----
> From: Jérôme Charron <jerome.charron@gmail.com>
> To: nutch-dev@lucene.apache.org
> Sent: Friday, April 7, 2006 4:26:54 AM
> Subject: [Proposal] New Lucene sub-project
> 
> Hi all,
> 
> While chatting with Chris Mattmann, it seems to be evident to us that there
> is a need for a new sub-project within Lucene.
> 
> For now, the Lucene sub-projects used in Nutch are:
> 1. Lucene-java - The basis for search technology
> 2. Hadoop - The distributed computing platform
> 3. Nutch - The search engine that relies on Lucene and Hadoop.
> 
> Since Nutch contains some value-added pieces of code that focus on content
> analysis, we think it would be a good idea to split these out of Nutch into
> a new sub-project based on content analysis and manipulation. The components
> we have identified are:
> 
> 1. MimeType Repository
> 2. Language Identifier
> 3. Content Signature (MD5Signature / TextProfileSignature / ...)
> (4. Generic Meta Data Infrastructure)
> (5. Charset Detector)
> (6. Parse Plugins Framework)
> 
> The idea is to expose these pieces of code in a standalone lib, since we
> are convinced they could be useful in many projects other than Nutch.
> The benefit would be to have the code more widely used / tested /
> contributed to.
> If this proposal is accepted, we have a candidate name for this new project:
> Tika (comes from my son  ;-) )
> 
> Any comment is welcome.
> 
> Jérôme
> 


