incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "TikaProposal" by JukkaZitting
Date Wed, 07 Mar 2007 16:21:26 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The following page has been changed by JukkaZitting:
http://wiki.apache.org/incubator/TikaProposal

The comment on the change is:
Prepared the proposal page for public announcement

------------------------------------------------------------------------------
  = Tika, a content analysis toolkit =
  
- This is a draft version of the Tika proposal. Feel free to edit as you see fit, see also
the discussion section at the end of the page. You can "Subscribe" this page to get notified
whenever changes are made. See the [http://incubator.apache.org/guides/proposal.html proposal
guide] for a description of the expected proposal content.
+ ''This is a draft version of the Tika proposal. Please use the Incubator general mailing
list or the separate TikaProposalDiscussion page to discuss this proposal. You can "Subscribe"
this page to get notified whenever changes are made. See the [http://incubator.apache.org/guides/proposal.html
proposal guide] for a description of the expected proposal content.''
  
- See also the [http://thread.gmane.org/gmane.comp.search.nutch.devel/9684/focus=9693 earlier
proposal draft] by Chris Mattmann and Jerome Charron.
+ ''See also the [http://thread.gmane.org/gmane.comp.search.nutch.devel/9684/focus=9693 earlier
proposal draft] by Chris Mattmann and Jerome Charron.''
  
  === Abstract ===
  
@@ -16, +16 @@

  
  The toolkit is targeted especially for search engines and other content indexing and analysis
tools, but will be useful also for other applications that need to extract meaningful information
from documents that might be presented as nothing else than binary streams.
  
- Instead of implementing it's own document parsers, Tika will use existing parser libraries
like [http://jakarta.apache.org/poi/ Jakarta POI] and [http://www.pdfbox.org/ PDFBox].
+ Instead of implementing its own document parsers, Tika will use existing parser libraries
like [http://jakarta.apache.org/poi/ Jakarta POI] and [http://www.pdfbox.org/ PDFBox].
  
  === Background ===
  
@@ -52, +52 @@

  
  === Community ===
  
- There is not yet a clear Tika community. Instead we have a number of people and related
projects with an understanding that a shared toolkit project would best serve everyone's interests.
The primary goal of the incubating project is to build a a self-sustaining community around
this shared vision.
+ There is not yet a clear Tika community. Instead we have a number of people and related
projects with an understanding that a shared toolkit project would best serve everyone's interests.
The primary goal of the incubating project is to build a self-sustaining community around
this shared vision.
  
  === Core Developers ===
  
@@ -86, +86 @@

  
  Tika is related to at least the following Apache projects. None of the projects is a direct
competitor for Tika, but there are many cases of potential overlap in functionality.
  
-  * [http://lucene.apache.org/java/ Apache Lucene] - The analysis part of Lucene contains
code that might overlap with some of the potential Tika functionality. There migth also be
some overlap regarding the Document model in Lucene.
+  * [http://lucene.apache.org/java/ Apache Lucene] - The analysis part of Lucene contains
code that might overlap with some of the potential Tika functionality. There might also be
some overlap regarding the Document model in Lucene.
   * [http://lucene.apache.org/nutch/ Lucene Nutch] - The Nutch project already contains a
parser framework that does many of the things that Tika is designed to do.
   * [http://jackrabbit.apache.org/ Apache Jackrabbit] - The Jackrabbit project contains a
text extraction component that also implements a subset of the proposed Tika features.
-  *  [http://incubator.apache.org/uima/ Apache UIMA] - The UIMA project provides a framework
and pluggable tools for analyzing text content and extracting information. Example tools include
language identification, sentence boundary detection and "entity extraction" - finding  references
to people, places and organisations. TIKA could be used by UIMA to parse text but TIKA should
be careful not to duplicate the subsequent text analysis features UIMA offers.
+  *  [http://incubator.apache.org/uima/ Apache UIMA] - The UIMA project provides a framework
and pluggable tools for analyzing text content and extracting information. Example tools include
language identification, sentence boundary detection and "entity extraction" - finding references
to people, places and organisations. Tika could be used by UIMA to parse text but Tika should
be careful not to duplicate the subsequent text analysis features UIMA offers.
  
  === An Excessive Fascination with the Apache Brand ===
  
@@ -112, +112 @@

  
  Tika will start with a combination of seed code from the efforts listed below:
  
-  * The [http://lucene.apache.org/nutch Apache Nutch] project, that contains a parser framework
and various content analysis tools
+  * The [http://lucene.apache.org/nutch Apache Nutch] project that contains a parser framework
and various content analysis tools
   * The [http://sourceforge.net/projects/lius/ Lius project], an indexing framework for Apache
Lucene
-  * The [http://jackrabbit.apache.org/ Apache Jackrabbit] project, that contains a text extraction
component
+  * The [http://jackrabbit.apache.org/ Apache Jackrabbit] project that contains a text extraction
component
  
  No existing codebase is selected as "the" starting point of Tika to avoid inheriting the
world view and design limitations of any single project.
  
@@ -128, +128 @@

  
  Tika will depend on a number of external parser libraries with various licensing conditions.
An initial list of potential dependencies is shown below.
  
- || '''Library'''                                                       || '''License'''        ||
+ || '''Library'''                                                       || '''License'''         ||
- || [http://jakarta.apache.org/poi/ Jakarta POI]                        || ASLv2                ||
+ || [http://jakarta.apache.org/poi/ Jakarta POI]                        || ASLv2                 ||
- || [http://www.pdfbox.org/ PDFBox]                                     || BSD                  ||
+ || [http://www.pdfbox.org/ PDFBox]                                     || BSD                   ||
- || [http://people.apache.org/~andyc/neko/doc/html/index.html NekoHTML] || CyberNeko (like ASL) ||
+ || [http://people.apache.org/~andyc/neko/doc/html/index.html NekoHTML] || !CyberNeko (like ASL) ||
- || [http://jtidy.sourceforge.net/ JTidy]                               || W3C                  ||
+ || [http://jtidy.sourceforge.net/ JTidy]                               || W3C                   ||
  
  There are also some LGPL parser libraries that would be useful. Whether and how such dependencies
could be handled will be discussed during incubation. No such dependencies will be added to
the project before the legal implications have been cleared.
  
@@ -171, +171 @@

  
  == Affiliations ==
  
- || '''Name'''        || '''Affiliation'''                       ||
+ || '''Name'''        || '''Affiliation'''                                         ||
- || Jukka Zitting     || [http://www.day.com/ Day Management AG] ||
+ || Rida Benjelloun   || [http://www.doculibre.com/index_en.html Doculibre inc.]   ||
  || Chris A. Mattmann || [http://www.jpl.nasa.gov/ NASA Jet Propulsion Laboratory] ||
+ || Jukka Zitting     || [http://www.day.com/ Day Management AG]                   ||
- ||Rida Benjelloun    || [http://www.doculibre.com/index_en.html Doculibre inc.] ||
- 
  
  == Sponsors ==
  
@@ -192, +191 @@

  
   * ''TODO: Apache Lucene (vote needed)''
  
- ----
- 
- == Discussion ==
- 
-  * Use this area for discussing the contents of the proposal. - Jukka Zitting
- 
- I am not sure if there are already some existing apache projects dealing with this licensing
issue, but how do you see option where Tika would be using a build system like maven2 and
would release through maven repository -> no offending code/libraries in repository nor
in releases. Of the downside (atm) of this is that not all of those libs are available from
common m2 repos. - Sami Siren
- 
- AFAIUI we can't even '''import''' any (L)GPL classes in Apache code. See the draft [http://people.apache.org/~cliffs/3party.html
Third-Party Licensing Policy] page for the details. One way to work around this problem (and
IMHO a good solution generally) would be to provide some SPI interface in Tika and use a service
provider mechanism to dynamically bind to all the implementations available at runtime. This
would invert the compile-time dependencies and a user would only need to add the parser libraries
whose functionality is needed as extra dependencies in addition to Tika. - Jukka Zitting
- 
- AFAIU the importing is the only remaining "issue" currently (in addition to politics), if
the proposed changes "go through" it is not anymore. By SPI you probably don't mean javas
standard SPI mechanism, because that would need those libs to implement Tikas not yet existing
interfaces? - Sami Siren
- 
- Yeah, I'm also hoping for a reasonable solution to the LGPL issue, that's one of the reasons
for listing also LGPL libraries above and leaving the the actual dependency policy open to
be decided during incubation. Re: SPI; yes, that was my intention. Of course we can't expect
to get such phantom interfaces implemented yet, but we could well start with small separately
released integration components and advocate getting them included in the upstream libraries
once Tika starts gaining more recognition. - Jukka Zitting
- 
- I don't like listing LGPL'd libraries as dependencies.  It raises a red flag and thus invites
lengthy, non-productive discussions.  We will adhere to Apache's Third-Party Licensing Policy,
which means we cannot require LGPL'd stuff.  A "dependency" sounds like a requirement.  So,
while we'll probably permit the use of some optional LGPL'd libraries, let's not list them
as dependencies, ok?  -- DougCutting
- 
- OK, removed the LGPL libraries. - Jukka Zitting
- 


