lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "DocumentProcessing" by levitski
Date Wed, 15 Jun 2011 23:08:38 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "DocumentProcessing" page has been changed by levitski:
http://wiki.apache.org/solr/DocumentProcessing?action=diff&rev1=15&rev2=16

  
  = Problem =
  Solr would benefit from a flexible document processing framework meeting the requirements
of enterprise grade content integration. Most search projects have some need for processing
the incoming content prior to indexing, for example:
+ 
   * Language identification
   * Text extraction (Tika)
   * Entity extraction and classification
@@ -16, +17 @@

  
  There are many processing pipeline frameworks from which to get inspiration, such as the
one in FAST ESP, [[http://www.openpipeline.org/|OpenPipeline]], [[http://openpipe.berlios.de/|OpenPipe]]
(now on [[https://github.com/kolstae/openpipe|GitHub]]), [[http://www.pypes.org/|Pypes]],
[[http://uima.apache.org/|UIMA]], [[http://www.eclipse.org/smila/|Eclipse SMILA]], [[http://commons.apache.org/sandbox/pipeline/|Apache
commons pipeline]], [[http://found.no/products/piped/|Piped]] and others. Indeed, some of
these are already being used with Solr as a pre-processing server. This gives loose coupling
but also little re-use of code, since each new project must choose which of these pipelines
to invest in.
  
- The community would benefit from an official processing framework -- and more importantly
an official repository of processing stages which are shared and reused. The sharing part
is crucial. If a company develops, say a Geo``Names stage to translate address into lat/lon,
the whole community can benefit from that by fetching the stage from the shared repository.
This will not happen as long as there is not one single preferred integration point.
+ The community would benefit from an official processing framework -- and, more importantly,
an official repository of processing stages that are shared and reused. The sharing part
is crucial: if a company develops, say, a GeoNames stage that translates addresses into lat/lon,
the whole community can benefit by fetching the stage from the shared repository.
This will not happen as long as there is no single preferred integration point.
  
- There have recently been interest in the Solr community for such a framework. See [[http://lucene-eurocon.org/slides/A-Pipeline-for-Solr_Charas-Jansson.pdf|this
presentation]] from Lucene Eurocon 2010 as well as [[http://findabilityblog.se/solr-processing-pipeline|this
blog post]] for thoughts from Find``Wise, as well as the recent solr-user thread [[http://search-lucene.com/m/pFegS7BQ7k2|Pipeline
for Solr]].
+ There has recently been interest in the Solr community in such a framework. See [[http://lucene-eurocon.org/slides/A-Pipeline-for-Solr_Charas-Jansson.pdf|this
presentation]] from Lucene Eurocon 2010, [[http://findabilityblog.se/solr-processing-pipeline|this
blog post]] with thoughts from FindWise, and the recent solr-user thread [[http://search-lucene.com/m/pFegS7BQ7k2|Pipeline
for Solr]].
  
  = Solution proposal =
  Develop a simple, scalable, easily scriptable and configurable document processing framework
for Solr that builds on existing best practices. The framework should be simple and lightweight
enough for use with a single Solr node, yet powerful enough to scale out to a separate document
processing cluster simply by changing configuration.
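To make the proposal concrete, here is a minimal sketch of what a stage-and-pipeline API could look like in Java. All names here (`PipelineDocument`, `Stage`, `Pipeline`) are hypothetical illustrations invented for this sketch, not part of any existing Solr API or patch:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical document representation: a mutable map of field names to values.
class PipelineDocument {
    final Map<String, Object> fields = new LinkedHashMap<>();
    PipelineDocument set(String name, Object value) { fields.put(name, value); return this; }
    Object get(String name) { return fields.get(name); }
}

// A processing stage transforms a document and passes it on.
interface Stage {
    PipelineDocument process(PipelineDocument doc);
}

// A pipeline is itself a stage: an ordered list of stages applied in sequence,
// so pipelines can be nested and reconfigured purely through configuration.
class Pipeline implements Stage {
    private final List<Stage> stages;
    Pipeline(List<Stage> stages) { this.stages = stages; }
    public PipelineDocument process(PipelineDocument doc) {
        for (Stage stage : stages) {
            doc = stage.process(doc);
        }
        return doc;
    }
}

public class PipelineSketch {
    public static void main(String[] args) {
        // Example stage: a deliberately naive language-identification stub.
        Stage langId = doc -> {
            String text = (String) doc.get("text");
            doc.set("language", text != null && text.contains("the") ? "en" : "unknown");
            return doc;
        };
        Pipeline pipeline = new Pipeline(List.of(langId));
        PipelineDocument doc = pipeline.process(
                new PipelineDocument().set("text", "the quick brown fox"));
        System.out.println(doc.get("language")); // prints "en"
    }
}
```

Because a `Pipeline` is just another `Stage`, a shared repository of stages could be composed per project in configuration rather than in code.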
@@ -78, +79 @@

  
  = Q&A =
  == Your question here ==
- 
   * Q: Is there a JIRA issue that tracks the development of this feature?
   * A: Not yet
  
@@ -86, +86 @@

   * A: SOLR-2129 is an UpdateProcessor for UIMA (see [[http://wiki.apache.org/solr/SolrUIMA|SolrUIMA]]).
Here we're talking about improving the whole UpdateProcessor framework, either by replacing
it or by enhancing the existing one.
  
   * Q: Will the pipelines have to be linear? For instance, could we implement a first stage
in the pipeline that acts as a splitter? The splitter could, for example, break up a large
XML document into chapters, then push each chapter to the next stage, where further processing
would take place. In the end, the Lucene index would have one document per chapter.
-  * A: 
+  * A:
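As one illustration of the splitter question above, a non-linear stage would emit several output documents per input instead of exactly one. The sketch below is hypothetical (the `SplittingStage` interface, `ChapterSplitter`, and its naive chapter-boundary rule are invented for illustration, not an answer from the proposal):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical non-linear stage contract: one input may yield many outputs.
interface SplittingStage {
    List<String> split(String document);
}

public class ChapterSplitter implements SplittingStage {
    // Naive rule for the sketch: lines starting with "Chapter" begin a new chapter.
    public List<String> split(String document) {
        List<String> chapters = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : document.split("\n")) {
            if (line.startsWith("Chapter") && current.length() > 0) {
                chapters.add(current.toString());
                current.setLength(0);
            }
            current.append(line).append("\n");
        }
        if (current.length() > 0) {
            chapters.add(current.toString());
        }
        return chapters;
    }

    public static void main(String[] args) {
        String doc = "Chapter 1\nfoo\nChapter 2\nbar\n";
        List<String> parts = new ChapterSplitter().split(doc);
        System.out.println(parts.size()); // prints 2
    }
}
```

Each emitted chapter would then be pushed through the remaining stages independently, ending as its own document in the Lucene index.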
  
-  * Q: (Your question here)
+  * Q: How will the pipelines support compound files, e.g. archives, e-mail messages with
attachments (which could be archives), etc.? This could be a problem if pipelines are linear.
-  * A: 
+  * A:
  
