lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "DocumentProcessing" by JanHoydahl
Date Thu, 20 Oct 2011 12:15:43 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "DocumentProcessing" page has been changed by JanHoydahl:
http://wiki.apache.org/solr/DocumentProcessing?action=diff&rev1=18&rev2=19

Comment:
Added Behemot

  
  The built-in UpdateRequestProcessorChain is capable of doing simple processing jobs,
but it is only built for local execution on the indexer node in the same thread. This means
that any performance-heavy processing chains will slow down the indexers without any way to
scale out processing independently. We have seen FAST systems with far more servers doing
document processing than indexing.
  
+ 
+ ---- /!\ '''Edit conflict - other version:''' ----
  There are many processing pipeline frameworks from which to get inspiration, such as the
one in FAST ESP, [[http://www.openpipeline.org/|OpenPipeline]], [[http://openpipe.berlios.de/|OpenPipe]]
(now on [[https://github.com/kolstae/openpipe|GitHub]]), [[http://www.pypes.org/|Pypes]],
[[http://uima.apache.org/|UIMA]], [[http://www.eclipse.org/smila/|Eclipse SMILA]], [[http://commons.apache.org/sandbox/pipeline/|Apache
commons pipeline]], [[http://www.piped.io/|Piped]] and others. Indeed, some of these are already
being used with Solr as a pre-processing server. This means weak coupling but also weak re-use
of code. Each new project will have to choose which of the pipelines to invest in.
+ 
+ ---- /!\ '''Edit conflict - your version:''' ----
+ There are many processing pipeline frameworks from which to get inspiration, such as the
one in FAST ESP, [[http://www.openpipeline.org/|OpenPipeline]], [[http://openpipe.berlios.de/|OpenPipe]]
(now on [[https://github.com/kolstae/openpipe|GitHub]]), [[http://www.pypes.org/|Pypes]],
[[http://uima.apache.org/|UIMA]], [[http://www.eclipse.org/smila/|Eclipse SMILA]], [[http://commons.apache.org/sandbox/pipeline/|Apache
commons pipeline]], [[http://found.no/products/piped/|Piped]], [[https://github.com/jnioche/behemoth|Behemoth]]
and others. Indeed, some of these are already being used with Solr as a pre-processing server.
This means weak coupling but also weak re-use of code. Each new project will have to choose
which of the pipelines to invest in.
+ 
+ ---- /!\ '''End of edit conflict''' ----
  
  The community would benefit from an official processing framework -- and more importantly
an official repository of processing stages which are shared and reused. The sharing part
is crucial. If a company develops, say, a GeoNames stage to translate addresses into lat/lon,
the whole community can benefit from that by fetching the stage from the shared repository.
This will not happen as long as there is not one single preferred integration point.
  
- There has recently been interest in the Solr community in such a framework. See [[http://lucene-eurocon.org/slides/A-Pipeline-for-Solr_Charas-Jansson.pdf|this
presentation]] from Lucene Eurocon 2010 as well as [[http://findabilityblog.se/solr-processing-pipeline|this
blog post]] for thoughts from FindWise, as well as the recent solr-user thread [[http://search-lucene.com/m/pFegS7BQ7k2|Pipeline
for Solr]].
+ There has recently been interest in the Solr community in such a framework. See [[http://lucene-eurocon.org/slides/A-Pipeline-for-Solr_Charas-Jansson.pdf|this
presentation]] from Lucene Eurocon 2010 as well as [[http://findabilityblog.se/solr-processing-pipeline|this
blog post]] for thoughts from FindWise, as well as the recent solr-user thread [[http://search-lucene.com/m/pFegS7BQ7k2|Pipeline
for Solr]] and Cominvent's talk at Lucene Eurocon 2011 [[http://www.slideshare.net/janhoy/improving-the-solr-update-chain|Improving
Solr's Update Chain]].
  
  = Solution proposal =
  Develop a simple, scalable, easily scriptable and configurable document processing framework
for Solr, which builds on existing best practices. The framework should be simple and lightweight
enough for use with Solr single node, and powerful enough to scale out in a separate document
processing cluster, simply by changing configuration.
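
To make the proposal concrete, here is a minimal sketch of what "scale out simply by changing configuration" could look like. It is not an existing Solr API: the `Stage` interface, the two example stages and the `workers` setting are all hypothetical names chosen for illustration, and a local thread pool merely stands in for a separate document processing cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.UnaryOperator;

final class PipelineSketch {

    /** Hypothetical stage contract: take a document (a field map) and return the processed document. */
    interface Stage extends UnaryOperator<Map<String, Object>> {}

    /** Hypothetical example stage: strips surrounding whitespace from the "title" field. */
    static final Stage TRIM_TITLE = doc -> {
        Object title = doc.get("title");
        if (title instanceof String) {
            doc.put("title", ((String) title).trim());
        }
        return doc;
    };

    /** Hypothetical example stage: lowercases the "title" field. */
    static final Stage LOWERCASE_TITLE = doc -> {
        Object title = doc.get("title");
        if (title instanceof String) {
            doc.put("title", ((String) title).toLowerCase(Locale.ROOT));
        }
        return doc;
    };

    /** Applies the stages in order to a copy of the input, so the caller's map is not modified. */
    static Map<String, Object> process(List<Stage> stages, Map<String, Object> doc) {
        Map<String, Object> current = new HashMap<>(doc);
        for (Stage stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    /**
     * workers <= 0: run every stage in the calling thread (the single-node case).
     * workers > 0: hand documents to a pool of that size (a stand-in for a separate cluster).
     * The stage list is identical in both cases; only the setting changes.
     */
    static List<Map<String, Object>> processAll(
            List<Stage> stages, List<Map<String, Object>> docs, int workers) {
        List<Map<String, Object>> results = new ArrayList<>();
        if (workers <= 0) {
            for (Map<String, Object> doc : docs) {
                results.add(process(stages, doc));
            }
            return results;
        }
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Map<String, Object>>> futures = new ArrayList<>();
            for (Map<String, Object> doc : docs) {
                futures.add(pool.submit(() -> process(stages, doc)));
            }
            for (Future<Map<String, Object>> future : futures) {
                results.add(future.get()); // preserves input order
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while processing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("stage failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("title", "  Hello Solr ");
        List<Stage> stages = Arrays.asList(TRIM_TITLE, LOWERCASE_TITLE);
        List<Map<String, Object>> docs = new ArrayList<>();
        docs.add(doc);
        System.out.println(processAll(stages, docs, 0)); // inline
        System.out.println(processAll(stages, docs, 4)); // pooled
    }
}
```

The point of the sketch is that stages know nothing about where they run, which is what would allow the same shared stages to be reused on a single node and in a scaled-out deployment.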
