lucene-solr-commits mailing list archives

From: Apache Wiki <wikidi...@apache.org>
Subject: [Solr Wiki] Trivial Update of "DocumentProcessing" by JanHoydahl
Date: Mon, 06 Sep 2010 22:37:57 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "DocumentProcessing" page has been changed by JanHoydahl.
The comment on this change is: Removing unwanted wikilinks.
http://wiki.apache.org/solr/DocumentProcessing?action=diff&rev1=1&rev2=2

--------------------------------------------------

  
  There are many processing pipeline frameworks from which to get inspiration, such as the one in FAST ESP, [[http://www.openpipeline.org/|OpenPipeline]], [[http://openpipe.berlios.de/|OpenPipe]], [[http://www.pypes.org/|Pypes]], [[http://uima.apache.org/|UIMA]] and others. Indeed, many of these are already used with Solr as pre-processing servers.
  
  However, the Solr community needs one single solution and, more importantly, a repository of processing stages that can be shared and reused. The sharing part is crucial. If a company develops, say, a GeoNames stage that translates addresses into lat/lon, the whole community can benefit by fetching that stage from the shared repository. This will not happen as long as there is no single preferred integration point.
  
  There has recently been interest in the Solr community for such a framework. See [[http://lucene-eurocon.org/slides/A-Pipeline-for-Solr_Charas-Jansson.pdf|this presentation]] from Lucene Eurocon 2010 for thoughts from Findwise; now is the time to move.
  
@@ -26, +26 @@

   * Java based
   * Lightweight
   * Support for multiple named pipelines, addressable at document ingestion
   * Must work with existing RequestHandlers (XML, CSV, DIH, Binary, etc.) as entry point
   * Support for metadata on document and field level (e.g. tokenized=true, language=en)
   * Allow scaling out processing to multiple dedicated servers for heavy tasks
   * Well-defined API for the processing stages (see the interface sketch below this list)
@@ -48, +48 @@

   * Wrappers for custom FAST ESP stages to work with minor modification
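
  A rough illustration of the stage API requirement above, sketched under the assumption that stages consume and emit SolrInputDocuments; the interface name and its methods are hypothetical and do not exist in Solr today:
{{{
// Hypothetical stage API, for discussion only.
import java.util.Map;
import org.apache.solr.common.SolrInputDocument;

public interface PipelineStage {
  /** Called once with the stage's configuration before any documents flow. */
  void init(Map<String, String> config);

  /** Process one document; return it (possibly modified), or null to drop it. */
  SolrInputDocument process(SolrInputDocument doc) throws Exception;
}
}}}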
  
  = Proposed architecture =
  Hook into the context of the existing UpdateRequestProcessorChain (integrated in ContentStreamHandlerBase) by providing a dispatcher class, SolrPipelineDispatcher. The dispatcher would be enabled and configured through the update parameters pipeline.name and pipeline.mode, either from the update request or in solrconfig.xml.
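
  As a concrete sketch, defaults could be declared on the update handler in solrconfig.xml and overridden per request (e.g. /update?pipeline.name=...&pipeline.mode=local). The pipeline name below is made up; only the handler class and the defaults mechanism are standard Solr:
{{{
<!-- Sketch: default pipeline parameters in solrconfig.xml -->
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="pipeline.name">mycompany-pipeline</str>
    <str name="pipeline.mode">local</str>
  </lst>
</requestHandler>
}}}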
  
  SolrPipelineDispatcher would have two modes: "local" and "distributed". In local mode, the pipeline executes locally, and the processor chain completes with RunUpdateProcessorFactory submitting the content to the local index. This would work well for single-node as well as low-load scenarios.
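
  A minimal sketch of how the dispatcher might branch on pipeline.mode; Pipeline, registry, workerClient and the surrounding class are hypothetical, while SolrParams and the update-processor chain are existing Solr machinery:
{{{
// Sketch only: mode switch inside a hypothetical SolrPipelineDispatcher.
String mode = params.get("pipeline.mode", "local");
Pipeline pipeline = registry.lookup(params.get("pipeline.name"));
if ("local".equals(mode)) {
  // Run the stages in-process; the normal chain then finishes with
  // RunUpdateProcessor writing to the local index.
  pipeline.process(cmd.getSolrInputDocument());
  next.processAdd(cmd);
} else {
  // "distributed": stream the unprocessed document to a remote worker node.
  workerClient.send(cmd, params);
}
}}}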
  
  The "distributed" mode would enable more advanced dispatching (streaming) to a cluster of
remote worker nodes who executes the actual pipeline. This means that indexing will not (necessarily)
happen locally. Thus we introduce the possibility for a Solr node which takes on the role
of RequestHandler + Dispatcher only. 
  
  On the remote end, there will be a Solr installation with a new PipelineRequestHandler (cmd=processPipeline) which receives a stream of update requests and executes the correct pipeline. When the pipeline has finished executing, the resulting documents enter the SolrPipelineDispatcher again and get dispatched to the correct shard for indexing. For this to work, the shard ID must be configured or calculated somewhere (sounds like a good time to introduce general distributed indexing!).
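
  A skeleton of how such a handler might route on the cmd parameter; the handler itself and both branches are hypothetical, only the request/response types are standard Solr:
{{{
// Sketch only: hypothetical PipelineRequestHandler routing on cmd.
public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
    throws Exception {
  String cmd = req.getParams().get("cmd", "processPipeline");
  if ("processPipeline".equals(cmd)) {
    // Worker role: execute the named pipeline, then hand the processed
    // documents back to the dispatcher for shard routing (cmd=index).
  } else if ("index".equals(cmd)) {
    // Shard-master role: documents arrive fully processed; index locally.
  }
}
}}}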
  
  The shard masters that are the final targets for the pipeline will then receive the processed documents through the PipelineRequestHandler (cmd=index) and finalize indexing.
  
  The pipeline itself could be based on [[http://commons.apache.org/sandbox/pipeline/|Apache Commons Pipeline]] or on code from one of the other existing pipeline projects. A benefit of Commons Pipeline is that it is already an Apache library built for scalability; however, it may need to be adapted to suit our needs.
  
