lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "DocumentProcessing" by JanHoydahl
Date Mon, 18 Apr 2011 16:48:55 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "DocumentProcessing" page has been changed by JanHoydahl.
The comment on this change is: Simplified.
http://wiki.apache.org/solr/DocumentProcessing?action=diff&rev1=9&rev2=10

--------------------------------------------------

   * Java based
   * Lightweight (not over-engineered)
   * Support for multiple named pipelines, addressable at document ingestion
-  * Easy drop-in integration with existing Solr installs, i.e. called from UpdateProcessor
   * Support for metadata on document and field level (e.g. tokenized=true, language=en)
   * Allow scaling out processing to multiple dedicated servers for heavy tasks
   * Well defined API for the processing stages
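To make the "well defined API for the processing stages" requirement concrete, here is a minimal sketch of what such a stage API and a document model with field-level metadata could look like. All type and method names here are invented for illustration; none of them exist in Solr today.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical document model: field values plus per-field metadata
// (e.g. tokenized=true, language=en), as listed in the requirements.
class PipelineDocument {
    final Map<String, Object> fields = new HashMap<>();
    final Map<String, Map<String, String>> fieldMetadata = new HashMap<>();

    void setField(String name, Object value) { fields.put(name, value); }
    Object getField(String name) { return fields.get(name); }

    void setFieldMeta(String field, String key, String value) {
        fieldMetadata.computeIfAbsent(field, f -> new HashMap<>()).put(key, value);
    }
}

// Hypothetical stage contract: transform a document in place,
// throwing to signal a failure that the framework must handle.
interface ProcessingStage {
    void process(PipelineDocument doc) throws Exception;
}

// Example stage: tag a language on the "body" field (detection stubbed out).
class LanguageTaggerStage implements ProcessingStage {
    public void process(PipelineDocument doc) {
        if (doc.getField("body") != null) {
            doc.setFieldMeta("body", "language", "en"); // stub: always English
        }
    }
}
```

A pipeline would then simply be an ordered list of `ProcessingStage` instances applied to each document.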
@@ -39, +38 @@

  
  === Should ===
   * Function as a standalone data integration framework outside the context of Solr
+  * Allow drop-in integration with existing Solr installs, i.e. called from UpdateProcessor
+  * Accept documents from any Solr client including [[http://incubator.apache.org/connectors/|ManifoldCF]]
   * Support for writing stages in JVM scripting languages such as Jython
   * Robust - if a batch fails, it should re-schedule to another processor
   * Optimize for performance through e.g. batch support
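The robustness point above (re-scheduling a failed batch to another processor) could be modeled as shown below. This is purely an illustration with invented names; a real implementation would also track node health, retry budgets, and back-off.

```java
import java.util.List;
import java.util.function.Consumer;

// Illustrative only: try each available processor node in turn
// until one accepts the batch.
class BatchDispatcher {
    private final List<Consumer<List<String>>> processors;

    BatchDispatcher(List<Consumer<List<String>>> processors) {
        this.processors = processors;
    }

    // Returns the index of the processor that handled the batch, or -1
    // if every node failed.
    int dispatch(List<String> batch) {
        for (int i = 0; i < processors.size(); i++) {
            try {
                processors.get(i).accept(batch);
                return i;
            } catch (RuntimeException e) {
                // Node failed: fall through and re-schedule to the next one.
            }
        }
        return -1;
    }
}
```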
@@ -60, +61 @@

  * Do not over-architect as Eclipse SMILA and others have done with ESBs, etc.
  
  = Proposed architecture =
- The core pipeline and Processor SDK should be self-contained and not depend on Solr APIs. A good starting point for the core pipeline could be the Apache-licensed [[http://openpipe.berlios.de/|OpenPipe]], which already works stand-alone. We could add GUI config and scalability to this code base.
+ A good starting point for the core pipeline could be the Apache-licensed [[http://openpipe.berlios.de/|OpenPipe]], which already works stand-alone. We would add a configuration API and a GUI.
  
- Glue code to hook the pipeline into Solr could be an UpdateRequestProcessor, e.g. Pipeline``Dispatcher``Processor
(or deeper through Content``Stream``Handler``Base?). The dispatcher would be enabled and configured
through update parameters, e.g. pipeline.name and pipeline.mode, either from the update request
or in solrconfig.xml.
+ Glue code to hook the pipeline into Solr could be an UpdateRequestProcessor that works either in "local" mode, executing the pipeline in-thread, or in "distributed" mode, dispatching the batch to an available node in a document processing cluster.
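Such a processor could be wired into an update chain in solrconfig.xml roughly as follows. This is illustrative only: the factory class does not exist, and the `pipeline.name`/`pipeline.mode` parameter names are taken from this proposal, not from shipping Solr code.

```xml
<!-- Hypothetical wiring; PipelineDispatcherProcessorFactory is not real Solr code. -->
<updateRequestProcessorChain name="pipeline">
  <processor class="com.example.PipelineDispatcherProcessorFactory">
    <str name="pipeline.name">newsPipeline</str>
    <str name="pipeline.mode">local</str> <!-- or "distributed" -->
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The same parameters could also be passed per update request, overriding the solrconfig.xml defaults.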
  
+ I envision that the whole pipeline could (in addition to running standalone) be wrapped in a Solr RequestHandler, i.e. a document-processing-only node would be an instance of Solr with a BinaryDocumentRequestHandler and no local index. When processing is finished, the documents are routed to their final destination for indexing (perhaps using [[https://issues.apache.org/jira/browse/SOLR-2358|SOLR-2358]]).
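To make the two proposed modes concrete, here is a minimal standalone model of the local/distributed split. It uses no Solr APIs; every name is invented for illustration.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Illustrative model of the two proposed modes; not Solr code.
class PipelineRunner {
    enum Mode { LOCAL, DISTRIBUTED }

    private final List<UnaryOperator<String>> stages;
    private final UnaryOperator<String> remoteNode; // stands in for a remote processing node

    PipelineRunner(List<UnaryOperator<String>> stages, UnaryOperator<String> remoteNode) {
        this.stages = stages;
        this.remoteNode = remoteNode;
    }

    String run(String doc, Mode mode) {
        if (mode == Mode.DISTRIBUTED) {
            // Distributed: ship the document to a processing-only node.
            return remoteNode.apply(doc);
        }
        // Local: execute every stage in-thread before indexing.
        String out = doc;
        for (UnaryOperator<String> stage : stages) {
            out = stage.apply(out);
        }
        return out;
    }
}
```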
- Pipeline``Dispatcher``Processor would have two possible modes: "local" and "distributed".
In case of local mode, the pipeline executes locally in-thread and results in the ProcessorChain
being completed with RunUpdateProcessorFactory submitting the content to local index. This
would work well for single-node as well as low load scenarios. Local mode is easiest to implement
and could be phase one.
- 
- == Distributed mode ==
- The "distributed" mode would enable more advanced dispatching (streaming) to a cluster of
remote worker nodes which execute the actual pipeline. This means that indexing will not happen
locally. Thus a Solr node can take the role as RequestHandler + Pipeline``Dispatcher only,
or as a Document Processor only. The dispatcher streams output to a Request``Handler on the
processing node. When the pipeline has finished executing, the resulting documents enter the
Pipeline``Dispatcher again and get routed to the correct shard for indexing (also see [[https://issues.apache.org/jira/browse/SOLR-2358|SOLR-2358]]).
As we can tell, there is major development effort required to support distributed pipelines!
  
  = Risks =
   * Automated distributed indexing [[https://issues.apache.org/jira/browse/SOLR-2358|SOLR-2358]]
needs to work with this
