lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "DocumentProcessing" by JanHoydahl
Date Mon, 18 Apr 2011 16:52:16 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "DocumentProcessing" page has been changed by JanHoydahl.
http://wiki.apache.org/solr/DocumentProcessing?action=diff&rev1=10&rev2=11

--------------------------------------------------

   * Lightweight (not over-engineered)
   * Support for multiple named pipelines, addressable at document ingestion
   * Support for metadata on document and field level (e.g. tokenized=true, language=en)
-  * Allow scaling out processing to multiple dedicated servers for heavy tasks
+  * Allow scaling out processing to multiple dedicated servers for heavy tasks. Cloud-friendly
-  * Well defined API for the processing stages
+  * Well defined API and SDK for the processing stages
   * Easy configuration of pipelines through separate config and GUI
  
  === Should ===
   * Function as a standalone data integration framework outside the context of Solr
-  * Allow drop-in integration with existing Solr installs, i.e. called from UpdateProcessor
-  * Accept documents from any Solr client including [[http://incubator.apache.org/connectors/|ManifoldCF]]
+  * Allow drop-in integration with existing Solr installs, i.e. accept documents from any
Solr client including [[http://incubator.apache.org/connectors/|ManifoldCF]]
   * Support for writing stages in JVM scripting languages such as Jython
   * Robust - if a batch fails, it should re-schedule to another processor
   * Optimize for performance through e.g. batch support
@@ -57, +56 @@

   * Wrappers for custom UpdateProcessor stages to work with minor modification
  
  = Anti-patterns =
-  * Do not require new APIs, but allow integration inside Solr and feeding through existing
Update``Request``Handlers
-  * Do not over-architect, as Eclipse SMILA and others have done with ESB etc
+  * Do not over-architect, as Eclipse SMILA and others have done by going crazy with ESB
etc
  
  = Proposed architecture =
  A good starting point for the core (standalone) pipeline could be the Apache-licensed [[http://openpipe.berlios.de/|OpenPipe]],
which already works stand-alone. Add some config API and GUI.
  
  Glue code to hook the pipeline into Solr could be an UpdateRequestProcessor which can either
work in "local" mode, executing the pipeline locally in-thread, or in "distributed" mode which
would dispatch the batch to an available node in a document processing cluster.
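
The "local" versus "distributed" split described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposed implementation: `Doc`, `Stage`, `Pipeline`, `RemoteNode` and `PipelineDispatcher` are hypothetical names that exist in neither Solr nor OpenPipe, and the round-robin node selection merely stands in for whatever "available node" logic a real processing cluster would use.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical document: field name -> value. Not a Solr class. */
final class Doc {
    final Map<String, Object> fields = new HashMap<>();
}

/** Hypothetical stage API: a stage modifies one document in place. */
interface Stage {
    void process(Doc doc);
}

/** A named, ordered list of stages. */
final class Pipeline {
    final String name;
    private final List<Stage> stages = new ArrayList<>();

    Pipeline(String name) {
        this.name = name;
    }

    Pipeline add(Stage stage) {
        stages.add(stage);
        return this;
    }

    /** Runs every stage, in order, on every document of the batch. */
    List<Doc> run(List<Doc> batch) {
        for (Doc doc : batch) {
            for (Stage stage : stages) {
                stage.process(doc);
            }
        }
        return batch;
    }
}

/** Stand-in for a remote processing node; a real one would go over the network. */
interface RemoteNode {
    List<Doc> submit(String pipelineName, List<Doc> batch);
}

/** Glue code: run the pipeline in-thread, or hand the batch to a node. */
final class PipelineDispatcher {
    enum Mode { LOCAL, DISTRIBUTED }

    private final Mode mode;
    private final Pipeline pipeline;
    private final List<RemoteNode> nodes;
    private int next = 0; // round-robin cursor; not thread-safe in this sketch

    PipelineDispatcher(Mode mode, Pipeline pipeline, List<RemoteNode> nodes) {
        this.mode = mode;
        this.pipeline = pipeline;
        this.nodes = nodes;
    }

    List<Doc> dispatch(List<Doc> batch) {
        if (mode == Mode.LOCAL) {
            return pipeline.run(batch); // execute locally, in the calling thread
        }
        if (nodes.isEmpty()) {
            throw new IllegalStateException("no processing nodes available");
        }
        RemoteNode node = nodes.get(next);
        next = (next + 1) % nodes.size();
        return node.submit(pipeline.name, batch); // delegate the whole batch
    }
}
```

In an actual Solr integration, the `dispatch` call would presumably be made from the UpdateRequestProcessor mentioned above; that wiring is left out here because it depends on the Solr version and on design decisions the page has not yet made.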
  
- I envision that the whole pipeline could (in addition to running standalone) be wrapped
in a Solr RequestHandler i.e. a Document-processing-only node would be an instance of Solr
with a BinaryDocumentRequestHandler, without a local index. When processing is finished, the
documents are routed to the final destination for indexing  (perhaps using [[https://issues.apache.org/jira/browse/SOLR-2358|SOLR-2358]]).
+ I envision that the whole pipeline could (in addition to running standalone) be wrapped
in a Solr RequestHandler i.e. a Document-processing-only node would be an instance of Solr
with a new BinaryDocumentRequestHandler, without a local index. When processing is finished,
the documents are routed to the final destination for indexing (perhaps using [[https://issues.apache.org/jira/browse/SOLR-2358|SOLR-2358]]).
  
  = Risks =
   * Automated distributed indexing [[https://issues.apache.org/jira/browse/SOLR-2358|SOLR-2358]]
needs to work with this
