lucene-solr-commits mailing list archives

From Apache Wiki <>
Subject [Solr Wiki] Update of "DocumentProcessing" by JanHoydahl
Date Mon, 18 Apr 2011 18:43:41 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "DocumentProcessing" page has been changed by JanHoydahl.
The comment on this change is: Clarification.


  = Anti-patterns =
   * Do not over-architect, as Eclipse SMILA and others have done by going crazy with ESB
+  * Do not try to be a connector framework as well. Let ManifoldCF do that job. Focus on
the pipeline!
+  * Do not keep the source private (although Apache licensed) as DieselPoint did with OpenPipeline.
Create a community!
  = Proposed architecture =
@@ -66, +68 @@

  Glue code to hook the pipeline into Solr could be an UpdateRequestProcessor which can either
work in "local" mode, executing the pipeline locally in-thread, or in "distributed" mode which
would dispatch the batch to an available node in a document processing cluster.
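A minimal sketch of the "local" versus "distributed" switch described above, in plain Java. Nothing in it is Solr API: PipelineDispatcher, Mode, ProcessingNode and the Map-based document model are hypothetical stand-ins chosen for illustration, and real glue code would presumably sit inside an UpdateRequestProcessor subclass.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical sketch only; none of these types exist in Solr. */
final class PipelineDispatcher {

    /** The two modes named in the proposal. */
    enum Mode { LOCAL, DISTRIBUTED }

    /** Hypothetical stand-in for a node in the document processing cluster. */
    interface ProcessingNode {
        boolean isAvailable();
        List<Map<String, Object>> process(List<Map<String, Object>> batch);
    }

    private final Mode mode;
    private final List<UnaryOperator<Map<String, Object>>> stages;
    private final List<ProcessingNode> nodes;

    PipelineDispatcher(Mode mode,
                       List<UnaryOperator<Map<String, Object>>> stages,
                       List<ProcessingNode> nodes) {
        this.mode = mode;
        this.stages = new ArrayList<>(stages);
        this.nodes = new ArrayList<>(nodes);
    }

    /** Runs the batch in-thread (LOCAL) or hands it to the first available node (DISTRIBUTED). */
    List<Map<String, Object>> processBatch(List<Map<String, Object>> batch) {
        if (mode == Mode.LOCAL) {
            return runLocally(batch);
        }
        for (ProcessingNode node : nodes) {
            if (node.isAvailable()) {
                return node.process(batch);
            }
        }
        throw new IllegalStateException("no document processing node available");
    }

    /** Applies every pipeline stage, in order, to a copy of each document. */
    private List<Map<String, Object>> runLocally(List<Map<String, Object>> batch) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> doc : batch) {
            Map<String, Object> current = new HashMap<>(doc);
            for (UnaryOperator<Map<String, Object>> stage : stages) {
                current = stage.apply(current);
            }
            out.add(current);
        }
        return out;
    }
}
```

In this sketch the choice of node is simply "first available"; the proposal does not say how an available node would be selected, so that policy is an assumption.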
  I envision that the whole pipeline could (in addition to running standalone) be wrapped
in a Solr RequestHandler i.e. a Document-processing-only node would be an instance of Solr
with a new BinaryDocumentRequestHandler, without a local index. When processing is finished,
the documents are routed to the final destination for indexing (perhaps using [[|SOLR-2358]]).
+ The architecture diagram above shows the local and the fully distributed cases. Another
option would be to round-robin feeding to the set of pipeline nodes directly (not needing
a BinaryDocumentRequestHandler), and letting them do the distributed indexing as the last
  = Risks =
   * Automated distributed indexing [[|SOLR-2358]]
needs to work with this
