lucene-solr-commits mailing list archives

From Apache Wiki <>
Subject [Solr Wiki] Update of "DocumentProcessing" by JanHoydahl
Date Mon, 06 Sep 2010 22:26:39 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "DocumentProcessing" page has been changed by JanHoydahl.
The comment on this change is: Starting the planning of a new pipeline architecture.


New page:
= Problem =
Solr needs a flexible document processing framework that meets the requirements of enterprise-grade
content integration. Most search projects need to process incoming content prior to indexing, for example:
 * Language identification
 * Text extraction (Tika)
 * Entity extraction and classification
 * Data normalization and cleansing
 * 3rd party systems integration (e.g. enrich document from external source)
 * etc

The built-in UpdateRequestProcessorChain is a very good starting point, as it is an integral
part of the RequestHandler architecture. However, the chain is very simple, single-threaded
and built only for local execution on the indexer. This means that any performance-heavy processing
chain will slow down the whole indexer, with no way to scale out processing independently.
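
For reference, a custom stage today plugs into that chain roughly like this (a minimal sketch against the standard UpdateRequestProcessor API; the class name and the language-tagging logic are made up for illustration):

{{{
import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse; // org.apache.solr.request in older releases
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

/** Illustrative single stage plugged into the existing update chain. */
public class LanguageTagProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        // Placeholder for real language identification - just tags a default value.
        if (doc.getFieldValue("language") == null) {
          doc.setField("language", "en");
        }
        super.processAdd(cmd); // hand over to the next processor in the chain
      }
    };
  }
}
}}}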

There are many processing pipeline frameworks from which to get inspiration, such as the one
in FAST ESP, OpenPipeline, OpenPipe, Pypes, UIMA and others. Indeed, many
of these are already being used with Solr as a pre-processing server.

However, the Solr community needs one single solution and, more importantly, a repository of
processing stages which can be shared and reused. The sharing part is crucial. If a company
develops, say, a ~GeoNames stage to translate addresses into lat/lon coordinates, the whole community can
benefit by fetching the stage from the shared repository. This will not happen as
long as there is no single preferred integration point.

There has recently been interest in the Solr community in such a framework. See the
presentation from Findwise at Lucene Eurocon 2010 for their thoughts; now is the time to get started.

= Solution =
Develop a simple, scalable, easily scriptable and configurable document processing framework
for Solr which builds on existing best practices. The framework should be simple and lightweight
enough for use on a single Solr node, and powerful enough to scale out to a separate document
processing cluster simply by changing configuration.

== Key requirements ==
=== Must ===
 * Apache licensed
 * Java based
 * Lightweight
 * Support for multiple named pipelines, addressable at document ingestion
 * Must work with existing RequestHandlers (XML, CSV, DIH, Binary etc) as entry point
 * Support for metadata on document and field level (e.g. tokenized=true, language=en)
 * Allow scaling out processing to multiple dedicated servers for heavy tasks
 * Well defined API for the processing stages (see the sketch after this list)
 * Easy configuration of pipelines through separate XML (not in solrconfig.xml)
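
As a rough illustration of what such a stage API with document- and field-level metadata could look like (all names here are hypothetical, nothing below exists in Solr today):

{{{
import java.util.Map;

/**
 * Hypothetical stage contract for the proposed framework - illustrative only.
 */
public interface PipelineStage {
  /** Process one document; returning null drops it from the pipeline. */
  PipelineDocument process(PipelineDocument doc);
}

/** Hypothetical document abstraction carrying field values plus metadata. */
interface PipelineDocument {
  Object getFieldValue(String field);
  void setFieldValue(String field, Object value);
  /** Document-level metadata, e.g. language=en. */
  Map<String, String> getMetadata();
  /** Field-level metadata, e.g. tokenized=true for the "body" field. */
  Map<String, String> getFieldMetadata(String field);
}
}}}

A ~GeoNames-style enrichment stage like the one mentioned above would then be a single class implementing PipelineStage, shareable through the stage repository.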

=== Should ===
 * Support for writing stages in JVM scripting languages such as Jython (see the sketch after this list)
 * Robust: if a batch fails, it should be re-scheduled to another processor
 * Optimize for performance through e.g. batch support
 * Support status callbacks to the client
 * SDK for stage developers - to encourage stage development
 * Separate stage repository (outside of ASF svn) to encourage sharing
 * Integration points for UIMA, LingPipe, OpenNLP
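
A scripting-capable stage could, for instance, be a thin wrapper around the standard javax.script (JSR-223) API. The sketch below assumes the hypothetical PipelineStage/PipelineDocument interfaces from the previous section:

{{{
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

/**
 * Hypothetical wrapper allowing a stage to be written in a JVM scripting
 * language via the standard javax.script (JSR-223) API.
 */
public class ScriptedStage implements PipelineStage {
  private final ScriptEngine engine;
  private final String script;

  public ScriptedStage(String engineName, String script) {
    // e.g. "python" for Jython or "javascript" for Rhino, depending on the classpath
    this.engine = new ScriptEngineManager().getEngineByName(engineName);
    this.script = script;
  }

  public PipelineDocument process(PipelineDocument doc) {
    try {
      engine.put("doc", doc); // expose the document to the script
      engine.eval(script);    // the script mutates "doc" in place
      return doc;
    } catch (ScriptException e) {
      throw new RuntimeException("Scripted stage failed", e);
    }
  }
}
}}}

This keeps scripted and compiled stages interchangeable from the pipeline's point of view.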

=== Could ===
 * GUI for configuring pipelines
 * Hot pluggable pipelines
 * Function as a standalone data integration framework outside the context of Solr
 * Wrappers for custom FAST ESP stages to work with minor modification

= Proposed architecture =
Hook into the context of the existing ~UpdateRequestProcessorChain (integrate in ~ContentStreamHandlerBase)
by providing a dispatcher class, SolrPipelineDispatcher. The dispatcher would be enabled and
configured through update parameters such as pipeline.mode, supplied either on the update
request or as defaults in solrconfig.xml.
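
A minimal sketch of how such a dispatcher could hook in, assuming it is registered as a processor factory in the chain (the pipeline.* parameter names and the mode handling are part of this proposal, not existing Solr code):

{{{
import java.io.IOException;

import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse; // org.apache.solr.request in older releases
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

/**
 * Hypothetical sketch of the proposed dispatcher hook, written as a processor
 * factory so it can sit in the existing update chain.
 */
public class SolrPipelineDispatcherFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    // pipeline.mode (and e.g. pipeline.name) come from the update request,
    // or from solrconfig.xml defaults for the handler.
    final String mode = req.getParams().get("pipeline.mode", "local");

    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        if ("distributed".equals(mode)) {
          // Distributed mode: the document would be streamed to a remote worker
          // node running the named pipeline; indexing then happens on the
          // target shard, not here.
        } else {
          // Local mode: the pipeline stages would run here in-process, then the
          // rest of the chain (ending in RunUpdateProcessorFactory) indexes locally.
          super.processAdd(cmd);
        }
      }
    };
  }
}
}}}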

SolrPipelineDispatcher would have two modes: "local" and "distributed". In local mode,
the pipeline executes locally and the processor chain completes with RunUpdateProcessorFactory
submitting the content to the local index. This works well for single-node as well as low-load
scenarios.

The "distributed" mode would enable more advanced dispatching (streaming) to a cluster of
remote worker nodes who executes the actual pipeline. This means that indexing will not (necessarily)
happen locally. Thus we introduce the possibility for a Solr node which takes on the role
of RequestHandler + Dispatcher only. 

On the remote end, there will be a Solr installation with a new PipelineRequestHandler (cmd=processPipeline)
which receives a stream of update requests and executes the correct pipeline. When the pipeline
has finished executing, the resulting documents enter the SolrPipelineDispatcher again and
get dispatched to the correct shard for indexing. For this to work, the shard ID must be
configured or calculated somewhere (which sounds like a good time to introduce general distributed
indexing).

The shard masters which are the final targets for the pipeline will then receive the processed
documents through the PipelineRequestHandler (cmd=index) and finalize indexing.
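
A skeleton of the receiving handler might look roughly like this (targeting the current 1.4-era RequestHandlerBase API; the cmd values mirror the flow above, and the real pipeline execution and shard dispatch logic do not exist yet and are only described in comments):

{{{
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse; // org.apache.solr.request in older releases

/** Hypothetical skeleton of the proposed PipelineRequestHandler. */
public class PipelineRequestHandler extends RequestHandlerBase {

  @Override
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
    String cmd = req.getParams().get("cmd", "processPipeline");

    if ("processPipeline".equals(cmd)) {
      // Worker node: would run the named pipeline over the streamed update
      // requests, then re-dispatch the processed documents to the correct
      // shard master using cmd=index.
    } else if ("index".equals(cmd)) {
      // Shard master: final target - would hand the processed documents to the
      // local update chain for indexing.
    }
  }

  // SolrInfoMBean boilerplate required by the current (1.4-era) base class:
  @Override public String getDescription() { return "Pipeline processing and dispatch (sketch)"; }
  @Override public String getSource()      { return "$URL$"; }
  @Override public String getSourceId()    { return "$Id$"; }
  @Override public String getVersion()     { return "$Revision$"; }
}
}}}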

The pipeline itself could be based on Apache Commons Pipeline or on code from one of the other
existing pipeline projects. The benefit of Commons Pipeline is that it is already an Apache library
built for scalability; however, it may need to be adapted to suit our needs.

= Risks =
 * Automated distributed indexing is a larger problem
 * Introducing multiple worker nodes introduces sequencing issues and potential deadlocks
 * Need sophisticated dispatching and scheduling code to make the solution robust and fault tolerant

= Q&A =
Q: Your question here
A: Answer here
