lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "DocumentProcessing" by JanHoydahl
Date Tue, 21 Sep 2010 01:18:31 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "DocumentProcessing" page has been changed by JanHoydahl.
The comment on this change is: Cleaned up some paragraphs, added link to SMILA.
http://wiki.apache.org/solr/DocumentProcessing?action=diff&rev1=4&rev2=5

--------------------------------------------------

  (This page is a child of the TaskList page)
  
  = Problem =
- Solr needs a flexible document processing framework meeting the requirements of enterprise
grade content integration. Most search projects have some need for processing the incoming
content prior to indexing, for example:
+ Solr would benefit from a flexible document processing framework meeting the requirements
of enterprise-grade content integration. Most search projects have some need for processing
the incoming content prior to indexing, for example:
   * Language identification
   * Text extraction (Tika)
-  * Entity extraction and classification
+  * Entity extraction and classification (e.g. UIMA)
   * Data normalization and cleansing
   * 3rd party systems integration (e.g. enrich document from external source)
   * etc
  
- The built-in UpdateRequestProcessorChain is a very good starting point, as it is an integral
part of the RequestHandler architecture. However, the chain is very simple, single-threaded
and only built for local execution on the indexer. This means that any performance heavy processing
chains will slow down the whole indexer without any way to scale out processing independently.
+ The built-in UpdateRequestProcessorChain is a very good starting point. However, the chain
is very simple, single-threaded and only built for local execution on the indexer node. This
means that any performance-heavy processing chain will slow down the indexers without any
way to scale out processing independently. We have seen FAST systems with far more servers
doing document processing than indexing.
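
For illustration, here is a minimal sketch of what a stage in the existing chain looks like today.
Only the UpdateRequestProcessor / UpdateRequestProcessorFactory API is existing Solr code; the
factory name and the "title" field are made up for this example, and exact package names may
differ slightly between Solr versions.

{{{#!java
import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

// Hypothetical example stage: trims whitespace from the "title" field.
public class NormalizeTitleProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object title = doc.getFieldValue("title");
        if (title != null) {
          doc.setField("title", title.toString().trim()); // simple in-place normalization
        }
        super.processAdd(cmd); // hand the document to the next processor in the chain
      }
    };
  }
}
}}}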
  
- There are many processing pipeline frameworks from which to get inspiration, such as the
one in FAST ESP, [[http://www.openpipeline.org/|OpenPipeline]], [[http://openpipe.berlios.de/|OpenPipe]],
[[http://www.pypes.org/|Pypes]], [[http://uima.apache.org/|UIMA]] and others. Indeed, many
of these are already being used with Solr as a pre-processing server. 
+ There are many processing pipeline frameworks from which to get inspiration, such as the
one in FAST ESP, [[http://www.openpipeline.org/|OpenPipeline]], [[http://openpipe.berlios.de/|OpenPipe]],
[[http://www.pypes.org/|Pypes]], [[http://uima.apache.org/|UIMA]], [[http://www.eclipse.org/smila/|Eclipse
SMILA]] and others. Indeed, some of these are already being used with Solr as a pre-processing
server. This means weak coupling but also weak re-use of code. Each new project will have
to choose which of the pipelines to invest in.
  
- However, the Solr community needs one single solution and more importantly a repository
of processing stages which can be shared and reused. The sharing part is crucial. If a company
develops, say a Geo``Names stage to translate address into lat/lon, the whole community can
benefit from that by fetching the stage from the shared repository. This will not happen as
long as there is not one single preferred integration point.
+ The community would benefit from an official processing framework and, more importantly, an
official repository of processing stages which are shared and reused. The sharing part is
crucial. If a company develops, say, a Geo``Names stage to translate addresses into lat/lon,
the whole community can benefit by fetching the stage from the shared repository.
This will not happen as long as there is no single preferred integration point.
  
- There have recently been interest in the Solr community for such a framework. See [[http://lucene-eurocon.org/slides/A-Pipeline-for-Solr_Charas-Jansson.pdf|this
presentation]] from Lucene Eurocon 2010 for thoughts from Find``Wise.
+ There has recently been interest in the Solr community in such a framework. See [[http://lucene-eurocon.org/slides/A-Pipeline-for-Solr_Charas-Jansson.pdf|this
presentation]] from Lucene Eurocon 2010 as well as [[http://findabilityblog.se/solr-processing-pipeline|this
blog post]] for thoughts from Find``Wise.
  
  = Solution =
  Develop a simple, scalable, easily scriptable and configurable document processing framework
for Solr, which builds on existing best practices. The framework should be simple and lightweight
enough for use on a single Solr node, yet powerful enough to scale out to a separate document
processing cluster simply by changing configuration.
@@ -26, +26 @@

  === Must ===
   * Apache licensed
   * Java based
-  * Lightweight
+  * Lightweight (not over-engineered)
   * Support for multiple named pipelines, addressable at document ingestion
   * Must work with existing Request``Handlers (XML, CSV, DIH, Binary etc) as entry point
   * Allow use as a drop-in feature in existing installs (after upgrading to the needed Solr version)
@@ -53, +53 @@

   * Wrappers for custom FAST ESP stages to work with minor modification
  
  = Anti-patterns =
-  * Do not require all new APIs
+  * Do not require new APIs, but allow feeding through existing Update``Request``Handlers
  
  = Proposed architecture =
  Hook into the context of the existing UpdateRequestProcessorChain (integrate in Content``Stream``Handler``Base)
by providing a dispatcher class, Solr``Pipeline``Dispatcher. The dispatcher would be enabled
and configured through update parameters pipeline.name and pipeline.mode, either from the
update request or in solrconfig.xml.
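
As a rough sketch, the dispatcher could read those parameters roughly as below. The class and
method names are hypothetical, following the proposal above; only SolrQueryRequest, SolrParams
and the update processor classes are existing Solr code.

{{{#!java
import java.io.IOException;

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

// Hypothetical dispatcher from the proposal; it does not exist in Solr yet.
public class SolrPipelineDispatcher {

  public void dispatch(SolrQueryRequest req, AddUpdateCommand cmd,
                       UpdateRequestProcessor next) throws IOException {
    SolrParams params = req.getParams();
    String pipelineName = params.get("pipeline.name", "default");
    String pipelineMode = params.get("pipeline.mode", "local");

    if ("local".equals(pipelineMode)) {
      // Phase one: run the named pipeline in this JVM, then let the rest of the
      // chain (ending in RunUpdateProcessorFactory) index the document locally.
      runLocalPipeline(pipelineName, cmd);
      next.processAdd(cmd);
    } else {
      // "distributed": stream the document to a remote processing node instead.
      streamToProcessingCluster(pipelineName, cmd);
    }
  }

  private void runLocalPipeline(String pipeline, AddUpdateCommand cmd) { /* not designed yet */ }

  private void streamToProcessingCluster(String pipeline, AddUpdateCommand cmd) { /* not designed yet */ }
}
}}}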
  
- Solr``Pipeline``Dispatcher would have two modes: "local" and "distributed". In case of local
mode, the pipeline executes locally and results in the ProcessorChain being completed with
RunUpdateProcessorFactory submitting the content to local index. This would work well for
single-node as well as low load scenarios.
+ Solr``Pipeline``Dispatcher would have two possible modes: "local" and "distributed". In
local mode, the pipeline executes locally, and the ProcessorChain completes with
RunUpdateProcessorFactory submitting the content to the local index. This would work well for
single-node as well as low-load scenarios. Local mode is easiest to implement and could be
phase one.
  
- The "distributed" mode would enable more advanced dispatching (streaming) to a cluster of
remote worker nodes who executes the actual pipeline. This means that indexing will not (necessarily)
happen locally. Thus we introduce the possibility for a Solr node which takes on the role
of RequestHandler + Dispatcher only. 
+ We need a robust architecture for configuring and executing pipelines, preferably
multi-threaded. We could start from scratch or base it on another mature framework such as [[http://commons.apache.org/sandbox/pipeline/|Apache
Commons Pipeline]], Open``Pipe or some other project with a compatible license whose owners
are willing to donate it to the ASF. Apache Commons Pipeline is not exactly what we need; it
has a somewhat rigid stage architecture where each stage has its own queue and thread(s)
instead of running a whole pipeline in the same thread.
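
To make the contrast concrete, here is a minimal sketch of the per-document model hinted at
above, where one thread runs the whole pipeline of stages for each document. The interfaces
are invented for this example and are not an existing Solr or Commons Pipeline API.

{{{#!java
import java.util.List;

import org.apache.solr.common.SolrInputDocument;

// Invented stage abstraction: a stage transforms a document, or drops it by returning null.
interface ProcessingStage {
  SolrInputDocument process(SolrInputDocument doc) throws Exception;
}

// Invented pipeline: runs every stage in the calling thread, one document at a time.
class Pipeline {
  private final List<ProcessingStage> stages;

  Pipeline(List<ProcessingStage> stages) {
    this.stages = stages;
  }

  SolrInputDocument run(SolrInputDocument doc) throws Exception {
    for (ProcessingStage stage : stages) {
      doc = stage.process(doc);
      if (doc == null) {
        return null; // a stage decided to drop the document
      }
    }
    return doc;
  }
}
}}}

In this model, multi-threading comes from running several such pipeline instances concurrently,
one per feeding thread, rather than from one queue and thread per stage.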
  
+ == Distributed mode ==
+ The "distributed" mode would enable more advanced dispatching (streaming) to a cluster of
remote worker nodes which execute the actual pipeline. This means that indexing will not happen
locally. Thus a Solr node can take the role as RequestHandler + Pipeline``Dispatcher only,
or as a Document Processor only. The dispatcher streams output to a Request``Handler on the
processing node. When the pipeline has finished executing, the resulting documents enter the
Solr``Pipeline``Dispatcher again and get routed to the correct shard for indexing. As we can
tell, there are some major devlopment effort to support distributed pipelines!
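
For illustration only, one trivial way the dispatcher could calculate the target shard for a
processed document. Hash-based routing over a configured shard list is an assumption made for
this sketch; the proposal leaves shard assignment open.

{{{#!java
import java.util.List;

import org.apache.solr.common.SolrInputDocument;

// Illustrative only: the "id" field name and the shard URL list are assumptions.
class ShardRouter {
  static String pickShard(SolrInputDocument doc, List<String> shardUrls) {
    String uniqueKey = (String) doc.getFieldValue("id");
    int slot = (uniqueKey.hashCode() & 0x7fffffff) % shardUrls.size(); // stays non-negative
    return shardUrls.get(slot);
  }
}
}}}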
- On the remote end, there will be a Solr installation with a new Pipeline``Request``Handler
(cmd=processPipeline) which receives a stream of updateRequests and executes the correct pipeline.
When the pipeline has finished executing, the resulting documents enter the Solr``Pipeline``Dispatcher
again and gets dispatched to the correct shard for indexing. For this to work, the shard ID
must be configured or calculated somewhere (sounds like a good time to introduce general distributed
indexing!).
- 
- The shard masters which are the final targets for the pipeline will then receive the processed
documents through the Pipeline``Request``Handler (cmd=index) and finalize indexing.
- 
- The pipeline itself could be based on [[http://commons.apache.org/sandbox/pipeline/|Apache
Commons Pipeline]] or some code from one of the other existing pipeline projects. Benefit
with Commons Pipeline is that it is already an Apache library, built for scalability. However,
it must perhaps be adapted to suit our needs.
  
  = Risks =
-  * Automated distributed indexing is a larger problem
+  * Automated distributed indexing is a larger problem. Split the camel!
   * Multiple worker nodes introduce sequencing issues and potential deadlocks
   * Sophisticated dispatching and scheduling code is needed to make the model robust and fault-tolerant
  
  = Q&A =
- Q: Your question here
+ == Your question here ==
- A: Answer here
+ Answer here
  
