From: Apache Wiki
To: Apache Wiki
Date: Tue, 21 Sep 2010 01:18:31 -0000
Subject: [Solr Wiki] Update of "DocumentProcessing" by JanHoydahl

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "DocumentProcessing" page has been changed by JanHoydahl.
The comment on this change is: Cleaned up some paragraphs, added link to SMILA.
http://wiki.apache.org/solr/DocumentProcessing?action=diff&rev1=4&rev2=5

--------------------------------------------------

(This page is a child of the TaskList page)

= Problem =
- Solr needs a flexible document processing framework meeting the requirements of enterprise grade content integration. Most search projects have some need for processing the incoming content prior to indexing, for example:
+ Solr would benefit from a flexible document processing framework meeting the requirements of enterprise-grade content integration. Most search projects have some need for processing the incoming content prior to indexing, for example:
   * Language identification
   * Text extraction (Tika)
-  * Entity extraction and classification
+  * Entity extraction and classification (e.g. UIMA)
   * Data normalization and cleansing
   * 3rd-party systems integration (e.g. enriching a document from an external source)
   * etc.

- The built-in UpdateRequestProcessorChain is a very good starting point, as it is an integral part of the RequestHandler architecture. However, the chain is very simple, single-threaded and built only for local execution on the indexer. This means that any performance-heavy processing chain will slow down the whole indexer without any way to scale out processing independently.
+ The built-in UpdateRequestProcessorChain is a very good starting point. However, the chain is very simple, single-threaded and built only for local execution on the indexer node. This means that any performance-heavy processing chain will slow down the indexers without any way to scale out processing independently. We have seen FAST systems with far more servers doing document processing than indexing.
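The chain-of-stages idea behind UpdateRequestProcessorChain can be sketched in plain Java. This is an illustrative model only, not Solr's actual API: the `Stage` interface, the field-map document and both example stages are hypothetical names invented for this sketch.

```java
import java.util.*;
import java.util.function.UnaryOperator;

// Sketch only: a document is modeled as a field->value map, and a pipeline
// is an ordered list of stages, mirroring how a processor chain passes each
// incoming document through its processors before indexing.
public class PipelineSketch {

    /** A processing stage transforms a document and passes it on. Hypothetical, not a Solr type. */
    interface Stage extends UnaryOperator<Map<String, Object>> {}

    /** Runs the document through each stage in order, like a processor chain. */
    static Map<String, Object> run(List<Stage> pipeline, Map<String, Object> doc) {
        for (Stage stage : pipeline) {
            doc = stage.apply(doc);
        }
        return doc;
    }

    // Example stage 1: naive "language identification" based on an English stop word.
    static final Stage languageId = doc -> {
        String body = String.valueOf(doc.getOrDefault("body", ""));
        doc.put("language", body.contains(" the ") ? "en" : "unknown");
        return doc;
    };

    // Example stage 2: data normalization, here trimming and lower-casing the title field.
    static final Stage normalizeTitle = doc -> {
        Object title = doc.get("title");
        if (title != null) {
            doc.put("title", title.toString().trim().toLowerCase(Locale.ROOT));
        }
        return doc;
    };

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("title", "  Document Processing ");
        doc.put("body", "All about the pipeline.");
        Map<String, Object> out = run(List.of(languageId, normalizeTitle), doc);
        System.out.println(out.get("title") + " / " + out.get("language"));
        // prints: document processing / en
    }
}
```

The point of the sketch is that stages compose: a shared repository of such stages would let projects assemble pipelines from reusable parts instead of rewriting them.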
- There are many processing pipeline frameworks from which to get inspiration, such as the one in FAST ESP, [[http://www.openpipeline.org/|OpenPipeline]], [[http://openpipe.berlios.de/|OpenPipe]], [[http://www.pypes.org/|Pypes]], [[http://uima.apache.org/|UIMA]] and others. Indeed, many of these are already being used with Solr as a pre-processing server.
+ There are many processing pipeline frameworks from which to get inspiration, such as the one in FAST ESP, [[http://www.openpipeline.org/|OpenPipeline]], [[http://openpipe.berlios.de/|OpenPipe]], [[http://www.pypes.org/|Pypes]], [[http://uima.apache.org/|UIMA]], [[http://www.eclipse.org/smila/|Eclipse SMILA]] and others. Indeed, some of these are already being used with Solr as a pre-processing server. This means weak coupling but also weak re-use of code: each new project has to choose which of the pipelines to invest in.

- However, the Solr community needs one single solution and, more importantly, a repository of processing stages which can be shared and reused. The sharing part is crucial. If a company develops, say, a Geo``Names stage to translate addresses into lat/lon, the whole community can benefit by fetching the stage from the shared repository. This will not happen as long as there is not one single preferred integration point.
+ The community would benefit from an official processing framework and, more importantly, an official repository of processing stages which are shared and reused. The sharing part is crucial. If a company develops, say, a Geo``Names stage to translate addresses into lat/lon, the whole community can benefit by fetching the stage from the shared repository. This will not happen as long as there is no single preferred integration point.

- There has recently been interest in the Solr community in such a framework.
See [[http://lucene-eurocon.org/slides/A-Pipeline-for-Solr_Charas-Jansson.pdf|this presentation]] from Lucene Eurocon 2010 for thoughts from Find``Wise.
+ There has recently been interest in the Solr community in such a framework. See [[http://lucene-eurocon.org/slides/A-Pipeline-for-Solr_Charas-Jansson.pdf|this presentation]] from Lucene Eurocon 2010 as well as [[http://findabilityblog.se/solr-processing-pipeline|this blog post]] for thoughts from Find``Wise.

= Solution =
Develop a simple, scalable, easily scriptable and configurable document processing framework for Solr, which builds on existing best practices. The framework should be simple and lightweight enough for use with a single Solr node, and powerful enough to scale out in a separate document processing cluster, simply by changing configuration.

@@ -26, +26 @@

=== Must ===
   * Apache licensed
   * Java based
-  * Lightweight
+  * Lightweight (not over-engineered)
   * Support for multiple named pipelines, addressable at document ingestion
   * Must work with existing Request``Handlers (XML, CSV, DIH, Binary etc.) as entry point
   * Allow as a drop-in feature on existing installs (after upgrading to the needed Solr version)

@@ -53, +53 @@

   * Wrappers for custom FAST ESP stages to work with minor modification

= Anti-patterns =
-  * Do not require all new APIs
+  * Do not require new APIs, but allow feeding through existing Update``Request``Handlers

= Proposed architecture =
Hook into the context of the existing UpdateRequestProcessorChain (integrate in Content``Stream``Handler``Base) by providing a dispatcher class, Solr``Pipeline``Dispatcher. The dispatcher would be enabled and configured through the update parameters pipeline.name and pipeline.mode, either from the update request or in solrconfig.xml.

- Solr``Pipeline``Dispatcher would have two modes: "local" and "distributed".
In local mode, the pipeline executes locally and results in the ProcessorChain being completed with RunUpdateProcessorFactory submitting the content to the local index. This would work well for single-node as well as low-load scenarios.
+ Solr``Pipeline``Dispatcher would have two possible modes: "local" and "distributed". In local mode, the pipeline executes locally and results in the ProcessorChain being completed with RunUpdateProcessorFactory submitting the content to the local index. This would work well for single-node as well as low-load scenarios. Local mode is the easiest to implement and could be phase one.

- The "distributed" mode would enable more advanced dispatching (streaming) to a cluster of remote worker nodes which execute the actual pipeline. This means that indexing will not (necessarily) happen locally. Thus we introduce the possibility of a Solr node which takes on the role of RequestHandler + Dispatcher only.

+ We need a robust architecture for configuring and executing pipelines, preferably multi-threaded. We could start from scratch or base it on another mature framework such as [[http://commons.apache.org/sandbox/pipeline/|Apache Commons Pipeline]], Open``Pipe or some other project with a compatible license whose owners are willing to donate it to the ASF. Apache Commons Pipeline is not directly what we need: it has a somewhat rigid stage architecture, with each stage having its own queue and thread(s) instead of running a whole pipeline in the same thread.

+ == Distributed mode ==
+ The "distributed" mode would enable more advanced dispatching (streaming) to a cluster of remote worker nodes which execute the actual pipeline. This means that indexing will not happen locally. Thus a Solr node can take the role of RequestHandler + Pipeline``Dispatcher only, or of Document Processor only. The dispatcher streams output to a Request``Handler on the processing node.
When the pipeline has finished executing, the resulting documents enter the Solr``Pipeline``Dispatcher again and get routed to the correct shard for indexing. As we can tell, there is a major development effort required to support distributed pipelines!

- On the remote end, there will be a Solr installation with a new Pipeline``Request``Handler (cmd=processPipeline) which receives a stream of update requests and executes the correct pipeline. When the pipeline has finished executing, the resulting documents enter the Solr``Pipeline``Dispatcher again and get dispatched to the correct shard for indexing. For this to work, the shard ID must be configured or calculated somewhere (sounds like a good time to introduce general distributed indexing!).

- The shard masters which are the final targets for the pipeline will then receive the processed documents through the Pipeline``Request``Handler (cmd=index) and finalize indexing.

- The pipeline itself could be based on [[http://commons.apache.org/sandbox/pipeline/|Apache Commons Pipeline]] or some code from one of the other existing pipeline projects. The benefit of Commons Pipeline is that it is already an Apache library, built for scalability. However, it may have to be adapted to suit our needs.

= Risks =
-  * Automated distributed indexing is a larger problem
+  * Automated distributed indexing is a larger problem. Split the camel!
   * Introducing multiple worker nodes introduces sequencing issues and potential deadlocks
   * Need sophisticated dispatching and scheduling code to make a robust and fault-tolerant model

= Q&A =
- Q: Your question here
+ == Your question here ==
- A: Answer here
+ Answer here
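To make the proposed dispatcher configuration concrete, a solrconfig.xml fragment might look like the sketch below. Everything here beyond the standard chain elements is hypothetical: `SolrPipelineDispatcherFactory` does not exist, and `pipeline.name` / `pipeline.mode` are only the parameter names proposed above.

```xml
<!-- Hypothetical sketch: the dispatcher factory is proposed, not an existing Solr class. -->
<updateRequestProcessorChain name="pipeline-dispatch">
  <!-- Proposed dispatcher, configured via the pipeline.name / pipeline.mode parameters -->
  <processor class="solr.SolrPipelineDispatcherFactory">
    <str name="pipeline.name">default</str>
    <str name="pipeline.mode">local</str> <!-- or "distributed" -->
  </processor>
  <!-- Standard tail of a chain: log the update, then run it against the local index -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

Per the proposal, the same two parameters could also be supplied on the update request itself, so a client could switch a single feed to a remote processing cluster (e.g. `pipeline.mode=distributed`) without changing solrconfig.xml.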