From: Apache Wiki
To: Apache Wiki
Date: Tue, 21 Sep 2010 01:18:31 -0000
Subject: [Solr Wiki] Update of "DocumentProcessing" by JanHoydahl

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "DocumentProcessing" page has been changed by JanHoydahl.
The comment on this change is: Cleaned up some paragraphs, added link to SMILA.
http://wiki.apache.org/solr/DocumentProcessing?action=diff&rev1=4&rev2=5

--------------------------------------------------

(This page is a child of the TaskList page)

= Problem =
- Solr needs a flexible document processing framework meeting the requirements of enterprise grade content integration. Most search projects have some need for processing the incoming content prior to indexing, for example:
+ Solr would benefit from a flexible document processing framework meeting the requirements of enterprise-grade content integration. Most search projects have some need for processing the incoming content prior to indexing, for example:
   * Language identification
   * Text extraction (Tika)
-  * Entity extraction and classification
+  * Entity extraction and classification (e.g. UIMA)
   * Data normalization and cleansing
   * 3rd-party systems integration (e.g. enriching a document from an external source)
   * etc.

- The built-in UpdateRequestProcessorChain is a very good starting point, as it is an integral part of the RequestHandler architecture. However, the chain is very simple, single-threaded and built only for local execution on the indexer. This means that any performance-heavy processing chain will slow down the whole indexer without any way to scale out processing independently.
+ The built-in UpdateRequestProcessorChain is a very good starting point. However, the chain is very simple, single-threaded and built only for local execution on the indexer node. This means that any performance-heavy processing chain will slow down the indexers without any way to scale out processing independently. We have seen FAST systems with far more servers doing document processing than indexing.
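The chain-of-stages idea behind UpdateRequestProcessorChain can be sketched in plain Java. This is an illustrative model only, not Solr's actual API: the `Stage` interface, the field-map document and both example stages are hypothetical names invented for this sketch.

```java
import java.util.*;
import java.util.function.UnaryOperator;

// Sketch only: a document is modeled as a field->value map, and a pipeline
// is an ordered list of stages, mirroring how a processor chain passes each
// incoming document through its processors before indexing.
public class PipelineSketch {

    /** A processing stage transforms a document and passes it on. Hypothetical, not a Solr type. */
    interface Stage extends UnaryOperator<Map<String, Object>> {}

    /** Runs the document through each stage in order, like a processor chain. */
    static Map<String, Object> run(List<Stage> pipeline, Map<String, Object> doc) {
        for (Stage stage : pipeline) {
            doc = stage.apply(doc);
        }
        return doc;
    }

    // Example stage 1: naive "language identification" based on an English stop word.
    static final Stage languageId = doc -> {
        String body = String.valueOf(doc.getOrDefault("body", ""));
        doc.put("language", body.contains(" the ") ? "en" : "unknown");
        return doc;
    };

    // Example stage 2: data normalization, here trimming and lower-casing the title field.
    static final Stage normalizeTitle = doc -> {
        Object title = doc.get("title");
        if (title != null) {
            doc.put("title", title.toString().trim().toLowerCase(Locale.ROOT));
        }
        return doc;
    };

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("title", "  Document Processing ");
        doc.put("body", "All about the pipeline.");
        Map<String, Object> out = run(List.of(languageId, normalizeTitle), doc);
        System.out.println(out.get("title") + " / " + out.get("language"));
        // prints: document processing / en
    }
}
```

The point of the sketch is that stages compose: a shared repository of such stages would let projects assemble pipelines from reusable parts instead of rewriting them.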
- There are many processing pipeline frameworks from which to get inspiration, such as the one in FAST ESP, [[http://www.openpipeline.org/|OpenPipeline]], [[http://openpipe.berlios.de/|OpenPipe]], [[http://www.pypes.org/|Pypes]], [[http://uima.apache.org/|UIMA]] and others. Indeed, many of these are already being used with Solr as a pre-processing server.
+ There are many processing pipeline frameworks from which to get inspiration, such as the one in FAST ESP, [[http://www.openpipeline.org/|OpenPipeline]], [[http://openpipe.berlios.de/|OpenPipe]], [[http://www.pypes.org/|Pypes]], [[http://uima.apache.org/|UIMA]], [[http://www.eclipse.org/smila/|Eclipse SMILA]] and others. Indeed, some of these are already being used with Solr as a pre-processing server. This means weak coupling but also weak re-use of code: each new project has to choose which of the pipelines to invest in.

- However, the Solr community needs one single solution and, more importantly, a repository of processing stages which can be shared and reused. The sharing part is crucial. If a company develops, say, a Geo``Names stage to translate addresses into lat/lon, the whole community can benefit by fetching the stage from the shared repository. This will not happen as long as there is not one single preferred integration point.
+ The community would benefit from an official processing framework and, more importantly, an official repository of processing stages which are shared and reused. The sharing part is crucial. If a company develops, say, a Geo``Names stage to translate addresses into lat/lon, the whole community can benefit by fetching the stage from the shared repository. This will not happen as long as there is no single preferred integration point.

- There has recently been interest in the Solr community in such a framework.
See [[http://lucene-eurocon.org/slides/A-Pipeline-for-Solr_Charas-Jansson.pdf|this presentation]] from Lucene Eurocon 2010 for thoughts from Find``Wise.
+ There has recently been interest in the Solr community in such a framework. See [[http://lucene-eurocon.org/slides/A-Pipeline-for-Solr_Charas-Jansson.pdf|this presentation]] from Lucene Eurocon 2010 as well as [[http://findabilityblog.se/solr-processing-pipeline|this blog post]] for thoughts from Find``Wise.

= Solution =
Develop a simple, scalable, easily scriptable and configurable document processing framework for Solr, which builds on existing best practices. The framework should be simple and lightweight enough for use with a single Solr node, and powerful enough to scale out in a separate document processing cluster, simply by changing configuration.

@@ -26, +26 @@

=== Must ===
   * Apache licensed
   * Java based
-  * Lightweight
+  * Lightweight (not over-engineered)
   * Support for multiple named pipelines, addressable at document ingestion
   * Must work with existing Request``Handlers (XML, CSV, DIH, Binary etc.) as entry point
   * Allow as a drop-in feature on existing installs (after upgrading to the needed Solr version)

@@ -53, +53 @@

   * Wrappers for custom FAST ESP stages to work with minor modification

= Anti-patterns =
-  * Do not require all new APIs
+  * Do not require new APIs, but allow feeding through existing Update``Request``Handlers

= Proposed architecture =
Hook into the context of the existing UpdateRequestProcessorChain (integrate in Content``Stream``Handler``Base) by providing a dispatcher class, Solr``Pipeline``Dispatcher. The dispatcher would be enabled and configured through the update parameters pipeline.name and pipeline.mode, either from the update request or in solrconfig.xml.

- Solr``Pipeline``Dispatcher would have two modes: "local" and "distributed".
In local mode, the pipeline executes locally and results in the ProcessorChain being completed with RunUpdateProcessorFactory submitting the content to the local index. This would work well for single-node as well as low-load scenarios.
+ Solr``Pipeline``Dispatcher would have two possible modes: "local" and "distributed". In local mode, the pipeline executes locally and results in the ProcessorChain being completed with RunUpdateProcessorFactory submitting the content to the local index. This would work well for single-node as well as low-load scenarios. Local mode is the easiest to implement and could be phase one.

- The "distributed" mode would enable more advanced dispatching (streaming) to a cluster of remote worker nodes which execute the actual pipeline. This means that indexing will not (necessarily) happen locally. Thus we introduce the possibility of a Solr node which takes on the role of RequestHandler + Dispatcher only.

+ We need a robust architecture for configuring and executing pipelines, preferably multi-threaded. We could start from scratch or base it on another mature framework such as [[http://commons.apache.org/sandbox/pipeline/|Apache Commons Pipeline]], Open``Pipe or some other project with a compatible license whose owners are willing to donate it to the ASF. Apache Commons Pipeline is not directly what we need: it has a somewhat rigid stage architecture, with each stage having its own queue and thread(s) instead of running a whole pipeline in the same thread.

+ == Distributed mode ==
+ The "distributed" mode would enable more advanced dispatching (streaming) to a cluster of remote worker nodes which execute the actual pipeline. This means that indexing will not happen locally. Thus a Solr node can take the role of RequestHandler + Pipeline``Dispatcher only, or of Document Processor only. The dispatcher streams output to a Request``Handler on the processing node.
When the pipeline has finished executing, the resulting documents enter the Solr``Pipeline``Dispatcher again and get routed to the correct shard for indexing. As we can tell, there is a major development effort required to support distributed pipelines!

- On the remote end, there will be a Solr installation with a new Pipeline``Request``Handler (cmd=processPipeline) which receives a stream of update requests and executes the correct pipeline. When the pipeline has finished executing, the resulting documents enter the Solr``Pipeline``Dispatcher again and get dispatched to the correct shard for indexing. For this to work, the shard ID must be configured or calculated somewhere (sounds like a good time to introduce general distributed indexing!).

- The shard masters which are the final targets for the pipeline will then receive the processed documents through the Pipeline``Request``Handler (cmd=index) and finalize indexing.

- The pipeline itself could be based on [[http://commons.apache.org/sandbox/pipeline/|Apache Commons Pipeline]] or some code from one of the other existing pipeline projects. The benefit of Commons Pipeline is that it is already an Apache library, built for scalability. However, it may have to be adapted to suit our needs.

= Risks =
-  * Automated distributed indexing is a larger problem
+  * Automated distributed indexing is a larger problem. Split the camel!
   * Introducing multiple worker nodes introduces sequencing issues and potential deadlocks
   * Need sophisticated dispatching and scheduling code to make a robust and fault-tolerant model

= Q&A =
- Q: Your question here
+ == Your question here ==
- A: Answer here
+ Answer here
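To make the proposed dispatcher configuration concrete, a solrconfig.xml fragment might look like the sketch below. Everything here beyond the standard chain elements is hypothetical: `SolrPipelineDispatcherFactory` does not exist, and `pipeline.name` / `pipeline.mode` are only the parameter names proposed above.

```xml
<!-- Hypothetical sketch: the dispatcher factory is proposed, not an existing Solr class. -->
<updateRequestProcessorChain name="pipeline-dispatch">
  <!-- Proposed dispatcher, configured via the pipeline.name / pipeline.mode parameters -->
  <processor class="solr.SolrPipelineDispatcherFactory">
    <str name="pipeline.name">default</str>
    <str name="pipeline.mode">local</str> <!-- or "distributed" -->
  </processor>
  <!-- Standard tail of a chain: log the update, then run it against the local index -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

Per the proposal, the same two parameters could also be supplied on the update request itself, so a client could switch a single feed to a remote processing cluster (e.g. `pipeline.mode=distributed`) without changing solrconfig.xml.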