lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "DocumentProcessing" by JanHoydahl
Date Fri, 21 Oct 2011 15:33:53 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "DocumentProcessing" page has been changed by JanHoydahl:
http://wiki.apache.org/solr/DocumentProcessing?action=diff&rev1=20&rev2=21

  
  The built-in UpdateRequestProcessorChain is capable of doing simple processing jobs, but
it is only built for local execution on the indexer node, in the same thread. This means that
any performance-heavy processing chain will slow down the indexers, with no way to scale out
processing independently. We have seen FAST systems with far more servers doing document
processing than indexing.
  
- There are many processing pipeline frameworks from which to get inspiration, such as the
one in FAST ESP, [[http://www.openpipeline.org/|OpenPipeline]], [[http://openpipe.berlios.de/|OpenPipe]]
(now on [[https://github.com/kolstae/openpipe|GitHub]]), [[http://www.pypes.org/|Pypes]],
[[http://uima.apache.org/|UIMA]], [[http://www.eclipse.org/smila/|Eclipse SMILA]], [[http://commons.apache.org/sandbox/pipeline/|Apache
commons pipeline]], [[http://www.piped.io/|Piped]], [[https://github.com/jnioche/behemoth|Behemoth]]
and others. Indeed, some of these are already being used with Solr as a pre-processing server.
This means weak coupling but also weak re-use of code. Each new project will have to choose
which of the pipelines to invest in.
+ There are many processing pipeline frameworks from which to get inspiration, such as the
one in FAST ESP, [[http://www.openpipeline.org/|OpenPipeline]], [[http://openpipe.berlios.de/|OpenPipe]]
(now on [[https://github.com/kolstae/openpipe|GitHub]]), [[http://www.pypes.org/|Pypes]],
[[http://uima.apache.org/|UIMA]], [[http://www.eclipse.org/smila/|Eclipse SMILA]], [[http://commons.apache.org/sandbox/pipeline/|Apache
commons pipeline]], [[http://www.piped.io/|Piped]], [[https://github.com/jnioche/behemoth|Behemoth]],
Findwise's yet-to-be-announced pipeline, and others. Indeed, some of these are already being
used with Solr as pre-processing servers.
  
- The community would benefit from an official processing framework -- and more importantly
an official repository of processing stages which are shared and reused. The sharing part
is crucial. If a company develops, say a GeoNames stage to translate address into lat/lon,
the whole community can benefit from that by fetching the stage from the shared repository.
This will not happen as long as there is not one single preferred integration point.
+ Having a choice of technologies is good, but the current landscape is fragmented, and the many options can be overwhelming.
  
- There has recently been interest in the Solr community for such a framework. See [[http://lucene-eurocon.org/slides/A-Pipeline-for-Solr_Charas-Jansson.pdf|this
presentation]] from Lucene Eurocon 2010 as well as [[http://findabilityblog.se/solr-processing-pipeline|this
blog post]] for thoughts from FindWise, as well as the recent solr-user thread [[http://search-lucene.com/m/pFegS7BQ7k2|Pipeline
for Solr]] and Cominvent's talk at Lucene Eurocon 2011 [[http://www.slideshare.net/janhoy/improving-the-solr-update-chain|Improving
Solr's Update Chain]].
+ There has recently been interest within the search community in a true open source pipeline
with a healthy community behind it and a rich pool of processors. See [[http://lucene-eurocon.org/slides/A-Pipeline-for-Solr_Charas-Jansson.pdf|this
presentation]] from Lucene Eurocon 2010 and [[http://findabilityblog.se/solr-processing-pipeline|this
blog post]] for thoughts from Findwise, as well as the recent solr-user thread [[http://search-lucene.com/m/pFegS7BQ7k2|Pipeline
for Solr]] and Cominvent's talk at Lucene Eurocon 2011, [[http://www.slideshare.net/janhoy/improving-the-solr-update-chain|Improving
Solr's Update Chain]]. In addition to developing a preferred, truly open source solution, it
should also be possible to improve interoperability and compatibility between the existing pipelines.
  
- = Solution proposal =
- Develop a simple, scalable, easily scriptable and configurable document processing framework
for Solr, which builds on existing best practices. The framework should be simple and lightweight
enough for use with Solr single node, and powerful enough to scale out in a separate document
processing cluster, simply by changing configuration.
+ Here are a few things that we could consider in order to ease this situation:
+  * Start talking together and try to find common ground, places to cooperate, consolidate, etc.
+  * Develop a common Java interface which models a document processor, enabling cross-pipeline
use of the same processor (see the sketch after this list)
+  * Develop a Java wrapper for executing Python processors (reuse of ESP processors, Pypes
processors and Piped processors) in a Java pipeline
+  * Specify a common "Document" model which may be serialized between various components
(Avro based?)
+  * Establish a source repository (outside of the ASF) of reusable processors, maintained
by a large community
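+ To make the common-interface idea concrete, here is a minimal sketch of what such a Java interface could look like. Everything here (Document, Field, DocumentProcessor, ProcessingException) is a hypothetical illustration, not an existing API:

{{{
// Hypothetical common document model and processor interface.
// All names and signatures are illustrative, not an existing API.
import java.util.List;
import java.util.Map;

interface Field {
    String getName();
    List<Object> getValues();
    Map<String, String> getMetadata();   // e.g. tokenized=true, language=en
}

interface Document {
    String getId();
    Iterable<Field> getFields();
    Field getField(String name);
    Map<String, String> getMetadata();   // document-level metadata
}

/** A single processing stage; a pipeline is an ordered list of these. */
interface DocumentProcessor {
    /** Process one document; returning null drops it from the pipeline. */
    Document process(Document doc) throws ProcessingException;
}

class ProcessingException extends Exception {
    ProcessingException(String msg, Throwable cause) { super(msg, cause); }
}
}}}

+ A Jython-based wrapper for reusing Python processors (ESP, Pypes, Piped) could then implement the same DocumentProcessor interface, and an Avro schema mirroring Document would give a serialization format between components.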
  
- NOTE: It is not a given that the code needs to be part of the Solr/Lucene codebase itself.
It could start its life somewhere else, and perhaps later become an Apache project of its
own.
+ *Update*: At Lucene Eurocon 2011 in Barcelona, [[http://twitter.com/#!/cominvent/status/126997829121093632|we]]
formed an interest group for pipelines and will eventually set up a MeetUp group to continue
integration talks.
+ 
+ 
+ 
+ = Wishes for a Lucene-targeted pipeline =
+ Here are some thoughts and wishes for a new pipeline project mainly targeted at Lucene-based
search engines (including Solr, ElasticSearch and Lucene itself). It should probably build
upon or fork one of the existing projects and follow established best practices.
  
  == Key requirements ==
  === Must ===
@@ -32, +41 @@

   * Java based
   * Lightweight (not over-engineered)
   * Support for multiple named pipelines, addressable at document ingestion
+  * Support for a rich document format, including token streams (pre-analyzed content)
   * Support for metadata on document and field level (e.g. tokenized=true, language=en; see the sketch after this list)
-  * Allow scaling out processing to multiple dedicated servers for heavy tasks. Cloud-friendly
-  * Well defined API and SDK for the processing stages
+  * Well-defined, dead-simple API and SDK for the processing stages
   * Easy configuration of pipelines through separate config and GUI
+  * Run standalone as well as embedded in another framework (such as Solr's UpdateChain)
+  * Do not directly depend on Solr, but allow easy, tight integration with either Lucene
or Solr
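+ As an illustration of the named-pipeline and field-metadata requirements above, feeding a document might look like the following. This is purely a hypothetical API sketch (PipelineClient, MutableDocument and feed() are invented for illustration):

{{{
// Hypothetical client: the pipeline is addressed by name at ingestion
// time, and each field carries its own metadata.
PipelineClient client = new PipelineClient("http://pipelinehost:8983/");

MutableDocument doc = new MutableDocument("doc-1");
doc.addField("title", "Improving the Solr Update Chain");
doc.getField("title").getMetadata().put("language", "en");
doc.getField("title").getMetadata().put("tokenized", "false");

// "news-enrichment" would be one of several named pipelines in config.
client.feed("news-enrichment", doc);
}}}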
  
  === Should ===
-  * Function as a standalone data integration framework outside the context of Solr
-  * Allow drop-in integration with existing Solr installs, i.e. accept documents from any
Solr client including [[http://incubator.apache.org/connectors/|ManifoldCF]]
+  * SDK for stage developers - to encourage stage development
+  * Easily debuggable and testable
+  * Separate stages repository (e.g. a GitHub space, outside of ASF svn) to encourage sharing
+  * Integration points for UIMA, [[http://alias-i.com/lingpipe/|LingPipe]], [[http://opennlp.sourceforge.net/|OpenNLP]]
etc
+  * Be able to run Lucene's Tokenizers and Token Filters directly and ship this to Lucene
as the new "pre-analyzed" field (see [[https://issues.apache.org/jira/browse/SOLR-1535|SOLR-1535]] and the sketch after this list)
   * Support for writing stages in JVM scripting languages such as Jython
-  * Robust - if a batch fails, it should re-schedule to another processor
-  * Optimize for performance through e.g. batch support
-  * Support status callbacks to the client
-  * SDK for stage developers - to encourage stage development
-  * Separate stages repository (outside of ASF svn) to encourage sharing
-  * Integration points for UIMA, [[http://alias-i.com/lingpipe/|LingPipe]], [[http://opennlp.sourceforge.net/|OpenNLP]]
etc
-  * Integrate with Analysis so that if you tokenize in the Pipeline, analysis does not do
it over again.
-  * Allow re-use of TokenFilters from Analysis inside of Pipeline - avoid reinventing the
wheel
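+ As referenced from the "pre-analyzed" bullet above, a stage could run a plain Lucene Analyzer and keep the resulting terms as pre-analyzed content. A minimal sketch against the stock Lucene 3.x analysis API; how the terms would then be stored on the document (see SOLR-1535) is left open:

{{{
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalysisStage {
    private final Analyzer analyzer;

    public AnalysisStage(Analyzer analyzer) { this.analyzer = analyzer; }

    /** Tokenize a field value by running it through a Lucene Analyzer. */
    public List<String> analyze(String fieldName, String text) throws IOException {
        List<String> terms = new ArrayList<String>();
        TokenStream ts = analyzer.tokenStream(fieldName, new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            terms.add(term.toString());
        }
        ts.end();
        ts.close();
        return terms;
    }
}
}}}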
  
  === Could ===
   * GUI for configuring pipelines
   * Hot pluggable pipelines
   * Wrappers for custom FAST ESP stages to work with minor modification
   * Wrappers for custom UpdateProcessor stages to work with minor modification
+  * Robust - if a batch fails, it should re-schedule to another processor
+  * Optimize for performance through e.g. batch support
+  * Allow scaling out processing to multiple dedicated servers for heavy tasks. Cloud-friendly
+  * Support status callbacks to the client
  
  = Anti-patterns =
   * Do not over-architect like Eclipse SMILA and others have done, going crazy with ESB etc
@@ -62, +72 @@

   * Do not keep the source private (although Apache licensed) as DieselPoint did with OpenPipeline
- create a community!
  
  = Proposed architecture =
+ Jan Høydahl: I think OpenPipe is a hot candidate to fork as a new open source framework.
It already supports most of the above, is Apache licensed, and has been abandoned by its original
developers (so it is free to take over). A sketch of how such a pipeline could hook into Solr's
UpdateChain follows at the end of this section.
- [[https://docs.google.com/drawings/edit?id=1rVsy-p7sexSw3wrald2_fHtkLk6opYp5qzllvOHOB8c&hl=en|Architecture
diagram]] (Request edit permission if you want to edit)
- 
- A good starting point for the core (standalone) pipeline could be the Apache-licensed [[http://openpipe.berlios.de/|OpenPipe]],
which already works stand-alone.
- 
- Glue code to hook the pipeline into Solr could be an UpdateRequestProcessor which can either
work in "local" mode, executing the pipeline locally in-thread, or in "distributed" mode which
would dispatch the batch to an available node in a document processing cluster.
- 
- I envision that the whole pipeline could (in addition to running standalone) be wrapped
in a Solr RequestHandler i.e. a Document-processing-only node would be an instance of Solr
with a new BinaryDocumentRequestHandler, without a local index. When processing is finished,
the documents are routed to the final destination for indexing (perhaps using [[https://issues.apache.org/jira/browse/SOLR-2358|SOLR-2358]]).
- 
- The architecture diagram above shows the local and the fully distributed cases. Another
option would be to round-robin feeding to the set of pipeline nodes directly (not needing
a BinaryDocumentRequestHandler), and letting them do the distributed indexing as the last
UpdateProcessor.
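+ As referenced above: the Solr-side glue for running the pipeline inside the UpdateChain could be a thin UpdateRequestProcessor. The sketch below uses the real Solr 3.x UpdateRequestProcessor API, but runPipeline() and everything behind it is hypothetical; in "local" mode it would execute in-thread, in "distributed" mode it would dispatch the document to a processing node:

{{{
import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class PipelineUpdateProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
            SolrQueryResponse rsp, UpdateRequestProcessor next) {
        return new PipelineUpdateProcessor(next);
    }
}

class PipelineUpdateProcessor extends UpdateRequestProcessor {
    PipelineUpdateProcessor(UpdateRequestProcessor next) { super(next); }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
        // Run the external pipeline over the document before indexing.
        runPipeline(cmd.getSolrInputDocument());
        super.processAdd(cmd);   // hand the enriched document to the next processor
    }

    private void runPipeline(SolrInputDocument doc) {
        // Hypothetical hook: translate to/from the pipeline's Document
        // model and execute the configured stages (locally or remotely).
    }
}
}}}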
  
  = Risks =
+ TBD
-  * Automated distributed indexing [[https://issues.apache.org/jira/browse/SOLR-2358|SOLR-2358]]
needs to work with this
-  * Introducing multiple worker nodes introduces sequencing issues and potential deadlocks
-  * Need sophisticated dispatching and scheduling code to make a robust and fault tolerant
model
  
  = Q&A =
  == Your question here ==
-  * Q: Is there a JIRA issue that tracks the development of this feature?
+  * Q: Is there a JIRA issue that tracks the Solr-side development of this?
   * A: Not yet
  
   * Q: How is this related to https://issues.apache.org/jira/browse/SOLR-2129?
-  * A: SOLR-2129 is an UpdateProcessor for UIMA (see [[http://wiki.apache.org/solr/SolrUIMA|SolrUIMA]]).
Here we're talking about improving the whole UpdateProcessor framework, either by enhancing
the existing or creating a new project.
+  * A: SOLR-2129 is an UpdateProcessor for UIMA (see [[http://wiki.apache.org/solr/SolrUIMA|SolrUIMA]]).
Here we're talking about a new standalone framework and a way to integrate this and other
existing pipelines cleanly with Solr/Lucene.
  
   * Q: Will the pipelines have to be linear? For instance, could we implement a first stage
in the pipeline that would be a splitter? The splitter could, for example, break up a large
XML document into chapters, then push each chapter to the next stage where other processing
will take place. In the end, the Lucene index would have one document per chapter.
-  * A: In [[https://issues.apache.org/jira/browse/SOLR-2841|SOLR-2841]] we suggest a way
to make pipelines non-linear. For splitting into chapters, however, I think that an UpdateRequestHandler
may be a better choice, see http://wiki.apache.org/solr/XsltUpdateRequestHandler
+  * A: The new framework can work however we want it to. If you are talking about the Solr
UpdateChain, we suggest in [[https://issues.apache.org/jira/browse/SOLR-2841|SOLR-2841]] a way
to support non-linear chains. For splitting into chapters, however, I think that an UpdateRequestHandler
may be a better choice, see http://wiki.apache.org/solr/XsltUpdateRequestHandler
  
   * Q: How will the pipelines support compound files, e.g. archives, e-mail messages with
attachments (which could be archives), etc.? This could be a problem if pipelines are linear.
-  * A: Again, you have a choice whether your UpdateRequestHandler should understand the input
format and do the splitting for you. But it should also be possible to write an UpdateProcessor
which splits the incoming SolrInputDocument into multiple sub documents - generating unique
IDs for each. You would somehow need to inject these sub documents again, either by using
SolrJ from your UpdateProcessor or by instantiating a "sub chain" in another thread to push
the sub docs into the index. This is, however, left as an exercise for the user :)
+  * A: This is an open question. For the new pipeline framework there are many possibilities,
which must be discussed. If you're thinking about the Solr UpdateChain, you have a choice of
whether your UpdateRequestHandler should understand the input format and do the splitting
for you. But it should also be possible to write an UpdateProcessor which splits the incoming
SolrInputDocument into multiple sub-documents, generating unique IDs for each. You would then
need to inject these sub-documents again, either by using SolrJ from your UpdateProcessor
or by instantiating a "sub chain" in another thread to push them into the index. This
is, however, left as an exercise for the user :) (a rough sketch of the simplest variant follows)
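+ A hedged sketch of such a splitting UpdateProcessor against the Solr 3.x API: it derives sub-documents from the incoming SolrInputDocument, gives each a unique id, and pushes each one down the rest of the chain. The split() logic and the id scheme are hypothetical placeholders:

{{{
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class SplittingUpdateProcessor extends UpdateRequestProcessor {
    public SplittingUpdateProcessor(UpdateRequestProcessor next) { super(next); }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument parent = cmd.getSolrInputDocument();
        int part = 0;
        for (SolrInputDocument sub : split(parent)) {
            // Give each sub-document a unique id derived from the parent's.
            sub.setField("id", parent.getFieldValue("id") + "-" + (part++));
            AddUpdateCommand subCmd = new AddUpdateCommand();
            subCmd.solrDoc = sub;
            super.processAdd(subCmd);   // push down the rest of the chain
        }
    }

    /** Hypothetical: how to split (chapters, attachments, ...) is format-specific. */
    private List<SolrInputDocument> split(SolrInputDocument parent) {
        List<SolrInputDocument> subs = new ArrayList<SolrInputDocument>();
        subs.add(parent);   // placeholder: no real splitting performed here
        return subs;
    }
}
}}}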
  
