lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Trivial Update of "DocumentProcessing" by JanHoydahl
Date Mon, 06 Sep 2010 22:50:03 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "DocumentProcessing" page has been changed by JanHoydahl.
The comment on this change is: Minors.
http://wiki.apache.org/solr/DocumentProcessing?action=diff&rev1=2&rev2=3

--------------------------------------------------

  
  However, the Solr community needs one single solution and, more importantly, a repository
of processing stages which can be shared and reused. The sharing part is crucial. If a company
develops, say, a Geo``Names stage to translate addresses into lat/lon, the whole community can
benefit from that by fetching the stage from the shared repository. This will not happen as
long as there is no single preferred integration point.
  
- There has recently been interest in the Solr community for such a framework. See [[http://lucene-eurocon.org/slides/A-Pipeline-for-Solr_Charas-Jansson.pdf|this
presentation]] from Lucene Eurocon 2010 for thoughts from Findwise, and now is the time to
move.
+ There has recently been interest in the Solr community for such a framework. See [[http://lucene-eurocon.org/slides/A-Pipeline-for-Solr_Charas-Jansson.pdf|this
presentation]] from Lucene Eurocon 2010 for thoughts from Find``Wise.
  
  = Solution =
  Develop a simple, scalable, easily scriptable and configurable document processing framework
for Solr, which builds on existing best practices. The framework should be simple and lightweight
enough for use with Solr single node, and powerful enough to scale out in a separate document
processing cluster, simply by changing configuration.
@@ -27, +27 @@

   * Lightweight
   * Support for multiple named pipelines, addressable at document ingestion
   * Must work with existing Request``Handlers (XML, CSV, DIH, Binary etc) as entry point
+  * Allow use as a drop-in feature in existing installs (after upgrading to the needed Solr version)
   * Support for metadata on document and field level (e.g. tokenized=true, language=en)
   * Allow scaling out processing to multiple dedicated servers for heavy tasks
   * Well defined API for the processing stages
@@ -40, +41 @@

   * SDK for stage developers - to encourage stage development
   * Separate stage repository (outside of ASF svn) to encourage sharing
   * Integration points for UIMA, [[http://alias-i.com/lingpipe/|LingPipe]], [[http://opennlp.sourceforge.net/|OpenNLP]]
etc
+  * Integrate with Analysis so that if you tokenize in the Pipeline, analysis does not do
it over again.
+  * Allow re-use of TokenFilters from Analysis inside of Pipeline - avoid reinventing the
wheel
  
  === Could ===
   * GUI for configuring pipelines
   * Hot pluggable pipelines
   * Function as a standalone data integration framework outside the context of Solr
   * Wrappers for custom FAST ESP stages to work with minor modification
+ 
+ = Anti-patterns =
+  * Do not require all new APIs
  
  = Proposed architecture =
  Hook into the context of the existing UpdateRequestProcessorChain (integrate in Content``Stream``Handler``Base)
by providing a dispatcher class, Solr``Pipeline``Dispatcher. The dispatcher would be enabled
and configured through update parameters pipeline.name and pipeline.mode, either from the
update request or in solrconfig.xml.
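
To make the dispatcher idea concrete, here is a minimal sketch in Python rather than Solr's Java. It is not the proposed Solr``Pipeline``Dispatcher and it uses no Solr API. Only the parameter names pipeline.name and pipeline.mode come from the page; the class, the two stages, the pipeline names, and the rule that request parameters override configured defaults are assumptions made for illustration. The page does not say what values pipeline.mode takes, so the sketch resolves it but does not interpret it.

```python
from typing import Callable, Dict, List, Mapping

Document = Dict[str, object]
Stage = Callable[[Document], Document]


def strip_title(doc: Document) -> Document:
    """Hypothetical stage: trim whitespace around the title field."""
    return {**doc, "title": str(doc.get("title", "")).strip()}


def tag_language(doc: Document) -> Document:
    """Hypothetical stage: attach document-level metadata (language=en)."""
    return {**doc, "language": "en"}


class PipelineDispatcher:
    """Illustrative dispatcher: picks a named pipeline and runs its stages."""

    def __init__(
        self,
        pipelines: Mapping[str, List[Stage]],
        config_defaults: Mapping[str, str],
    ) -> None:
        self.pipelines = dict(pipelines)
        # Stands in for defaults that would be set in solrconfig.xml.
        self.config_defaults = dict(config_defaults)

    def resolve(self, request_params: Mapping[str, str]) -> Dict[str, str]:
        # Assumption: values on the update request override configured defaults.
        merged = dict(self.config_defaults)
        merged.update(request_params)
        return merged

    def dispatch(self, doc: Document, request_params: Mapping[str, str]) -> Document:
        params = self.resolve(request_params)
        name = params.get("pipeline.name")
        if name is None:
            # No pipeline selected anywhere: pass the document through unchanged.
            return doc
        if name not in self.pipelines:
            raise KeyError(f"unknown pipeline: {name}")
        for stage in self.pipelines[name]:
            doc = stage(doc)
        return doc


dispatcher = PipelineDispatcher(
    pipelines={"clean": [strip_title, tag_language], "trim": [strip_title]},
    config_defaults={"pipeline.name": "clean"},
)
print(dispatcher.dispatch({"title": "  Solr  "}, {}))
print(dispatcher.dispatch({"title": "  Solr  "}, {"pipeline.name": "trim"}))
```

The first call falls back to the configured default pipeline; the second selects a different pipeline per request, which is the "multiple named pipelines, addressable at document ingestion" requirement listed earlier.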
