lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "DocumentProcessing" by JanHoydahl
Date Tue, 18 Oct 2011 20:49:37 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "DocumentProcessing" page has been changed by JanHoydahl:
http://wiki.apache.org/solr/DocumentProcessing?action=diff&rev1=16&rev2=17

Comment:
Answering some questions

   * A: Not yet
  
   * Q: How is this related to https://issues.apache.org/jira/browse/SOLR-2129?
-  * A: SOLR-2129 is an UpdateProcessor for UIMA (see [[http://wiki.apache.org/solr/SolrUIMA|SolrUIMA]]).
Here we're talking about improving the whole UpdateProcessor framework, either by replacing
it or enhancing the existing.
+  * A: SOLR-2129 is an UpdateProcessor for UIMA (see [[http://wiki.apache.org/solr/SolrUIMA|SolrUIMA]]).
Here we're talking about improving the whole UpdateProcessor framework, either by enhancing
the existing one or by creating a new project.
  
   * Q: Will the pipelines have to be linear. For instance, could we implement a first stage
in the pipeline that would be a splitter. The splitter could, for example, break up a large
XML document into chapters, then push each chapter to the next stage where other processing
will take place. In the end, the Lucene index would have one document per chapter.
-  * A:
+  * A: In [[https://issues.apache.org/jira/browse/SOLR-2841|SOLR-2841]] we suggest a way
to make pipelines non-linear. For splitting into chapters, however, I think an UpdateRequestHandler
may be a better choice; see http://wiki.apache.org/solr/XsltUpdateRequestHandler
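As an illustration of what such a splitter stage would do, here is a plain-Java sketch (not Solr API; the `<chapter>` element and its `title` attribute are assumptions about the input format) that turns one XML document into one (id, title, text) record per chapter, deriving a unique id for each from the parent document's id:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch of the chapter-splitter idea: one XML document in, one record
// per <chapter> element out. In Solr this logic would live in an
// UpdateRequestHandler or an XSLT stylesheet, not in standalone code.
public class ChapterSplitter {
    public static List<String[]> split(String xml, String baseId) {
        try {
            org.w3c.dom.Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList chapters = doc.getElementsByTagName("chapter");
            List<String[]> subDocs = new ArrayList<>();
            for (int i = 0; i < chapters.getLength(); i++) {
                Element ch = (Element) chapters.item(i);
                // Derive a unique id per chapter from the parent document's id.
                subDocs.add(new String[] {
                    baseId + "-ch" + (i + 1),
                    ch.getAttribute("title"),
                    ch.getTextContent().trim()
                });
            }
            return subDocs;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Each record would then become its own Lucene document, matching the "one document per chapter" goal in the question.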
  
   * Q: How will the pipelines support compound files, e.g. archives, e-mail messages with
attachments (which could be archives), etc.? This could be a problem if pipelines are linear.
-  * A:
+  * A: Again, you have a choice as to whether your UpdateRequestHandler should understand the input
format and do the splitting for you. But it should also be possible to write an UpdateProcessor
which splits the incoming SolrInputDocument into multiple sub documents, generating a unique
ID for each. You would then need to inject these sub documents again, either by using
SolrJ from your UpdateProcessor or by instantiating a "sub chain" in another thread to push
the sub docs into the index. This is, however, left as an exercise for the user :)
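A plain-Java stand-in for that splitting step (field maps instead of SolrInputDocument, since this is a sketch rather than a Solr processor; the `parent_id` and `-att` naming are assumptions): the compound document is indexed as-is, and each attachment becomes a sub document with a unique id derived from the parent's id.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the splitting UpdateProcessor described above: a "document"
// is a field map, each attachment becomes a sub document, and every sub
// document gets a unique id plus a back-reference to its parent. A real
// processor would build SolrInputDocuments and re-inject them via SolrJ
// or a sub chain.
public class CompoundSplitter {
    public static List<Map<String, String>> split(Map<String, String> parent,
                                                  List<String> attachments) {
        List<Map<String, String>> out = new ArrayList<>();
        out.add(parent); // index the container document itself as well
        String parentId = parent.get("id");
        int n = 0;
        for (String body : attachments) {
            Map<String, String> sub = new LinkedHashMap<>();
            sub.put("id", parentId + "-att" + (++n)); // unique id per sub doc
            sub.put("parent_id", parentId);           // back-reference to parent
            sub.put("body", body);
            out.add(sub);
        }
        return out;
    }
}
```

For nested compounds (an archive inside an e-mail), the same step would simply be applied recursively to each sub document.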
  
