lucene-solr-dev mailing list archives

From Jan Høydahl / Cominvent <jan....@cominvent.com>
Subject Re: Solr Cell revamped as an UpdateProcessor?
Date Fri, 22 Jan 2010 22:37:02 GMT
On 8. des. 2009, at 00.29, Grant Ingersoll wrote:
> On Dec 7, 2009, at 3:51 PM, Chris Hostetter wrote:
>> As someone with very little knowledge of Solr Cell and/or Tika, I find myself wondering
>> if ExtractingRequestHandler would make more sense as an extractingUpdateProcessor -- where
>> it could be configured to take either binary fields (or string fields containing URLs)
>> out of the Documents, parse them with Tika, and add the various XPath matching hunks of text
>> back into the document as new fields.
>> 
>> Then ExtractingRequestHandler just becomes a handler that slurps up its ContentStreams
>> and adds them as binary data fields and adds the other literal params as fields.
>> 
>> Wouldn't that make things like SOLR-1358, and using Tika with URLs/filepaths in XML
>> and CSV based updates fairly trivial?
> 
> It probably could, but am not sure how it works in a processor chain.  However, I'm not
> sure I understand how they work all that much either.  I also plan on adding, BTW, a SolrJ
> client for Tika that does the extraction on the client.  In many cases, the ExtrReqHandler
> is really only designed for lighter weight extraction cases, as one would simply not want
> to send that much rich content over the wire.

Good match. UpdateProcessors are the way to go for functionality that modifies documents prior
to indexing.
With this, we can mix and match any type of content source with other processing needs.

I think it can be beneficial to have the choice to do extraction on the SolrJ side. But you
don't always have that choice. If your source is a crawler without built-in Tika, a
base64-encoded field in an XML document, or some other random source, you want to be able to
do the extraction at an arbitrary place in the chain.

Examples:
  Crawler (httpheaders, binarybody) -> TikaUpdateProcessor (+title, +text, +meta...) -> index
  XML (title, pdfurl) -> GetUrlProcessor (+pdfbin) -> TikaUpdateProcessor (+text, +meta) -> index
  DIH (city, street, lat, lon) -> LatLon2GeoHashProcessor (+geohash) -> index
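
To make the chaining idea concrete, here is a minimal Python sketch of the second example
above (XML -> GetUrlProcessor -> TikaUpdateProcessor). It is not Solr's actual
UpdateRequestProcessor API: the dict-based document, the function names, and the stubbed
fetch and extract helpers are all assumptions made for illustration. A real chain would call
an HTTP client and Tika where the stubs are.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]
Processor = Callable[[Document], Document]


def get_url_processor(fetch: Callable[[str], bytes]) -> Processor:
    """Hypothetical stage: if the document has 'pdfurl', add the fetched bytes as 'pdfbin'."""
    def process(doc: Document) -> Document:
        if "pdfurl" in doc:
            doc = {**doc, "pdfbin": fetch(str(doc["pdfurl"]))}
        return doc
    return process


def tika_processor(extract: Callable[[bytes], Document], source_field: str) -> Processor:
    """Hypothetical stage: parse a binary field and merge the extracted fields into the document."""
    def process(doc: Document) -> Document:
        if source_field in doc:
            doc = {**doc, **extract(doc[source_field])}
        return doc
    return process


def run_chain(doc: Document, chain: List[Processor]) -> Document:
    """Pass the document through each processor in order; the result would go to the indexer."""
    for processor in chain:
        doc = processor(doc)
    return doc


# Stubs standing in for a real HTTP fetch and a real Tika extraction (assumptions).
def fake_fetch(url: str) -> bytes:
    return b"fake bytes for " + url.encode("utf-8")


def fake_extract(data: bytes) -> Document:
    return {"text": data.decode("utf-8"), "meta": {"length": len(data)}}


chain = [get_url_processor(fake_fetch), tika_processor(fake_extract, "pdfbin")]
result = run_chain({"title": "Report", "pdfurl": "http://example.com/a.pdf"}, chain)
print(sorted(result))
```

Each stage only acts when its input field is present, so a document from another source
(for instance the DIH example, which has neither pdfurl nor pdfbin) passes through this
chain unchanged. That is what lets the same processors be reused across content sources.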

I propose to model the document processor chain more closely on FAST ESP's flexible processing
chain, which I regard as an industry best practice. I'm thinking of starting a Wiki page
to sketch the direction we should take.

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

