manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Document Processing
Date Mon, 05 Dec 2011 18:51:26 GMT
Solr is designed for exactly this kind of processing and
configurability; the Solr connector is concerned only with getting
the documents to Solr.  So I think your best bet is to either use
the existing Solr pipeline infrastructure, or write your own update
handler that does what you need.  (Obviously the former is preferred
over the latter...)

Karl
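
To make the "write your own" option concrete, here is a minimal sketch of
the per-document field extraction such a component would perform.
Everything in it is an assumption for illustration: the FieldExtractor
class, the field names "title" and "heading", and the regular expressions
are hypothetical and are not part of Solr or ManifoldCF.  It uses only the
Java standard library; inside Solr, logic like this would typically sit in
the update processing chain rather than in a standalone class.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Solr or ManifoldCF API: pulls selected pieces
// of an HTML page into named fields before the document is indexed.
class FieldExtractor {

    // Field name -> pattern whose first capture group holds the field value.
    // Both the field names and the patterns are illustrative assumptions.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();

    static {
        int flags = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
        RULES.put("title", Pattern.compile("<title[^>]*>(.*?)</title>", flags));
        RULES.put("heading", Pattern.compile("<h1[^>]*>(.*?)</h1>", flags));
    }

    // Returns one entry per rule that matched; rules that do not match are skipped.
    static Map<String, String> extractFields(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(html);
            if (m.find()) {
                // Replace nested tags with spaces, then collapse whitespace.
                String text = m.group(1)
                        .replaceAll("<[^>]+>", " ")
                        .replaceAll("\\s+", " ")
                        .trim();
                fields.put(rule.getKey(), text);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Product page</title></head>"
                + "<body><h1 class=\"x\">Widget <b>Pro</b></h1></body></html>";
        // Each map entry would become its own field on the indexed document.
        System.out.println(extractFields(html));
    }
}
```

Regular expressions are used here only to keep the sketch dependency-free;
they are fragile on real-world HTML, and a proper HTML parser would be the
safer choice in practice.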


On Mon, Dec 5, 2011 at 1:45 PM, Michael Kelleher <mj.kelleher@gmail.com> wrote:
> I am crawling a bunch of HTML pages within a site, which will be sent to
> Solr for indexing.  I want to extract some content from the pages, with
> each piece of content stored as its own field BEFORE indexing in Solr.
>
> My guess would be that I should use a document processing pipeline in Solr
> such as UIMA, or something similar.
>
> However, to limit the amount of load on Solr, I was wondering if there was a
> way to "hook" into the Solr connector to create these additional fields /
> handle this processing.  Maybe this would be an "extended" Solr connector
> that I would create.
>
> Or should this really be done within Solr, because Solr already handles this
> kind of processing?
>
> Any guidance / help would be great.
>
> thanks.
