lucene-solr-user mailing list archives

From Michael Kelleher <mj.kelle...@gmail.com>
Subject Re: Document Processing
Date Mon, 05 Dec 2011 19:57:11 GMT
On 12/05/2011 01:52 PM, Michael Kelleher wrote:
> I am crawling a set of HTML pages within a site (using ManifoldCF); 
> the pages will be sent to Solr for indexing.  I want to extract some 
> content from each page and store each piece of content in its own 
> field BEFORE indexing in Solr.
>
> My guess is that I should use a document processing pipeline in 
> Solr, such as UIMA or something similar.
>
> What would be the best way of handling this kind of processing?  Would 
> it be preferable to use a document processing pipeline such as 
> OpenPipe, UIMA, etc.?  Should this be handled externally, or would the 
> DataImportHandler suffice?
>
> The Solr server being used for this will solely be used for indexing, 
> and the "submit" jobs from the crawler will be very controlled, and 
> not high volume after the initial crawl.
>
> thanks.
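
For the "handled externally" option, here is a minimal sketch of pulling a few pieces of content out of a page and mapping each to its own field before anything is sent to Solr. It uses only the Python standard library. The field names (`id`, `title`, `heading`), the choice of `<title>` and `<h1>` as the content to extract, and the helper names are all assumptions for illustration; the real fields would have to match the Solr schema, and the submit step (which depends on the Solr version and update handler in use) is not shown.

```python
import json
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collects the <title> text and the text of each <h1> from an HTML page."""

    def __init__(self):
        super().__init__()
        self._current = None   # tag whose text is currently being captured
        self.title = ""
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self._current = tag
            if tag == "h1":
                self.headings.append("")  # start a new heading buffer

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "h1":
            self.headings[-1] += data


def html_to_solr_doc(doc_id, html):
    """Build a dict with one key per target field (field names are hypothetical)."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return {
        "id": doc_id,
        "title": parser.title.strip(),
        "heading": [h.strip() for h in parser.headings],
    }


if __name__ == "__main__":
    page = "<html><head><title>Example</title></head><body><h1>Intro</h1></body></html>"
    # The resulting dict is JSON-serializable and ready to hand to whatever
    # client code submits documents to Solr.
    print(json.dumps(html_to_solr_doc("page-1", page)))
```

Doing the extraction outside Solr like this keeps the crawler-side logic easy to test, which may be enough for a controlled, low-volume submit job. A pipeline such as UIMA would make more sense if the extraction needs real text analysis and not just markup lookups.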

