lucene-solr-user mailing list archives

From Marcello Lorenzi <>
Subject Re: SolR vs large PDF
Date Wed, 27 Nov 2013 20:21:31 GMT
Hi Erick,
In our architecture we use Apache ManifoldCF: the scheduling is triggered 
from Manifold-web, and the Manifold-agent takes the PDF files from the 
filesystem and sends them to the Solr instances. Is it possible to redirect 
the ManifoldCF scheduling to a SolrJ instance for specific schedules?


On 11/27/2013 06:14 PM, Erick Erickson wrote:
> I'm assuming you're using the ExtractingRequestHandler. Offloading
> the entire work onto your Solr box that is also serving queries
> and indexing is not going to scale well.
> Consider using Tika/SolrJ (Tika is what the ERH uses anyway) to
> offload the PDF parsing amongst as many clients as you can afford.
> Here's a way to get started:
> Best,
> Erick
> On Wed, Nov 27, 2013 at 10:00 AM, Marcello Lorenzi <> wrote:
>> Hi All,
>> in our test environment we have implemented a new search engine based on
>> Solr 4.3, with 2 instances hosted on different servers and 1 shard on
>> each servlet container.
>> During some stress tests we noticed a bottleneck in the crawling of large PDF
>> files, which blocks the serving of query results from the collections.
>> Is it possible to speed up the crawling or mitigate the overhead created by
>> PDFBox?
>> Thanks,
>> Marcello
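
The approach Erick describes (parse the PDFs on client machines and send only the extracted text to Solr) could look roughly like the sketch below. It is not the SolrJ/Tika code he refers to: it uses only the Python standard library and posts to Solr's JSON update handler. The `extract_text` function is a placeholder for a real parser such as Tika, and the field names `id` and `content` are assumptions that would have to match the actual schema.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_text(path):
    """Placeholder extractor (assumption): swap in a real PDF parser such as Tika."""
    return Path(path).read_bytes().decode("utf-8", errors="replace")


def build_docs(paths, extractor=extract_text, workers=4):
    """Parse files on the client, in parallel, and return Solr-ready documents."""
    paths = [str(p) for p in paths]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the results in the same order as the input paths.
        texts = list(pool.map(extractor, paths))
    # Field names "id" and "content" are assumptions; match them to your schema.
    return [{"id": p, "content": t} for p, t in zip(paths, texts)]


def post_docs(update_url, docs, timeout=60):
    """Send already-extracted text to Solr's JSON update handler."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A client would call something like `post_docs("http://localhost:8983/solr/collection1/update?commit=true", build_docs(pdf_paths))`, where the host and collection name are hypothetical. The Solr nodes then only index plain text, so the cost of parsing large PDFs stays on the machines running this client. A thread pool fits when the extractor calls out to an external process or service; a CPU-bound pure-Python parser would be better served by a process pool.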
