lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: SolR vs large PDF
Date Wed, 27 Nov 2013 17:14:49 GMT
I'm assuming you're using the ExtractingRequestHandler (ERH). Putting
all of the extraction work on the same Solr box that is also serving
queries and indexing is not going to scale well.

Consider using Tika with SolrJ (Tika is what the ERH uses anyway) to
spread the PDF parsing across as many clients as you can afford.
Here's a way to get started:

http://searchhub.org/2012/02/14/indexing-with-solrj/
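As a rough sketch of the shape this takes (this is not the code from
that article): the PDF parsing runs in a thread pool on the client, and
Solr only receives text that has already been extracted. The names
ClientSideExtraction, TextExtractor, DocumentSink and indexAll are
hypothetical and exist only for this illustration. In a real client I'd
expect the extractor to wrap Tika (for example AutoDetectParser with a
BodyContentHandler) and the sink to wrap a SolrJ client's
add(SolrInputDocument), but those two pieces are left as pluggable
hooks here so the sketch needs nothing beyond the JDK.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ClientSideExtraction {

    /** Hypothetical hook: a real client would call Tika in here. */
    interface TextExtractor {
        String extract(Path file) throws Exception;
    }

    /** Hypothetical hook: a real client would send a document via SolrJ in here. */
    interface DocumentSink {
        void add(String id, String text) throws Exception;
    }

    /**
     * Parses every file on a client-side thread pool and hands the extracted
     * text to the sink. Returns the number of files that were processed
     * without an error.
     */
    static int indexAll(List<Path> files, TextExtractor extractor,
                        DocumentSink sink, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (Path file : files) {
                results.add(pool.submit(() -> {
                    // The expensive parsing happens here, off the Solr server.
                    String text = extractor.extract(file);
                    // Only the already-extracted text goes to the sink.
                    sink.add(file.toString(), text);
                    return true;
                }));
            }
            int succeeded = 0;
            for (Future<Boolean> result : results) {
                try {
                    if (result.get()) {
                        succeeded++;
                    }
                } catch (ExecutionException e) {
                    // One unparseable file should not stop the rest of the batch.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            return succeeded;
        } finally {
            pool.shutdown();
        }
    }
}
```

The point of the split is that you can run this on as many client
machines as you like, each with its own thread count, and a huge or
corrupt PDF only costs you a client thread rather than query capacity
on the Solr nodes.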

Best,
Erick


On Wed, Nov 27, 2013 at 10:00 AM, Marcello Lorenzi <mlorenzi@sorint.it> wrote:

> Hi All,
> on our test environment we have implemented a new search engine based on
> Solr 4.3 with 2 instances hosted on different servers and 1 shard present
> on each servlet container.
>
> During some stress tests we noticed a bottleneck in the crawling of large
> PDF files that blocks the serving of query results from the collections.
>
> Is it possible to reduce or mitigate the overhead created by PDFBox during
> the crawling?
>
> Thanks,
> Marcello
>
