Reply-To: solr-user@lucene.apache.org
Message-id: <5296544B.60902@sorint.it>
Date: Wed, 27 Nov 2013 21:21:31 +0100
From: Marcello Lorenzi <mlorenzi@sorint.it>
To: solr-user@lucene.apache.org
Cc: Erick Erickson <erickerickson@gmail.com>
Subject: Re: SolR vs large PDF
References: <52960920.5040103@sorint.it>
 <CAN4YXvefQjzw2z1SndD5BsYiqzkzcAn+3O4gCUEpuX7vZKGqCA@mail.gmail.com>
In-reply-to: 
 <CAN4YXvefQjzw2z1SndD5BsYiqzkzcAn+3O4gCUEpuX7vZKGqCA@mail.gmail.com>

Hi Erick,
In our architecture we use Apache ManifoldCF: jobs are scheduled from
Manifold-web, and the Manifold-agent takes the PDF files from the
filesystem and sends them to the Solr instances. Is it possible to
redirect specific ManifoldCF schedules to a SolrJ client instead?

Thanks,
Marcello
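
P.S. For reference, the client-side approach suggested in the quoted
reply can be sketched roughly as below: the expensive text extraction
runs on a bounded thread pool on the client, and only the resulting
plain text is handed to the indexer. This is a minimal sketch under
stated assumptions, not a working Tika/SolrJ integration. The names
ClientSideExtraction, Indexer, and indexAll are hypothetical; the
extractor function stands in for a Tika parser and Indexer stands in
for a SolrJ client, so that the example runs with the JDK alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ClientSideExtraction {

    /** Hypothetical stand-in for a SolrJ client; it only ever sees extracted text. */
    interface Indexer {
        void add(String id, String text);
    }

    /**
     * Extracts text from every path on a fixed-size thread pool, then passes
     * the extracted text to the indexer from the calling thread.
     * Returns the number of documents handed to the indexer.
     */
    static int indexAll(List<String> paths,
                        Function<String, String> extractor, // assumption: would wrap Tika
                        Indexer indexer,                    // assumption: would wrap SolrJ
                        int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit all extraction tasks first so they can run in parallel.
            List<Future<String>> pending = new ArrayList<>();
            for (String path : paths) {
                Callable<String> task = () -> extractor.apply(path);
                pending.add(pool.submit(task));
            }
            // Collect results in submission order and index them one by one.
            int indexed = 0;
            for (int i = 0; i < paths.size(); i++) {
                String text = pending.get(i).get(); // rethrows extraction failures
                indexer.add(paths.get(i), text);
                indexed++;
            }
            return indexed;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // In-memory map used here in place of a real index.
        Map<String, String> index = new ConcurrentHashMap<>();
        int n = indexAll(Arrays.asList("a.pdf", "b.pdf"),
                         path -> "text of " + path, // fake extractor for the demo
                         index::put,
                         2);
        System.out.println("indexed " + n + " documents");
    }
}
```

The pool size bounds how much parsing runs at once, and the parsing
load lands on the client machines rather than on the Solr servers
that are answering queries.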

On 11/27/2013 06:14 PM, Erick Erickson wrote:
> I'm assuming you're using the ExtractingRequestHandler. Offloading
> the entire work onto your Solr box that is also serving queries
> and indexing is not going to scale well.
>
> Consider using Tika/SolrJ (Tika is what the ERH uses anyway) to
> offload the PDF parsing amongst as many clients as you can afford.
> Here's a way to get started:
>
> http://searchhub.org/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
>
>
> On Wed, Nov 27, 2013 at 10:00 AM, Marcello Lorenzi <mlorenzi@sorint.it> wrote:
>
>> Hi All,
>> in our test environment we have implemented a new search engine based on
>> Solr 4.3 with 2 instances hosted on different servers and 1 shard present
>> on each servlet container.
>>
>> During stress tests we noticed a bottleneck in the crawling of large PDF
>> files that blocks the serving of query results from the collections.
>>
>> Is it possible to reduce or mitigate the overhead created by PDFBox during
>> the crawling?
>>
>> Thanks,
>> Marcello
>>