From: Marcello Lorenzi
Date: Wed, 27 Nov 2013 21:21:31 +0100
To: solr-user@lucene.apache.org
Cc: Erick Erickson
Subject: Re: SolR vs large PDF
Message-id: <5296544B.60902@sorint.it>
References: <52960920.5040103@sorint.it>

Hi Erick,

In our architecture we use Apache ManifoldCF: the scheduling is invoked from Manifold-web, and the Manifold-agent takes the PDF files from the filesystem and sends them to the Solr instances. Is it possible to redirect the Manifold scheduling to a SolrJ instance for specific schedules?

Thanks,
Marcello

On 11/27/2013 06:14 PM, Erick Erickson wrote:
> I'm assuming you're using the ExtractingRequestHandler. Offloading
> the entire work onto your Solr box that is also serving queries
> and indexing is not going to scale well.
>
> Consider using Tika/SolrJ (Tika is what the ERH uses anyway) to
> offload the PDF parsing amongst as many clients as you can afford.
> Here's a way to get started:
>
> http://searchhub.org/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
>
>
> On Wed, Nov 27, 2013 at 10:00 AM, Marcello Lorenzi wrote:
>
>> Hi All,
>> on our test environment we have implemented a new search engine based on
>> Solr 4.3 with 2 instances hosted on different servers and 1 shard present
>> on each servlet container.
>>
>> During some stress test we noticed a bottleneck into crawling of large PDF
>> file that blocks the serving of results from queries to the collections.
>>
>> Is it possible to boost or mitigate the overhead created by PDFBOX during
>> the crawling?
>>
>> Thanks,
>> Marcello
>>
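
The client-side approach suggested in the quoted reply (parse the PDFs outside Solr, then send only the extracted text) can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not the Tika/SolrJ code from the linked article. `extract_text` is a hypothetical stand-in for a real parser such as Tika, the field names `id` and `content` are assumptions about the schema, and the thread pool stands in for what would more likely be several separate client machines.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_text(pdf_path):
    """Hypothetical placeholder for the real PDF parser (e.g. Tika)."""
    # Assumption: a real implementation returns the document's plain text.
    # Here the raw bytes are decoded so the sketch runs without extra libraries.
    return Path(pdf_path).read_bytes().decode("latin-1")


def build_doc(pdf_path, extract=extract_text):
    """Parse one file on the client and return a small text-only document."""
    path = Path(pdf_path)
    # "id" and "content" are assumed field names; use the ones in your schema.
    return {"id": str(path), "content": extract(path)}


def post_docs(docs, update_url):
    """POST already-extracted documents to a Solr JSON update URL."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_pdfs(paths, send, workers=4, batch_size=10, extract=extract_text):
    """Parse PDFs in a client-side pool, then hand text-only batches to `send`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the input order, so batches follow the order of `paths`.
        docs = list(pool.map(lambda p: build_doc(p, extract), paths))
    sent = 0
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        send(batch)
        sent += len(batch)
    return sent
```

A call might look like `index_pdfs(paths, lambda batch: post_docs(batch, update_url))`, where `update_url` is whatever JSON update endpoint your Solr instance exposes (for example a `/update` handler on the target collection; the exact URL is an assumption here). The point of the design is that Solr only receives small text documents, so the expensive PDF parsing no longer competes with query serving on the Solr boxes.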