Subject: Re: Bypassing ExtractingRequestHandler
To: solr-user@lucene.apache.org
From: Charlie Hull
Message-ID: <77a2386d-21de-60f2-e497-97b39873131e@flax.co.uk>
Date: Fri, 10 Jun 2016 09:22:04 +0100

On 10/06/2016 02:20, Justin Lee wrote:
> Has anybody had any experience bypassing ExtractingRequestHandler and
> simply managing Tika manually? I want to make a small modification to Tika
> to get and save additional data from my PDFs, but I have been
> procrastinating in no small part due to the unpleasant prospect of setting
> up a development environment where I could compile and debug modifications
> that might run through PDFBox, Tika, and ExtractingRequestHandler. It
> occurs to me that it would be much easier if the two were separate, so I
> could have direct control over Tika and just submit the text to Solr after
> extraction. Am I going to regret this approach? I'm not sure what
> ExtractingRequestHandler really does for me that Tika doesn't already do.

We tend to prefer running Tika externally as it's entirely possible that
Tika will crash or hang with certain files - and that will bring down
Solr if you're running Tika within it.

Here's a Dropwizard wrapper around Tika that might be of use:
https://github.com/mattflax/dropwizard-tika-server

Cheers

Charlie

>
> Also, I was reading this
>
> stackoverflow entry and someone offhandedly mentioned that
> ExtractingRequestHandler might be separated in the future anyway. Is there
> a public roadmap for the project, or does one have to keep up with the
> developer's mailing list and hunt through JIRA entries to keep up with the
> pulse of the project?
>
> Thanks,
> Justin
>

-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk
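
For anyone wanting to try the external-Tika approach described in the reply above, here is a minimal sketch in Python of what "extract outside Solr, then submit the text" could look like. Everything specific in it is an assumption and not something from the thread: it talks to a stock Apache Tika Server on port 9998 (not the Dropwizard wrapper linked above, whose endpoints may differ), the Solr collection is called "docs", and the field names "id", "content" and "pages_i" are hypothetical and would need to match your schema. The point it tries to show is that a Tika hang or crash turns into a timeout or connection error in the client, so the file can be skipped without Solr being affected.

```python
import http.client
import json
import urllib.request

# Assumptions (not from the thread): a stock Apache Tika Server on port 9998,
# a Solr collection named "docs", and fields "id" and "content" in its schema.
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commit=true"


def extract_text(pdf_bytes, timeout_s=60):
    """Ask the external Tika server for plain text; return None on failure.

    A hang or crash in Tika shows up here as a timeout or connection error,
    so the caller can skip the file and Solr is never involved.
    """
    req = urllib.request.Request(
        TIKA_URL,
        data=pdf_bytes,
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None


def build_solr_request(doc_id, text, extra_fields=None):
    """Build (but do not send) a JSON update request for one document."""
    doc = {"id": doc_id, "content": text}
    # extra_fields is where additional data pulled from the PDF would go.
    doc.update(extra_fields or {})
    body = json.dumps([doc]).encode("utf-8")
    return urllib.request.Request(
        SOLR_UPDATE_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_pdf(doc_id, pdf_bytes, extra_fields=None):
    """Extract with Tika, then post to Solr; return False if the file is skipped."""
    text = extract_text(pdf_bytes)
    if text is None:
        return False
    req = build_solr_request(doc_id, text, extra_fields)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.status == 200
```

A call might look like index_pdf("report-1", open("report.pdf", "rb").read(), {"pages_i": 12}). Keeping the request-building step separate from the network calls is just a convenience so the payload can be inspected or tested without either server running.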