Subject: Re: Solr related questions
From: Shawn Heisey
To: solr-user@lucene.apache.org
Date: Mon, 16 Oct 2017 16:24:05 -0600

On 10/13/2017 5:50 AM, startrekfan wrote:
> Thank you for your answer.
>
> To 3.)
> The file is on server A, my program is on server B and solr is on server
> C. If I use a normal http(rest) post, my program has to fetch the file
> content from server A to Server B and then post it from server B to server
> C as there is no open connection between A and C. So the file has to be
> transmitted two times.
> Is there a way to tell solr to read the file _directly_ from Server A (e.g.
> via SMB)

What exactly is in a "file" in this situation, and what does your service
do with that file in order to decide what information gets sent to Solr?
This information will be vital to figuring out whether you can do what
you're wanting to do.

If your service does not have business-specific logic, and the files on
your server are more generic, Solr does have the ability to "directly"
index rich text files like PDF, Word, etc. Typically the file is still
sent to Solr even with that functionality. I think there are ways to have
it fetch the file, but I have no idea what kind of fetching is supported.

There is one major issue with using that ability, called the Extracting
Request Handler. That functionality uses another piece of Apache software
called Tika. Because the exact structure of the documents that Tika
supports can change subtly and not all of those formats are fully
documented, Tika has a habit of exploding when it encounters something
that its authors have never seen before. If Tika is running inside Solr
when it explodes, that explosion can take down the entire Solr process.
For that reason, we do not actually recommend running that functionality
inside Solr, but rather in an external program that extracts information
and sends it to Solr.

The Tika authors do take such explosions seriously, and they do try to fix
those problems when they are encountered. It is impossible for the Tika
project to prevent such problems from occurring, because there will always
be documents produced that contain data formats that they've never seen
before.

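To make the "external program" idea a bit more concrete, here is a rough
sketch in Python of how such a program might be laid out. It is only an
illustration under some assumptions: the extractor is a stand-in that reads
the file's bytes as UTF-8 (a real one would call Tika or whatever you
already use), and the collection name "files" and the field "content_txt"
are hypothetical. The extraction runs in a child process, so if it crashes
or hangs, only the child dies. The result is then wrapped in a JSON update
request for Solr's /update handler.

```python
import json
import subprocess
import sys
import urllib.request


def extract_in_child(path, timeout=60):
    """Run a stand-in extractor in a separate process.

    Returns the extracted text, or None if the child crashed or timed out.
    A crash in the child cannot take down the calling program (or Solr).
    """
    # Placeholder extractor: copies the file's bytes to stdout.
    # A real program would invoke its actual extraction logic here.
    child_code = (
        "import sys; "
        "sys.stdout.buffer.write(open(sys.argv[1], 'rb').read())"
    )
    try:
        result = subprocess.run(
            [sys.executable, "-c", child_code, path],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    return result.stdout.decode("utf-8", errors="replace")


def build_update_request(solr_base, collection, docs):
    """Build (but do not send) a JSON update request for Solr."""
    url = f"{solr_base.rstrip('/')}/{collection}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example use (hypothetical names; sending requires a running Solr):
#   text = extract_in_child("/path/to/file.txt")
#   if text is not None:
#       req = build_update_request("http://localhost:8983/solr", "files",
#                                  [{"id": "file-1", "content_txt": text}])
#       urllib.request.urlopen(req)
```

A file that makes the extractor blow up just yields None here, so the
program can log it and move on to the next file while Solr itself never
sees the problem document.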
Generally speaking, if you already have a well-tested way of extracting
information from files and sending it to Solr, the recommendation is that
you stick with that software, rather than try to get Solr to directly
index your files.

Thanks,
Shawn