Subject: Re: Get content in response from ExtractingRequestHandler
From: Erick Erickson
To: solr-user@lucene.apache.org
Date: Fri, 10 Jul 2015 09:47:45 -0700

In a word, no. If you don't store the data, it is completely gone with no chance of retrieval.

There are a couple of things to think about, though:

1> The original doc must exist somewhere. Store some kind of URI in Solr that you can use to retrieve the original doc on demand.

2> Go ahead and store the data. Disk space is cheap, and the stored data goes in special files (*.fdt) that have very little impact on either search speed or memory requirements. And the memory requirements can be controlled somewhat with the documentCache, assuming you don't have gigantic docs.

This kind of sidesteps the question of re-extracting the document in Solr on demand and returning the text (which I think is what you're asking). I would definitely avoid doing this even if I knew how.
The problem here is that you're making Solr do quite intensive work (Tika extraction) while at the same time serving queries, which has negative performance implications. If it turns out that you have to do this, consider running Tika in the app layer and doing the extraction on demand there. It's not very hard, see: https://lucidworks.com/blog/indexing-with-solrj/ and ignore the db bits.

Best,
Erick

On Thu, Jul 9, 2015 at 7:53 PM, trung.ht wrote:
> Hi everyone,
>
> I use Solr to index and search office files (docx, pptx, ...). To reduce
> the size of the Solr index, I do not store the content of the files in Solr;
> however, my customer now wants to preview the content of a file.
>
> I have read the documentation for ExtractingRequestHandler, but it seems that
> to return content in the response from Solr, the only option is to
> set extractOnly=true, but in that case Solr would not index the file.
>
> My question is: is there any way for Solr to extract the content with Tika,
> index the content (without storing it) and then give me the content in the
> response?
>
> Thanks in advance, and sorry if my explanation is confusing.
>
> Trung.
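
For illustration only, here is a rough sketch of how the app-layer approach from the reply could be combined with option 1 (keeping a URI in Solr). It is not taken from the linked article and makes several assumptions: the field names `source_uri` and `content` are hypothetical, `content` is assumed to be declared indexed but not stored in the schema, and the `extract` and `fetch` callables are placeholders for whatever the application really uses (for example a Tika-based extractor and a file-store lookup).

```python
import json
from typing import Callable, Dict, List


def build_index_doc(doc_id: str, source_uri: str, raw: bytes,
                    extract: Callable[[bytes], str]) -> Dict[str, str]:
    """Extract text in the app layer and build one Solr document.

    "source_uri" and "content" are hypothetical field names; "content"
    is assumed to be indexed="true" stored="false" in the schema.
    """
    return {"id": doc_id, "source_uri": source_uri, "content": extract(raw)}


def to_update_payload(docs: List[Dict[str, str]]) -> str:
    """Serialize documents as a JSON array of docs for Solr's JSON update handler."""
    return json.dumps(docs)


def preview(source_uri: str, fetch: Callable[[str], bytes],
            extract: Callable[[bytes], str], max_chars: int = 200) -> str:
    """Re-fetch the original by its URI and extract a preview on demand.

    This runs in the application, so Solr does no extraction work
    while it is serving queries.
    """
    return extract(fetch(source_uri))[:max_chars]
```

At index time the application would post the output of `to_update_payload` to the collection's update handler; at preview time it would read `source_uri` from the search result and call `preview`. Because the extractor is passed in as a callable, the same code path can be exercised with a trivial stand-in before a real extractor is wired up.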