Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: solr-user@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CA+BkBY4SZmvGsrSqnMqmtUFhCxpzrjM9XUZnOQDZRDLg=yB4NA@mail.gmail.com>
References: 
 <CA+BkBY4SZmvGsrSqnMqmtUFhCxpzrjM9XUZnOQDZRDLg=yB4NA@mail.gmail.com>
Date: Fri, 10 Jul 2015 09:47:45 -0700
Message-ID: 
 <CAN4YXvdyCQsGLix0aTU0-ctuhPJLctg8EjVtt5hysbuVX2NQvQ@mail.gmail.com>
Subject: Re: Get content in response from ExtractingRequestHandler
From: Erick Erickson <erickerickson@gmail.com>
To: solr-user@lucene.apache.org
Content-Type: text/plain; charset=UTF-8

In a word, no. If you don't store the data it is completely gone
with no chance of retrieval.

There are a couple of things to think about, though:

1> The original doc must exist somewhere. Store some kind
of URI in Solr that you can use to retrieve the original doc
on demand.

2> Go ahead and store the data. Disk space is cheap, and the
stored data goes in special files (*.fdt) that have very little impact
on either search speed or memory requirements. And the memory
requirements can be controlled somewhat with the documentCache
assuming you don't have gigantic docs.
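
As a rough sketch of option 1, the request to the extracting handler
can carry the location of the original file as a literal field, so the
app can fetch the original later. The core name "docs", the field name
source_uri_s, and the helper build_extract_url are assumptions made up
for illustration; literal.* and commit are regular parameters of the
ExtractingRequestHandler.

```python
from urllib.parse import urlencode

# Hypothetical Solr location and core name; adjust for your install.
SOLR_BASE = "http://localhost:8983/solr/docs"


def build_extract_url(doc_id, source_uri, base=SOLR_BASE):
    """Build an /update/extract URL that records where the original file lives."""
    params = {
        "literal.id": doc_id,
        # source_uri_s is an assumed stored string field in the schema.
        "literal.source_uri_s": source_uri,
        "commit": "true",
    }
    return base + "/update/extract?" + urlencode(params)


# The file itself would be POSTed to this URL as the request body.
url = build_extract_url("report-42", "file:///srv/docs/report-42.docx")
print(url)
```

Only the small URI field needs to be stored here; the extracted content
field can stay indexed but not stored.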

This kind of sidesteps the question of re-extracting the document
in Solr on demand and returning the text (which I think is what
you're asking). I would definitely avoid doing this even if I knew how.
The problem here is that you're making Solr do quite intensive
work (Tika extraction) while at the same time serving queries,
which has negative performance implications. If it turns out that you
have to do this, consider running Tika in the app layer and
doing the extraction on demand there. It's not very hard, see:
https://lucidworks.com/blog/indexing-with-solrj/
and ignore the db bits.
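
For illustration only, here is a minimal sketch of on-demand extraction
in the app layer. It is written in Python and talks to a standalone
Tika Server rather than using SolrJ and Tika in-process as the linked
post does. It assumes a Tika Server is already running on
localhost:9998 (its default port) and uses that server's PUT /tika
endpoint with "Accept: text/plain". The helper names are hypothetical.

```python
import urllib.request

# Assumed address of a separately running Tika Server (9998 is its default port).
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data, tika_url=TIKA_URL):
    """Prepare a PUT request asking Tika Server for the plain text of a document."""
    return urllib.request.Request(
        tika_url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path, tika_url=TIKA_URL, timeout=60):
    """Read the original file from disk and return the text Tika extracts from it."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read(), tika_url)
    # The network call happens here, outside Solr, so queries are not affected.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Assumes the server returns its plain-text output as UTF-8.
        return response.read().decode("utf-8")
```

The app would call something like extract_text("/srv/docs/report-42.docx")
only when a user asks for a preview, using the path or URI it stored in
Solr for that document.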

Best,
Erick

On Thu, Jul 9, 2015 at 7:53 PM, trung.ht <trung.ht@anlab.vn> wrote:
> Hi everyone,
>
> I use solr to index and search in office files (docx, pptx, ...). To reduce
> the size of solr index, I do not store the content of the file on solr,
> however now my customer wants to preview the content of the file.
>
> I have read the document of ExtractingRequestHandler, but it seems that to
> return content in the response from solr, the only option is to
> set extractOnly=true, but in that case, solr would not index the file.
>
> My question is: is there any way for solr to extract the content from tika,
> index the content (without storing it) and then give me the content in the
> response?
>
> Thanks in advance and sorry if my explanation is confusing.
>
> Trung.