lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Regarding pdf indexing issue
Date Wed, 11 Jul 2018 15:07:57 GMT
Solr will not do this automatically. The Extracting Request Handler
simply indexes the entire contents of the document, without regard to
things like paragraphs. The same goes for HTML. Getting that structure
requires working with Tika directly and using all the bells and
whistles there.

I'd recommend two things:

1> Take the PDF parsing offline, i.e. do it in a separate client. There
are many reasons for this; in particular, it lets you attempt to do what
you're asking. See: https://lucidworks.com/2012/02/14/indexing-with-solrj/

2> Talk to the Tika folks about the best way to make Tika return the
information in a form that you can index to get what you'd like.
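
As a rough illustration of point 1, here is a minimal Python sketch of
such a separate client. It assumes the plain text has already been
extracted from the PDF with Tika, and it guesses that a heading is a
short single-line block with no closing punctuation; real PDFs will
likely need a better rule. The helper names (split_sections, build_docs,
post_docs) and the field names title_t and body_t are hypothetical; the
field names rely on the *_t dynamic field from Solr's default configset
being present in your schema.

```python
import json
import re
import urllib.request


def split_sections(text):
    """Split extracted plain text into (heading, body) pairs.

    Assumption: a heading is a single-line block of at most 80 characters
    that does not end in sentence punctuation. This is a heuristic, not
    something Tika or Solr provides.
    """
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    sections = []
    heading, body = None, []
    for block in blocks:
        is_heading = (
            "\n" not in block
            and len(block) <= 80
            and not block.endswith((".", "?", "!", ":", ";", ","))
        )
        if is_heading:
            # Close the previous section before starting a new one.
            if heading is not None or body:
                sections.append((heading or "", " ".join(body)))
            heading, body = block, []
        else:
            # Collapse line breaks inside a paragraph into single spaces.
            body.append(" ".join(block.split()))
    if heading is not None or body:
        sections.append((heading or "", " ".join(body)))
    return sections


def build_docs(file_id, sections):
    """Build one Solr document per section (hypothetical field names)."""
    return [
        {"id": f"{file_id}#{i}", "title_t": heading, "body_t": body}
        for i, (heading, body) in enumerate(sections)
    ]


def post_docs(docs, update_url):
    """POST the documents as a JSON array to a Solr update URL."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

With one document per section, a search on title_t can return the
matching section's body_t rather than the whole PDF. Against the
collection from the original question, the update URL would presumably
be http://localhost:8983/solr/gettingstarted/update?commit=true.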

Best,
Erick

On Wed, Jul 11, 2018 at 6:35 AM, Rahul Prasad Dwivedi
<rdwivedi@bestpeers.com> wrote:
> Hello Team,
>
> I am using Solr to index and search PDF documents.
>
> I have gone through the documentation on your website and installed Solr,
> but I am unable to index and search the document.
>
> For example, suppose we have a PDF file that has a number of paragraphs,
> each with a separate heading.
>
> If I search for a title in the indexed PDF, the result should contain
> the paragraph that the title belongs to.
>
> I am unable to perform this task.
>
> I ran the command below to upload the PDF:
>
> *bin/post -c gettingstarted pdf-sample.pdf*
>
> and to search I am running the command:
>
> *curl http://localhost:8983/solr/gettingstarted/select?q='*'*
>
> Please suggest anything that might help, and let me know if I am missing something.
>
> Thanks,
>
> Rahul
