lucene-general mailing list archives

From Pierluca Sangiorgi <pierluca.sangio...@gmail.com>
Subject Re: is solr the right choice for my pdf indexing purpose?
Date Tue, 12 Jun 2012 08:38:40 GMT
> To achieve this you will first want to update the schema.xml [1] to model
> your target fields - i.e. the ones you mention above. You will need to
> parse the PDF documents using something like Apache PDFBox[2] - good for if
> the documents are Acrobat Forms as you can get the form field contents - or
> Apache Tika[3] - if you want it as a String -  to get the contents. This
> will allow you to extract the field values from content using pattern
> matching. The fields can then be added to a document and posted to Solr
> using Solrj.

Thanks for the answer.
I'm currently using the Solr Cell Update Request Handler via
ContentStreamUpdateRequest in Solrj, so Tika is used
"automatically", but it extracts the content directly into a field.
So I have to use Tika standalone to capture the content as a string,
build my custom document (XML or JSON), and then use the corresponding
Update Request Handler, right?
Any suggestions for a pattern matching / information retrieval /
information extraction module to create my custom document from the string
extracted by Tika?
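
To make the question concrete, here is a minimal sketch of the pattern
matching step using only java.util.regex. It assumes the text has already
been extracted as a String (for example by Tika used standalone) and that
the values sit on "Label: value" lines. The class name FieldExtractor, the
field names (title, author, date) and the line layout are all hypothetical,
chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls named fields out of plain text with regular expressions.
final class FieldExtractor {

    // Assumed layout: one "Label: value" pair per line. Adjust to the real PDFs.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        // (?im) = case-insensitive, and ^/$ match at line boundaries.
        PATTERNS.put("title", Pattern.compile("(?im)^Title:[ \\t]*(.+?)[ \\t]*$"));
        PATTERNS.put("author", Pattern.compile("(?im)^Author:[ \\t]*(.+?)[ \\t]*$"));
        PATTERNS.put("date", Pattern.compile("(?im)^Date:[ \\t]*(\\d{4}-\\d{2}-\\d{2})[ \\t]*$"));
    }

    // Returns field name -> first matching value; fields that do not match are omitted.
    static Map<String, String> extract(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            Matcher matcher = entry.getValue().matcher(text);
            if (matcher.find()) {
                fields.put(entry.getKey(), matcher.group(1));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String text = "Title: Annual Report\nAuthor: Jane Doe\nDate: 2012-06-12\nBody text...";
        System.out.println(extract(text));
    }
}
```

Each entry of the returned map could then be copied into a Solrj
SolrInputDocument with addField(name, value) and sent to the regular update
handler, provided schema.xml defines fields with matching names.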

thanks
Luca
