lucene-general mailing list archives

From Dave Meikle
Subject Re: is solr the right choice for my pdf indexing purpose?
Date Mon, 11 Jun 2012 21:33:31 GMT

On 11 June 2012 19:42, Pierluca Sangiorgi wrote:

> As an example: I have a PDF document that contains an invoice. I need to
> extract and index information about the recipient, price, sold
> items, item descriptions, and so on.
> Is Solr the right choice for this purpose, or do I need to use another
> framework in addition before posting the document to Solr?

Solr is a good choice, especially if you want to start to leverage the
power of search, but you will need to do some work beforehand to split the
information into separate fields so that you can make the best use of it
later.

To achieve this, first update schema.xml [1] to model your target fields,
i.e. the ones you mention above. Then parse the PDF documents to get their
contents, using something like Apache PDFBox [2], which is a good fit if the
documents are Acrobat Forms because you can read the form field contents
directly, or Apache Tika [3], if you just want the text as a String. With
the text in hand, you can extract the field values from it using pattern
matching. The fields can then be added to a document and posted to Solr
using Solrj.
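
As a rough illustration of the pattern-matching step, here is a minimal
sketch that uses only the JDK. It assumes the PDF text has already been
extracted into a String (for example by Tika or PDFBox) and that the invoice
uses labelled lines such as "Recipient: ..." and "Total: ...". The class
name, the field names, and the label layout are all hypothetical, so the
patterns would need adapting to your real invoices.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls invoice fields out of text that has
// already been extracted from a PDF.
class InvoiceFieldExtractor {

    // Assumed invoice layout: one "Label: value" pair per line.
    // The keys are illustrative and should match the fields in schema.xml.
    private static final Map<String, Pattern> FIELD_PATTERNS = new LinkedHashMap<>();
    static {
        FIELD_PATTERNS.put("recipient",
                Pattern.compile("(?m)^Recipient:[ \\t]*(.+)$"));
        FIELD_PATTERNS.put("price",
                Pattern.compile("(?m)^Total:[ \\t]*([0-9]+(?:\\.[0-9]{2})?)[ \\t]*$"));
    }

    // Returns the first match for each field; fields that are not found are omitted.
    static Map<String, String> extract(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> entry : FIELD_PATTERNS.entrySet()) {
            Matcher matcher = entry.getValue().matcher(text);
            if (matcher.find()) {
                fields.put(entry.getKey(), matcher.group(1).trim());
            }
        }
        return fields;
    }
}
```

Each entry in the returned map could then be copied into a Solrj document
(for example with SolrInputDocument.addField) before posting it to Solr.
Multi-valued data such as sold items would need a pattern that is matched
repeatedly rather than once.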


