lucene-dev mailing list archives

From "Aditya" <aditya.kulka...@gmail.com>
Subject RE: Problem with PDF extraction
Date Tue, 27 Apr 2010 07:34:29 GMT
I too faced a similar problem.

 

May I suggest trying pdftotext? I have observed that Google Desktop uses
it.

 

http://www.foolabs.com/xpdf/download.html

 

AFAIK it is licensed under the GNU General Public License.
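
For anyone who wants to try this outside Solr, here is a minimal sketch of calling pdftotext from a script. It assumes the pdftotext binary is installed and on the PATH; the helper names (build_pdftotext_command, extract_text) are made up for illustration. Passing "-" as the output file asks pdftotext to write to stdout.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=False):
    """Build the argument list for pdftotext, writing the text to stdout."""
    # -enc UTF-8 requests UTF-8 output so the bytes can be decoded reliably.
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout keeps the physical page layout as far as possible.
        cmd.append("-layout")
    # A trailing "-" as the text file means "write to stdout".
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, keep_layout=False):
    """Run pdftotext on pdf_path and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

The extracted text could then be sent to Solr as an ordinary field value instead of relying on the extract handler.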

 

Best Regards,

Aditya

 

From: Grant Ingersoll [mailto:gsiasf@gmail.com] On Behalf Of Grant Ingersoll
Sent: Tuesday, April 27, 2010 3:38 AM
To: dev@lucene.apache.org
Subject: Re: Problem with PDF extraction

 

Hi Marc,

 

Can you ask on solr-user@lucene.apache.org and give more information about
any errors that occur in your Solr log, plus the setup of the
ExtractingRequestHandler and the related schema?

 

-Grant

 

On Apr 26, 2010, at 5:04 PM, Marc Ghorayeb wrote:





Hello,

 

I have been having problems with PDFs randomly crashing the Solr 1.4 server,
so I tried out the SVN version, which contains a newer Tika library. On its
own, the Tika app correctly extracts the content of my PDF. However, inside
Solr, when I upload a PDF file to my update/extract handler, it does not
seem to parse it (a blank file is output). The literal values do get
indexed, though. I have had no luck in getting the Tika parsing to work. For
some reason, I get the same result whether or not tika-parsers-0.7.jar
is present in the lib folder, whereas if the tika-core-0.7 jar is absent, it
just crashes (which seems normal to me).
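
A rough sketch of one way to check what the handler extracts, without indexing anything: post the PDF to update/extract with extractOnly=true and inspect the response. The base URL, the file path, and the helper names (build_extract_url, post_pdf) are assumptions for illustration, and sending the PDF as a raw request body is assumed to be accepted by the handler as configured.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(base_url, doc_id=None, extract_only=False):
    """Build the URL for Solr's update/extract handler."""
    params = {}
    if extract_only:
        # Return the extracted content in the response instead of indexing it.
        params["extractOnly"] = "true"
    if doc_id is not None:
        # literal.<field> sets a field value directly on the document.
        params["literal.id"] = doc_id
        params["commit"] = "true"
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


def post_pdf(url, pdf_path):
    """POST the PDF as the raw request body and return the response text."""
    with open(pdf_path, "rb") as handle:
        body = handle.read()
    request = Request(url, data=body,
                      headers={"Content-Type": "application/pdf"})
    with urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Hypothetical local Solr instance and file name.
    url = build_extract_url("http://localhost:8983/solr", extract_only=True)
    print(post_pdf(url, "sample.pdf"))
```

If the response contains no text either, the problem is likely in the Tika setup inside Solr rather than in the schema.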

 

I don't seem to be the only one having this problem (on the user mailing
list, that is). Can anyone help me out? It would be greatly appreciated.

 

I use a fairly classic schema and the default request handlers.

 

Marc Ghorayeb.

 


 

--------------------------

Grant Ingersoll

http://www.lucidimagination.com/

 

Search the Lucene ecosystem using Solr/Lucene:
http://www.lucidimagination.com/search

 

