lucene-solr-user mailing list archives

From Sandhya Agarwal <sagar...@opentext.com>
Subject RE: Problem with pdf, upgrading Cell
Date Wed, 05 May 2010 04:36:01 GMT
Praveen,



I only copied the highlighted jars. I am not sure whether we need the other jars. Also, I
copied the jars directly into solr\WEB-INF\lib, like you did.



Thanks,

Sandhya



-----Original Message-----
From: Praveen Agrawal [mailto:pkalwar@gmail.com]
Sent: Tuesday, May 04, 2010 8:10 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell



Hi Sandhya..

I must be missing something. I copied all the dependency jars to both the
contrib/extraction/lib and WEB-INF/lib folders. Here is the list of jars
copied:



asm-3.1.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
commons-compress-1.0.jar
commons-logging-1.1.1.jar
dom4j-1.6.1.jar
fontbox-1.1.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
hamcrest-core-1.1.jar
jempbox-1.1.0.jar
junit-3.8.1.jar
log4j-1.2.14.jar
metadata-extractor-2.4.0-beta-1.jar
mockito-core-1.7.jar
nekohtml-1.9.9.jar
objenesis-1.0.jar
ooxml-schemas-1.0.jar
pdfbox-1.1.0.jar
poi-3.6.jar
poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar
tagsoup-1.2.jar
tika-core-0.7.jar
tika-parsers-0.7.jar
xml-apis-1.0.b2.jar
xmlbeans-2.3.0.jar



Still the same result for me.
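
One way to confirm that both folders really hold the same set of jars is to diff their contents. The following is only a minimal Python sketch: the function names are hypothetical, and the folder paths in the usage note are assumptions about the local layout, not something taken from this thread.

```python
from pathlib import Path


def jar_names(lib_dir):
    """Return the set of .jar file names found directly inside lib_dir."""
    return {p.name for p in Path(lib_dir).glob("*.jar")}


def compare_lib_dirs(dir_a, dir_b):
    """Return (only_in_a, only_in_b): sorted jar names missing from the other folder."""
    a, b = jar_names(dir_a), jar_names(dir_b)
    return sorted(a - b), sorted(b - a)
```

For example, compare_lib_dirs("contrib/extraction/lib", "webapps/solr/WEB-INF/lib") would return two lists. Since WEB-INF/lib also holds Solr's own jars, the second list is expected to be non-empty; the first list is the interesting one, because anything in it was not copied into the web app.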



Marc,

I'm on Windows, and I copied the above jars directly into the already extracted
folder webapps/solr/WEB-INF/lib, in addition to what was already there. I
didn't explicitly un-jar and re-jar the solr.war; do you think that could be
the issue? I think Tomcat extracts the war and uses the folder in webapps
(I didn't put the war file in webapps; instead I put the extracted solr
folder there directly).
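
Because the new jars were added alongside the ones already in WEB-INF/lib, one thing worth ruling out is two versions of the same library ending up in that folder. Below is a minimal Python sketch of such a check. It assumes jars are named <artifact>-<version>.jar with the version starting with a digit; the function name is hypothetical, and the older version number mentioned in the usage note is invented for illustration.

```python
import re
from collections import defaultdict

# Assumed naming convention: <artifact>-<version>.jar, version starting with a digit.
JAR_PATTERN = re.compile(r"(?P<name>.+?)-(?P<version>\d[\w.\-]*)\.jar")


def find_version_conflicts(jar_file_names):
    """Map each artifact name that appears with more than one version to its sorted versions."""
    versions = defaultdict(set)
    for file_name in jar_file_names:
        match = JAR_PATTERN.fullmatch(file_name)
        if match:  # files that do not follow the assumed pattern are skipped
            versions[match.group("name")].add(match.group("version"))
    return {name: sorted(found) for name, found in versions.items() if len(found) > 1}
```

Passing it the result of os.listdir("webapps/solr/WEB-INF/lib") would report, for instance, a pdfbox entry if both pdfbox-1.1.0.jar and some older pdfbox-0.7.3.jar were present. An empty result only means no conflict was visible from the file names.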



If it has worked for you guys, especially with my two PDFs, then that's
really great. Please let me know your exact procedure, including everything
you copied and where, or whether you see that I missed something obvious.



Thanks,

Praveen





On Tue, May 4, 2010 at 5:28 PM, Sandhya Agarwal <sagarwal@opentext.com> wrote:



> Both the files work for me, Praveen.

>

> Thanks,

> Sandhya

>

> From: Praveen Agrawal [mailto:pkalwar@gmail.com]

> Sent: Tuesday, May 04, 2010 5:22 PM

> To: solr-user@lucene.apache.org

> Subject: Re: Problem with pdf, upgrading Cell

>

> another one here..

> On Tue, May 4, 2010 at 5:20 PM, Praveen Agrawal <pkalwar@gmail.com> wrote:

> It bounced because of attachment's size..

> attaching one by one now..

>

>

> On Tue, May 4, 2010 at 5:17 PM, Praveen Agrawal <pkalwar@gmail.com> wrote:

> I noticed the following pattern between the PDF producer/creator and content
> extraction. Not sure if it is helpful (as Grant said earlier, PDFs are notorious):

>

> Producer: Bullzip PDF Printer / www.bullzip.com / Freeware Edition (not registered)
> Creator: PScript5.dll Version 5.2.2
> Extraction: no content  --  "installing Solr in Tomcat.pdf" (attached - I generated it)

> ---------------------

>

> Producer: Acrobat Distiller 7.0.5 (Windows)

> creator: PScript5.dll Version 5.2.2

> Extraction: One line content

> ----------------------

>

> Producer: Acrobat Distiller 8.1.0 (Windows)

> creator: Acrobat PDFMaker 8.1 for Word

> Extraction:  one line of content    (Free_Two_way_Radio_Guide.pdf -

> attached - was available freely on their website)

> -------------------------

>

> Producer: FOP 0.20.5

> Extraction: full content    "/docs/features.pdf | linkmap.pdf" etc

> --------------

> Thanks.

> Praveen

>

>

> On Tue, May 4, 2010 at 5:05 PM, Praveen Agrawal <pkalwar@gmail.com> wrote:

> Yes Sandhya,

> I copied the new poi/jempbox/pdfbox/fontbox etc. jars too. I believe this is
> what you were asking.

> Thanks.

>

>

> On Tue, May 4, 2010 at 5:01 PM, Sandhya Agarwal <sagarwal@opentext.com> wrote:

> Praveen,

>

> Along with the tika core and parser jars, did you run "mvn
> dependency:copy-dependencies" to generate all the dependencies too?

>

> Thanks,

> Sandhya

>

> -----Original Message-----

> From: Praveen Agrawal [mailto:pkalwar@gmail.com]
> Sent: Tuesday, May 04, 2010 4:52 PM
> To: solr-user@lucene.apache.org

> Subject: Re: Problem with pdf, upgrading Cell

> I seem to have mixed results.
>
> Here is what I did:
> copied the new Tika/poi/jempbox/pdfbox/fontbox/log4j etc. jars into
> contrib/extraction/lib (of course, removing the old ones), as well as into
> WEB-INF/lib of the Solr web app in Tomcat.

>

> Now it extracts content from some PDFs, but from others it extracts either
> no content or only a single line. For example, "/docs/Installing Solr in Tomcat.pdf"
> still shows no content. I have two other PDFs for which it extracts only one
> line of content.

>

> Also, I am now getting a single value in the 'title' field for some PDFs and two
> values for others. Where it can extract the full content, it shows the title
> that I gave as a literal when submitting the PDF. For a PDF where no content was
> extracted, it shows one empty title plus mine. For a PDF where it extracted
> only one line of content, it shows that line as a title as well as mine.
> The 'title' field is defined as multivalued in the schema.

>

> Any idea what's going on? Or am I missing something?

>

>

>

> On Tue, May 4, 2010 at 4:13 PM, Marc Ghorayeb <dekay999@hotmail.com> wrote:

>

> >

> > Hey,

> > I got it to work. I just redid my steps; I had forgotten several libraries
> > that were imported through the xml. PDF extraction seems to work once again,
> > I have yet to find one that raises an exception!

> >

> > Thanks for the investigation, at least we now have a fix :)

> > Marc


> >

>

>

>

>

>
