lucene-solr-user mailing list archives

From Praveen Agrawal <pkal...@gmail.com>
Subject Re: Problem with pdf, upgrading Cell
Date Tue, 04 May 2010 11:50:40 GMT
It bounced because of the attachments' size.
Attaching them one by one now.


On Tue, May 4, 2010 at 5:17 PM, Praveen Agrawal <pkalwar@gmail.com> wrote:

> I noticed the following relationship between the PDF producer/creator and
> content extraction. Not sure if it is helpful (as Grant said earlier, PDFs
> are notorious):
>
> Producer: Bullzip PDF Printer / www.bullzip.com / Freeware Edition (not
> registered)
> Creator: PScript5.dll Version 5.2.2
> Extraction: no content  --  "installing Solr in Tomcat.pdf" (attached; I
> generated it)
> ---------------------
>
> Producer: Acrobat Distiller 7.0.5 (Windows)
> Creator: PScript5.dll Version 5.2.2
> Extraction: one line of content
> ----------------------
>
> Producer: Acrobat Distiller 8.1.0 (Windows)
> Creator: Acrobat PDFMaker 8.1 for Word
> Extraction: one line of content    (Free_Two_way_Radio_Guide.pdf - attached;
> it was freely available on their website)
> -------------------------
>
> Producer: FOP 0.20.5
> Extraction: full content    "/docs/features.pdf | linkmap.pdf" etc
> --------------
> Thanks.
> Praveen
>
>
>
> On Tue, May 4, 2010 at 5:05 PM, Praveen Agrawal <pkalwar@gmail.com> wrote:
>
>> Yes Sandhya,
>> I copied the new poi/jempbox/pdfbox/fontbox etc. jars too. I believe this is
>> what you were asking.
>> Thanks.
>>
>>
>>
>> On Tue, May 4, 2010 at 5:01 PM, Sandhya Agarwal <sagarwal@opentext.com> wrote:
>>
>>> Praveen,
>>>
>>> Along with the tika core and parser jars, did you run "mvn
>>> dependency:copy-dependencies" to generate all the dependencies too?
>>>
>>> Thanks,
>>> Sandhya
>>>
>>> -----Original Message-----
>>> From: Praveen Agrawal [mailto:pkalwar@gmail.com]
>>> Sent: Tuesday, May 04, 2010 4:52 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Problem with pdf, upgrading Cell
>>>
>>> I seem to have mixed results.
>>>
>>> Here is what I did:
>>> copied the new Tika/poi/jempbox/pdfbox/fontbox/log4j etc. jars into
>>> contrib/extraction/lib (of course, removing the old ones), as well as into
>>> WEB-INF/lib of the Solr web app in Tomcat.
>>>
>>> Now it extracts content from some PDFs, but from others it extracts either
>>> no content or only one line. For example, "/docs/Installing Solr in
>>> Tomcat.pdf" still shows no content. I have two other PDFs for which it
>>> extracts only one line of content.
>>>
>>> Also, the 'title' field now has a single value for some PDFs and two values
>>> for others. Where it can extract the full content, it shows the title I
>>> gave as a literal when submitting the PDF. For the PDF where no content was
>>> extracted, it shows one empty title plus mine. For PDFs where it extracted
>>> only one line of content, it shows that line as a title, plus mine.
>>> The 'title' field is defined as multivalued in the schema.
>>>
>>> Any idea what's going on? Or am I missing something?
>>>
>>>
>>>
>>> On Tue, May 4, 2010 at 4:13 PM, Marc Ghorayeb <dekay999@hotmail.com> wrote:
>>>
>>> >
>>> > Hey,
>>> > I got it to work. I just redid my steps; I had forgotten several
>>> > libraries that were imported through the xml. PDF extraction seems to
>>> > work once again, and I have yet to find one that raises an exception!
>>> >
>>> > Thanks for the investigation, at least we now have a fix :)
>>> > Marc
>>> >
>>>
>>
>>
>
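
The thread describes copying the same set of Tika/pdfbox/fontbox jars into two places (contrib/extraction/lib and the webapp's WEB-INF/lib) and removing the old ones, and Marc's fix came down to libraries that had been forgotten. A check along the following lines might help spot leftovers or mismatches between the two directories. It is only a sketch: the function names are made up for illustration, and the regex assumes jars are named `<artifact>-<version>.jar` with the version starting with a digit, which is a common convention and not something Solr or Tika guarantees.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumption: jars follow <artifact>-<version>.jar, version starting with a digit.
JAR_RE = re.compile(r"^(?P<name>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def jar_versions(lib_dir):
    """Map artifact name -> set of versions found in lib_dir."""
    found = defaultdict(set)
    for jar in Path(lib_dir).glob("*.jar"):
        match = JAR_RE.match(jar.name)
        if match:
            found[match.group("name")].add(match.group("version"))
        else:
            # Jar without a recognizable version suffix.
            found[jar.stem].add("")
    return dict(found)


def compare_libs(dir_a, dir_b):
    """Report duplicate versions inside each directory and differences between them."""
    a, b = jar_versions(dir_a), jar_versions(dir_b)
    return {
        # More than one version of the same artifact: an old jar was probably left behind.
        "duplicates_a": {n: sorted(v) for n, v in a.items() if len(v) > 1},
        "duplicates_b": {n: sorted(v) for n, v in b.items() if len(v) > 1},
        # Artifacts present in only one of the two directories.
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
        # Artifacts present in both, but with different version sets.
        "version_mismatch": sorted(n for n in set(a) & set(b) if a[n] != b[n]),
    }
```

Calling `compare_libs("contrib/extraction/lib", "<tomcat>/webapps/solr/WEB-INF/lib")` (both paths are placeholders for your own layout) returns a dict; any non-empty entry is worth a look before blaming the PDFs themselves.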

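For the "no content" and "one line of content" cases, and for the doubled 'title' values, it may help to look at what the extraction handler returns before anything is indexed. Solr Cell accepts an `extractOnly=true` parameter for that, and `literal.<field>` parameters for values supplied by the client. The sketch below only builds such a request URL; the helper name and the base URL are assumptions for illustration, so adjust host, port and handler path to your own Tomcat setup. If the extracted metadata already carries a title, that value landing in the same multivalued field as `literal.title` would be one plausible explanation for the two titles described above.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, literals=None, extract_only=True):
    """Build a Solr Cell request URL (hypothetical helper, not part of Solr)."""
    params = []
    if extract_only:
        # Ask the handler to return the extracted content instead of indexing it.
        params.append(("extractOnly", "true"))
    for field, value in (literals or {}).items():
        # literal.<field> supplies a field value from the client side.
        params.append((f"literal.{field}", value))
    query = urlencode(params)
    return f"{base_url}?{query}" if query else base_url


# Example with a placeholder base URL:
url = build_extract_url(
    "http://localhost:8080/solr/update/extract",
    {"id": "doc1", "title": "My title"},
)
```

The PDF itself would then be POSTed to that URL with whatever HTTP client you already use (for example, curl with a multipart file upload).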