lucene-solr-user mailing list archives

From Marc Ghorayeb <dekay...@hotmail.com>
Subject RE: Problem with pdf, upgrading Cell
Date Tue, 04 May 2010 12:21:44 GMT

Praveen,
Did you try the technique I described a little earlier? Take your solr.war and put it in a
directory of its own. Run "jar -xf solr.war" to extract its contents. Next, copy all of
your libraries into the WEB-INF/lib folder: all the extraction/lib files and the lib files
from Solr's root. Once this is done, recreate solr.war by running
"jar -cvf solr.war *" (the * means all the files in the current directory, so be sure
you are in the root directory where you extracted the war).
Then put the new solr.war in the Tomcat webapps folder and recreate the solr folder from
scratch (so that no overlapping libraries are left behind). Hopefully this works.
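
The same repackaging can also be scripted. The following is only a sketch and not the jar commands above: it assumes a WAR can be treated as a plain zip archive and uses Python's zipfile module in place of the jar tool. The function name repack_war and all paths are hypothetical.

```python
import zipfile
from pathlib import Path


def repack_war(src_war, dest_war, extra_lib_dirs):
    """Write a copy of src_war to dest_war with extra jars under WEB-INF/lib/.

    Returns the list of archive names that were added.
    """
    added = []
    with zipfile.ZipFile(src_war) as src, \
            zipfile.ZipFile(dest_war, "w", zipfile.ZIP_DEFLATED) as dest:
        seen = set()
        # Copy the existing WAR entries unchanged.
        for info in src.infolist():
            dest.writestr(info, src.read(info.filename))
            seen.add(info.filename)
        # Add each extra jar, skipping file names that are already present
        # so that no library is stored twice under the same name.
        for lib_dir in extra_lib_dirs:
            for jar in sorted(Path(lib_dir).glob("*.jar")):
                arcname = "WEB-INF/lib/" + jar.name
                if arcname in seen:
                    continue
                dest.write(jar, arcname)
                seen.add(arcname)
                added.append(arcname)
    return added
```

Write dest_war to a different path than src_war. The sketch only detects duplicates by file name, so an older version of a library under a different file name still has to be removed by hand.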
For the multivalued fields (title, for example), this is a known feature/issue of the Tika integration.
In my case, I always provide a literal.title along with my PDFs, but if Tika successfully
extracts a title from the PDF's metadata, the Solr index entry is created with an array
containing both the supplied literal and the extracted value. There is no way to force the
literals to override the extracted data; they just get appended. Someone correct me if I am wrong
here :)
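
For reference, here is a rough sketch of how a literal.title can be attached to an extract request. It assumes the extraction handler is mapped at /update/extract; the host, port, document id, and the helper name build_extract_url are hypothetical.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, title):
    """Build an extract-handler URL that passes the id and title as literals."""
    params = {
        "literal.id": doc_id,      # value stored in the id field
        "literal.title": title,    # value stored in the title field
        "commit": "true",
    }
    # Assumption: the handler is registered at /update/extract.
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


url = build_extract_url("http://localhost:8080/solr", "doc1",
                        "Installing Solr in Tomcat")
```

The PDF itself is then posted to that URL as the request body. If Tika also finds a title in the PDF, the indexed title field will hold both values, as described above.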
Marc

> Date: Tue, 4 May 2010 11:58:56 +0000
> From: pkalwar@gmail.com
> To: solr-user@lucene.apache.org
> Subject: Re: Problem with pdf, upgrading Cell
> 
> another one here..
> 
> On Tue, May 4, 2010 at 5:20 PM, Praveen Agrawal <pkalwar@gmail.com> wrote:
> 
> > It bounced because of attachment's size..
> > attaching one by one now..
> >
> >
> >
> > On Tue, May 4, 2010 at 5:17 PM, Praveen Agrawal <pkalwar@gmail.com> wrote:
> >
> >> I noticed the following relationship between producer/creator and content
> >> extraction; not sure if it is helpful (as Grant said earlier, PDFs are notorious):
> >>
> >> Producer: Bullzip PDF Printer / www.bullzip.com / Freeware Edition (not
> >> registered)
> >> Creator: PScript5.dll Version 5.2.2
> >> Extraction: no content  --  "installing Solr in Tomcat.pdf" (attached - I
> >> generated it)
> >> ---------------------
> >>
> >> Producer: Acrobat Distiller 7.0.5 (Windows)
> >> Creator: PScript5.dll Version 5.2.2
> >> Extraction: one line of content
> >> ---------------------
> >>
> >> Producer: Acrobat Distiller 8.1.0 (Windows)
> >> Creator: Acrobat PDFMaker 8.1 for Word
> >> Extraction: one line of content  --  Free_Two_way_Radio_Guide.pdf (attached
> >> - was freely available on their website)
> >> ---------------------
> >>
> >> Producer: FOP 0.20.5
> >> Extraction: full content  --  "/docs/features.pdf | linkmap.pdf" etc
> >> ---------------------
> >> Thanks.
> >> Praveen
> >>
> >>
> >>
> >> On Tue, May 4, 2010 at 5:05 PM, Praveen Agrawal <pkalwar@gmail.com> wrote:
> >>
> >>> Yes Sandhya,
> >>> I copied the new poi/jempbox/pdfbox/fontbox etc. jars too. I believe this is
> >>> what you were asking.
> >>> Thanks.
> >>>
> >>>
> >>>
> >>> On Tue, May 4, 2010 at 5:01 PM, Sandhya Agarwal <sagarwal@opentext.com> wrote:
> >>>
> >>>> Praveen,
> >>>>
> >>>> Along with the Tika core and parser jars, did you run "mvn
> >>>> dependency:copy-dependencies" to generate all the dependencies too?
> >>>>
> >>>> Thanks,
> >>>> Sandhya
> >>>>
> >>>> -----Original Message-----
> >>>> From: Praveen Agrawal [mailto:pkalwar@gmail.com]
> >>>> Sent: Tuesday, May 04, 2010 4:52 PM
> >>>> To: solr-user@lucene.apache.org
> >>>> Subject: Re: Problem with pdf, upgrading Cell
> >>>>
> >>>> I seem to have mixed results.
> >>>>
> >>>> Here is what I did:
> >>>> copied the new Tika/poi/jempbox/pdfbox/fontbox/log4j etc. jars into
> >>>> contrib/extraction/lib (and of course removed the old ones), as well as into
> >>>> web-inf/lib of the Solr web app in Tomcat.
> >>>>
> >>>> Now it extracts content from some PDFs, but from others it extracts either
> >>>> no content or only one line. For example, "/docs/Installing Solr in Tomcat.pdf"
> >>>> still shows no content. I have two other PDFs for which it extracts only
> >>>> one line of content.
> >>>>
> >>>> Also, I'm now getting a single value in the 'title' field for some PDFs, and
> >>>> two for others. Where it can extract the full content, the title is
> >>>> what I gave as the literal when submitting the PDF. For a PDF where no content
> >>>> was extracted, it shows one empty title plus mine. For a PDF where it
> >>>> extracted only one line of content, it shows that line as a title plus mine.
> >>>> The 'title' field is defined as multivalued in the schema.
> >>>>
> >>>> Any idea what's going on? Or am I missing something?
> >>>>
> >>>>
> >>>>
> >>>> On Tue, May 4, 2010 at 4:13 PM, Marc Ghorayeb <dekay999@hotmail.com>
> >>>> wrote:
> >>>>
> >>>> >
> >>>> > Hey,
> >>>> > I got it to work. I just redid my steps; I had forgotten several
> >>>> > libraries that were imported through the XML. PDF extraction seems to work
> >>>> > once again, and I have yet to find a PDF that raises an exception!
> >>>> >
> >>>> > Thanks for the investigation, at least we now have a fix :)
> >>>> > Marc
> >>>> >
> >>>>
> >>>
> >>>
> >>
> >
 		 	   		  