lucene-solr-user mailing list archives

From Tommaso Teofili <tommaso.teof...@gmail.com>
Subject Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox
Date Wed, 28 Jul 2010 13:56:32 GMT
That was my feeling too :-) so I went with trunk to get things working
quickly. But I also have to consider which version is best, since I am
going to deploy it in an enterprise environment in the near future, and
choosing the right version is an important step.
I am quite new to Solr, but I agree with Alessandro that a slightly patched
release should in theory be more stable than trunk, which gets many updates
weekly (and even daily).
Cheers,
Tommaso

2010/7/28 David Thibault <dthibault@esperion.com>

> Thanks, I'll try that then. I kind of figured that'd be the answer, but
> after fighting with Solr & ExtractingRequestHandler for 2 days I also just
> wanted to be done with it once it started working with 4.0...=)  However,
> stability would be better in the long run.
>
> Best,
> Dave
>
> -----Original Message-----
> From: Alessandro Benedetti [mailto:benedetti.alex85@gmail.com]
> Sent: Wednesday, July 28, 2010 9:33 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Extracting PDF text/comment/callout/typewriter boxes with Solr
> CELL/Tika/PDFBox
>
> In my opinion, the 1.4.1 version with the patch is more stable,
> at least until 4.0 is released.
>
> 2010/7/28 David Thibault <dthibault@esperion.com>
>
> > Yesterday I did get this working with version 4.0 from trunk.  I haven't
> > fully tested it yet, but the content doesn't come through blank anymore,
> > so that's good.  Would it be more stable to stick with 1.4.1 and your
> > patch to get to Tika 0.8, or to stick with the 4.0 trunk version?
> >
> > Best,
> > Dave
> >
> > -----Original Message-----
> > From: Tommaso Teofili [mailto:tommaso.teofili@gmail.com]
> > Sent: Wednesday, July 28, 2010 3:31 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Extracting PDF text/comment/callout/typewriter boxes with
> > Solr CELL/Tika/PDFBox
> >
> > I attached a patch for the Solr 1.4.1 release to
> > https://issues.apache.org/jira/browse/SOLR-1902 that made things work
> > for me.
> > In my case the strange behaviour was because I had copied the patched
> > jars and war into the dist directory but forgot to update the war
> > inside the example/webapps directory (the one inside Jetty).
> > Hope this helps.
> > Tommaso
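
A stale copy of the war like this can surface as the NoSuchMethodError quoted further down: the class is found, but an older build of it is the one actually loaded. Below is a minimal sketch of a check in Java. It only tells you something useful when run with the same classpath as the servlet container (for example from a scratch JSP or a debugger). `ClassOrigin` and its method names are hypothetical helpers, not part of Solr, and the sketch assumes the method of interest is public and takes no arguments.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

/** Diagnostic sketch: where was a class loaded from, and does it expose a given no-arg method? */
class ClassOrigin {

    /** Returns the jar or directory the class was loaded from, or a placeholder when the JVM reports none. */
    static String locationOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap/unknown)"; // typical for JDK core classes
        }
        return source.getLocation().toString();
    }

    /** True if the loaded class declares or inherits a public no-arg method with this name. */
    static boolean hasPublicNoArgMethod(Class<?> cls, String methodName) {
        try {
            Method m = cls.getMethod(methodName);
            return m != null;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Defaults are placeholders so the sketch runs without Solr on the classpath.
        String className = args.length > 0 ? args[0] : "java.lang.String";
        String methodName = args.length > 1 ? args[1] : "length";
        Class<?> cls = Class.forName(className);
        System.out.println(className + " loaded from " + locationOf(cls));
        System.out.println(methodName + "() present: " + hasPublicNoArgMethod(cls, methodName));
    }
}
```

Passing `org.apache.solr.core.SolrResourceLoader` and `getClassLoader` as the two arguments would show which jar or war the container really picked up, which may differ from the jar inspected by hand.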
> >
> > 2010/7/27 David Thibault <dthibault@esperion.com>
> >
> > > Alessandro & all,
> > >
> > > I was having the same issue with Tika crashing on certain PDFs.  I also
> > > noticed the bug where no content was extracted after upgrading Tika.
> > >
> > > When I went to the SOLR issue you link to below, I applied all the
> > > patches, downloaded the Tika 0.8 jars, restarted tomcat, posted a file
> > > via curl, and got the following error:
> > > SEVERE: java.lang.NoSuchMethodError: org.apache.solr.core.SolrResourceLoader.getClassLoader()Ljava/lang/ClassLoader;
> > > at org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:93)
> > > at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:244)
> > > at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:231)
> > > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
> > > at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
> > > at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
> > > at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> > > at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> > > at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
> > > at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
> > > at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
> > > at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> > > at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> > > at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
> > > at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:859)
> > > at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:579)
> > > at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1555)
> > > at java.lang.Thread.run(Thread.java:619)
> > >
> > > This is really weird because I DID apply the SolrResourceLoader patch
> > > that adds the getClassLoader method.  I even verified it by opening up
> > > the JARs and looking at the class file in Eclipse...I can see the
> > > SolrResourceLoader.getClassLoader() method.
> > >
> > > Does anyone know why it can't find the method?  After patching the
> > > source I ran "ant clean dist" in the base directory of the Solr source
> > > tree and everything looked like it compiled (BUILD SUCCESSFUL).  Then I
> > > copied all the jars from dist/ and all the library dependencies from
> > > contrib/extraction/lib/ into my SOLR_HOME.  After restarting tomcat,
> > > everything in the logs looked good.
> > >
> > > I'm stumped.  It would be very nice to have a Solr implementation using
> > > the newest versions of PDFBox & Tika and actually have content being
> > > extracted...=)
> > >
> > > Best,
> > > Dave
> > >
> > >
> > > -----Original Message-----
> > > From: Alessandro Benedetti [mailto:benedetti.alex85@gmail.com]
> > > Sent: Tuesday, July 27, 2010 6:09 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Extracting PDF text/comment/callout/typewriter boxes with
> > > Solr CELL/Tika/PDFBox
> > >
> > > Hi Jon,
> > > Over the last few days we faced the same problem.
> > > Using stock Solr 1.4.1 (Tika 0.4), we can't extract content from some
> > > PDF files, and for others Solr throws an exception during the indexing
> > > process.
> > > You must:
> > > Update the Tika libraries (in /contrib/extraction/lib) with the
> > > tika-core 0.8 snapshot and tika-parsers 0.8.
> > > Update PDFBox and all related libraries.
> > > After that you have to patch Solr 1.4.1 following this patch:
> > >
> > > https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.ext.subversion%3Asubversion-commits-tabpanel
> > > This is the first way to solve the problem.
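
When swapping jars by hand as in the steps above, one thing worth ruling out is an old version left sitting next to the new one in the same lib directory. The thread does not confirm that this happened here. It is only a plausible pitfall. Below is a minimal Java sketch that groups jar file names by artifact and reports any artifact present in more than one version. `JarVersionCheck` is a hypothetical helper, and the `name-version.jar` pattern is a naming assumption that will misparse unusual file names.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: find artifacts that appear with more than one version in a list of jar file names. */
class JarVersionCheck {

    // Heuristic: artifact name, a hyphen, then a version that starts with a digit.
    private static final Pattern JAR_NAME = Pattern.compile("^(.+?)-(\\d[\\w.\\-]*)\\.jar$");

    /** Maps each artifact seen with two or more versions to the versions found. */
    static Map<String, List<String>> conflictingVersions(List<String> jarFileNames) {
        Map<String, List<String>> versionsByArtifact = new TreeMap<>();
        for (String name : jarFileNames) {
            Matcher m = JAR_NAME.matcher(name);
            if (m.matches()) {
                versionsByArtifact
                        .computeIfAbsent(m.group(1), k -> new ArrayList<>())
                        .add(m.group(2));
            }
        }
        // Keep only artifacts that occur more than once.
        versionsByArtifact.values().removeIf(versions -> versions.size() < 2);
        return versionsByArtifact;
    }

    public static void main(String[] args) {
        // Directory to scan; defaults to the current directory.
        String[] names = new File(args.length > 0 ? args[0] : ".").list();
        List<String> jars = names == null ? new ArrayList<>() : Arrays.asList(names);
        System.out.println("Conflicting versions: " + conflictingVersions(jars));
    }
}
```

Pointing it at the extraction lib directory and at the webapp's WEB-INF/lib would list any artifact that exists in two versions within that directory. Files that do not match the naming pattern are ignored.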
> > >
> > > Using Solr 1.4.1 (with the Tika 0.8 snapshot and updated PDFBox), no
> > > exception is thrown during the indexing process, but no content is
> > > extracted.
> > > Using the latest Solr trunk (with the Tika 0.8 snapshot and updated
> > > PDFBox), everything looks good, but we don't know how stable it is!
> > > I hope you now have a clear picture of this issue.
> > > Best Regards
> > >
> > >
> > >
> > > 2010/7/26 Sharp, Jonathan <JSharp@coh.org>
> > >
> > > >
> > > > Every so often I need to index new batches of scanned PDFs, and
> > > > occasionally Adobe's OCR can't recognize the text in a couple of these
> > > > documents.  In these situations I would like to type a small amount of
> > > > text onto the document and have it be extracted by Solr CELL.
> > > >
> > > > Adobe Pro 9 has a number of different ways to add text directly to a
> > > > PDF file:
> > > >
> > > > *Typewriter
> > > > *Sticky Note
> > > > *Callout boxes
> > > > *Text boxes
> > > >
> > > > I tried indexing documents with each of these text additions with Solr
> > > > 1.4.1 + Solr CELL but can't extract the text in any of these boxes.
> > > >
> > > > If someone has modified their Solr CELL installation to use more recent
> > > > versions of Tika (above 0.4) or PDFBox (above 0.7.3) and/or can comment
> > > > on whether newer versions can pull the text out of any of these various
> > > > text boxes, I'd appreciate that very much.
> > > >
> > > > -Jon
> > > >
> > > >
> > >
> > >
> > > --
> > > --------------------------
> > >
> > > Benedetti Alessandro
> > > Personal Page: http://tigerbolt.altervista.org
> > >
> > > "Tyger, tyger burning bright
> > > In the forests of the night,
> > > What immortal hand or eye
> > > Could frame thy fearful symmetry?"
> > >
> > > William Blake - Songs of Experience -1794 England
> > >
> > >
> >
> >
>
>
>
