lucene-solr-user mailing list archives

From Sandhya Agarwal <sagar...@opentext.com>
Subject RE: Problem with pdf, upgrading Cell
Date Tue, 04 May 2010 10:43:02 GMT
OK. In Tika 0.4 and 0.5, this is how the Tika config is loaded:



  public static TikaConfig getDefaultConfig() {
    InputStream stream;
    try {
      stream = TikaConfig.class.getResourceAsStream("/org/apache/tika/tika-config.xml");
      return new TikaConfig(stream);
    } catch (IOException e) {
      throw new RuntimeException("Unable to read default configuration", e);
    } catch (SAXException e) {
      throw new RuntimeException("Unable to parse default configuration", e);
    } catch (TikaException e) {
      throw new RuntimeException("Unable to access default configuration", e);
    }
  }



In Tika 0.7 this has changed to:



  public TikaConfig()
      throws MimeTypeException, IOException {
    this.parsers = new HashMap();

    ParseContext context = new ParseContext();
    Iterator iterator = ServiceRegistry.lookupProviders(Parser.class);

    while (iterator.hasNext()) {
      Parser parser = (Parser)iterator.next();
      for (Iterator i$ = parser.getSupportedTypes(context).iterator(); i$.hasNext(); ) {
        MediaType type = (MediaType)i$.next();
        this.parsers.put(type.toString(), parser);
      }
    }
    this.mimeTypes = MimeTypesFactory.create("tika-mimetypes.xml");
  }



That is why tika-config.xml is no longer bundled: in 0.7 the parsers are discovered through the provider lookup instead of being read from a config file.
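
To make the consequence concrete, here is a minimal sketch of that registration loop. It is not Tika's code: SimpleParser, EchoParser, EMPTY and ParserRegistrySketch are hypothetical stand-ins, and the fallback to a do-nothing parser is an assumption based on the EmptyParser symptom reported further down this thread. The only point is that when the provider lookup yields nothing, the map stays empty and every MIME type resolves to the fallback, even if the type itself was detected correctly.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class ParserRegistrySketch {

    /** Hypothetical, simplified stand-in for a parser interface. */
    public interface SimpleParser {
        Set<String> supportedTypes();
        String parse(String input);
    }

    /** Hypothetical parser that claims PDF and returns its input unchanged. */
    public static class EchoParser implements SimpleParser {
        public Set<String> supportedTypes() {
            return Collections.singleton("application/pdf");
        }
        public String parse(String input) {
            return input;
        }
    }

    /** Assumed fallback: supports nothing and extracts nothing. */
    public static final SimpleParser EMPTY = new SimpleParser() {
        public Set<String> supportedTypes() {
            return Collections.emptySet();
        }
        public String parse(String input) {
            return "";
        }
    };

    private final Map<String, SimpleParser> parsers = new HashMap<>();

    /** Registers every discovered parser under each type it supports. */
    public ParserRegistrySketch(Iterable<SimpleParser> discovered) {
        for (SimpleParser parser : discovered) {
            for (String type : parser.supportedTypes()) {
                parsers.put(type, parser);
            }
        }
    }

    /** Returns the registered parser, or the empty fallback if none matched. */
    public SimpleParser parserFor(String mimeType) {
        return parsers.getOrDefault(mimeType, EMPTY);
    }
}
```

With an empty list of discovered parsers, parserFor("application/pdf") returns EMPTY, which would match the behaviour of metadata being indexed while the content is missing.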



Thanks,

Sandhya



-----Original Message-----
From: Grant Ingersoll [mailto:gsiasf@gmail.com] On Behalf Of Grant Ingersoll
Sent: Tuesday, May 04, 2010 4:00 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with pdf, upgrading Cell



Yes, it is loading the libraries, but they are in a different classloader, one that Tika's
new loading mechanism apparently doesn't have access to.
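
A small sketch of that visibility problem, assuming the lookup goes through the standard META-INF/services mechanism. It uses java.util.ServiceLoader rather than the ServiceRegistry call in the Tika snippet, and DemoParser and DemoPdfParser are hypothetical names. The provider registration is only visible to a child classloader (standing in for the one Solr builds from the lib dirs), so a lookup through the parent finds nothing while a lookup through the child finds the provider.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ServiceLoader;

public class ProviderVisibilityDemo {

    /** Hypothetical service interface, standing in for a parser type. */
    public interface DemoParser {
        String name();
    }

    /** Hypothetical provider implementation. */
    public static class DemoPdfParser implements DemoParser {
        public String name() {
            return "pdf";
        }
    }

    /** Counts the providers that are discoverable through the given loader. */
    static int countProviders(ClassLoader loader) {
        int count = 0;
        for (DemoParser ignored : ServiceLoader.load(DemoParser.class, loader)) {
            count++;
        }
        return count;
    }

    /** Returns { providers seen by the parent, providers seen by the child }. */
    public static int[] run() throws IOException {
        // Write a META-INF/services registration into a temp directory that
        // only the child classloader has on its search path.
        Path root = Files.createTempDirectory("providers");
        Path services = Files.createDirectories(root.resolve("META-INF/services"));
        Files.write(services.resolve(DemoParser.class.getName()),
                DemoPdfParser.class.getName().getBytes(StandardCharsets.UTF_8));

        ClassLoader parent = ProviderVisibilityDemo.class.getClassLoader();
        try (URLClassLoader child =
                new URLClassLoader(new URL[] { root.toUri().toURL() }, parent)) {
            return new int[] { countProviders(parent), countProviders(child) };
        }
    }

    public static void main(String[] args) throws IOException {
        int[] counts = run();
        System.out.println("parent sees " + counts[0] + ", child sees " + counts[1]);
    }
}
```

If the real cause is the same, any fix would need the lookup to run against the loader that actually holds the Tika jars; whether that is best done in Solr or in Tika is not something this sketch answers.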



-Grant



On May 4, 2010, at 3:28 AM, Sandhya Agarwal wrote:

> Hello,
>
> But I see that the libraries are being loaded:
>
> INFO: Adding specified lib dirs to ClassLoader
> May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/asm-3.1.jar' to classloader
> May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/bcmail-jdk15-1.45.jar' to classloader
> May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/bcprov-jdk15-1.45.jar' to classloader
> May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/commons-compress-1.0.jar' to classloader
> May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/commons-logging-1.1.1.jar' to classloader
> May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/dom4j-1.6.1.jar' to classloader
> May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/fontbox-1.1.0.jar' to classloader
> May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/geronimo-stax-api_1.0_spec-1.0.1.jar' to classloader
> May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/jempbox-1.1.0.jar' to classloader
> May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/log4j-1.2.14.jar' to classloader
> May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/metadata-extractor-2.4.0-beta-1.jar' to classloader
> May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/pdfbox-1.1.0.jar' to classloader
> May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/poi-3.6.jar' to classloader
> May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/poi-ooxml-3.6.jar' to classloader
> May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/poi-ooxml-schemas-3.6.jar' to classloader
> May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/poi-scratchpad-3.6.jar' to classloader
> May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/tagsoup-1.2.jar' to classloader
> May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/tika-core-0.7.jar' to classloader
> May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/tika-parsers-0.7.jar' to classloader
> May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/xercesImpl-2.8.1.jar' to classloader
> May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/xml-apis-1.0.b2.jar' to classloader
> May 4, 2010 12:49:59 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/extraction/lib/xmlbeans-2.3.0.jar' to classloader
> May 4, 2010 12:50:16 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/dist/apache-solr-cell-1.4.0.jar' to classloader
> May 4, 2010 12:50:20 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/dist/apache-solr-clustering-1.4.0.jar' to classloader
> May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/carrot2-mini-3.1.0.jar' to classloader
> May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/commons-lang-2.4.jar' to classloader
> May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/ehcache-1.6.2.jar' to classloader
> May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/google-collections-1.0-rc2.jar' to classloader
> May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/jackson-core-asl-0.9.9-6.jar' to classloader
> May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/jackson-mapper-asl-0.9.9-6.jar' to classloader
> May 4, 2010 12:51:52 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
> INFO: Adding 'file:/C:/apache-solr-1.4.0/contrib/clustering/lib/log4j-1.2.14.jar' to classloader
>
> Thanks,
> Sandhya
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:gsiasf@gmail.com] On Behalf Of Grant Ingersoll
> Sent: Tuesday, May 04, 2010 6:13 AM
> Cc: solr-user@lucene.apache.org
> Subject: Re: Problem with pdf, upgrading Cell
>
> Little more info... Seems to be a classloading issue.  The tests pass, but they aren't
> loading the Tika libraries via the Solr ResourceLoader, whereas the example is.  Marc, one
> thing to try is to unjar the Solr WAR file and put the Tika libs in there, as I bet it will
> then work.  Note, however, I haven't tried this.
>

> On May 3, 2010, at 6:24 PM, Grant Ingersoll wrote:
>
>> I've opened https://issues.apache.org/jira/browse/SOLR-1902 to track this.  It is
>> indeed a bug somewhere (still investigating).  It seems that Tika is now picking an
>> EmptyParser implementation when trying to determine which parser to use, despite the
>> fact that it properly identifies the MIME Type.
>>
>> -Grant
>>
>> On May 3, 2010, at 5:36 PM, Grant Ingersoll wrote:
>>
>>> I'm investigating.
>>>
>>> On May 3, 2010, at 5:17 AM, Marc Ghorayeb wrote:
>>>
>>>> Hi,
>>>> Grant, i confirm what Praveen has said, any PDF i try does not work with
>>>> the new Tika and SVN versions. :(
>>>> Marc
>>>>

>>>>> From: sagarwal@opentext.com
>>>>> To: solr-user@lucene.apache.org
>>>>> Date: Mon, 3 May 2010 13:05:24 +0530
>>>>> Subject: RE: Problem with pdf, upgrading Cell
>>>>>
>>>>> Hello,
>>>>>
>>>>> Please let me know if anybody figured out a way out of this issue.
>>>>>
>>>>> Thanks,
>>>>> Sandhya
>>>>>
>>>>> -----Original Message-----
>>>>> From: Praveen Agrawal [mailto:pkalwar@gmail.com]
>>>>> Sent: Friday, April 30, 2010 11:14 PM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Re: Problem with pdf, upgrading Cell
>>>>>
>>>>> Grant,
>>>>> You can try any of the sample pdfs that come in /docs folder of Solr 1.4
>>>>> dist'n. I had tried 'Installing Solr in Tomcat.pdf', 'index.pdf' etc. Only
>>>>> metadata i.e. stream_size, content_type apart from my own literals are
>>>>> indexed, and content is missing..
>>>>>

>>>>> On Fri, Apr 30, 2010 at 8:52 PM, Grant Ingersoll <gsingers@apache.org> wrote:
>>>>>
>>>>>> Praveen and Marc,
>>>>>>
>>>>>> Can you share the PDF (feel free to email my private email) that fails in
>>>>>> Solr?
>>>>>>
>>>>>> Thanks,
>>>>>> Grant
>>>>>>

>>>>>> On Apr 30, 2010, at 7:55 AM, Marc Ghorayeb wrote:
>>>>>>
>>>>>>> Hi
>>>>>>> Nope, I didn't get it to work... Just like you, the command line version of
>>>>>>> Tika extracts the content correctly, but once included in Solr, no content
>>>>>>> is extracted.
>>>>>>> What I have tried until now is:
>>>>>>> - Updating the Tika libraries inside the Solr 1.4 public version, no luck there.
>>>>>>> - Downloading the latest SVN version, compiling it, and starting from a simple
>>>>>>>   schema, still no luck.
>>>>>>> - Getting other versions compiled on Hudson (nightly builds), and testing them
>>>>>>>   also, still no extraction.
>>>>>>> I sent a mail to the developers mailing list but they told me I should
>>>>>>> just mail here. I hope some developer reads this because it's quite an
>>>>>>> important feature of Solr and somehow it got broken between the 1.4 release
>>>>>>> and the latest version in SVN.
>>>>>>> Marc


>>>>>>
>>>>>> --------------------------
>>>>>> Grant Ingersoll
>>>>>> http://www.lucidimagination.com/
>>>>>>
>>>>>> Search the Lucene ecosystem using Solr/Lucene:
>>>>>> http://www.lucidimagination.com/search
>>>>>>

