jackrabbit-users mailing list archives

From Paul Bernet <paul.ber...@crealogix.com>
Subject RE: Problem with Indexing XML Docs with Tika in Jackrabbit 2.6.2
Date Tue, 23 Jul 2013 12:36:26 GMT
My problem was caused by the MIME type of the XML files, which the application had set to
text/xml.
With the MIME type set to application/xml, the XMLParser is called.

The alias declared in tika-mimetypes.xml (line 3200):

<mime-type type="application/xml">
  <alias type="text/xml"/>

does not seem to have any effect.

Mapping the XMLParser to text/xml in tika-config.xml has no effect either.
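
One way to apply this fix on the application side is to normalize the MIME type before it is stored with the file. The following is only a minimal sketch: the class MimeTypeNormalizer and its normalize method are hypothetical names, not part of Jackrabbit or Tika, and it assumes that text/xml is the only alias that needs rewriting.

```java
import java.util.Locale;

// Hypothetical helper, not part of Jackrabbit or Tika.
class MimeTypeNormalizer {

    // Returns "application/xml" for any spelling of "text/xml"
    // (case-insensitive, parameters such as "; charset=UTF-8" dropped).
    // Every other value is returned trimmed but otherwise unchanged.
    static String normalize(String mimeType) {
        if (mimeType == null) {
            return null;
        }
        String trimmed = mimeType.trim();
        int semicolon = trimmed.indexOf(';');
        String base = (semicolon < 0 ? trimmed : trimmed.substring(0, semicolon))
                .trim()
                .toLowerCase(Locale.ROOT);
        if (base.equals("text/xml")) {
            return "application/xml";
        }
        return trimmed;
    }
}
```

The result of normalize(...) would then be the value written as the MIME type of the file, so that the parser configured for application/xml is selected.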

Regards
Paul

-----Original Message-----
From: Paul Bernet [mailto:paul.bernet@crealogix.com] 
Sent: Wednesday, 17 July 2013 18:41
To: users@jackrabbit.apache.org
Subject: Problem with Indexing XML Docs with Tika in Jackrabbit 2.6.2

Hi,

I am migrating a Jackrabbit Instance from 2.2.13 to 2.6.2 using:
jackrabbit-core
jackrabbit-jcr-commons
jackrabbit-jcr-rmi
For indexing I am using the tika-core module and parts of tika-parsers.
The complete tika-parsers module causes problems (among others, its aspectjrt-1.6.x.jar
conflicts with my one-jar package meccano), so I am trying to include only the parser classes
and dependencies needed to index .pdf and .xml files. Indexing via the PDFParser works, but
the DcXMLParser is never executed and no XML content ends up in the index.
When I configure the EmptyParser for the application/xml MIME type, the EmptyParser is not
called either.

What confuses me is that the PDFParser configuration is read from tika-config.xml (I can prove
this by deliberately misspelling the class name) and the parser is called at runtime.
The XMLParser configuration is read as well, but that parser is not called at runtime.

tika-config.xml
...
<mimeTypeRepository resource="/org/apache/tika/mime/tika-mimetypes.xml" magic="false"/>
<parsers>
  <parser name="parse-pdf" class="org.apache.tika.parser.pdf.PDFParser">
    <mime>application/pdf</mime>
  </parser>

  <parser name="parse-dcxml" class="org.apache.tika.parser.xml.DcXMLParser">
    <mime>application/xml</mime>
    <mime>image/svg+xml</mime>
  </parser>

  <parser class="org.apache.tika.parser.DefaultParser"/>
  <parser class=" org.apache.tika.parser.EmptyParser ">
    <!--  <mime>application/xml</mime> -->
  </parser>
</parsers>
....

The XML files have the MIME type application/xml.
The other configuration file, /resources/META-INF/services/org.apache.tika.parser.Parser, is
in a sub-jar of the one-jar package. Because it had no effect there, I took it out of the jar
and referenced it explicitly on the classpath at startup, but that had no effect either. Is this
file needed for the parsers to work?
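
To check whether the service file is visible at all, the class loader can be asked for it directly. As far as I understand, Tika's DefaultParser discovers parsers through the standard Java service-provider mechanism, which expects the resource META-INF/services/org.apache.tika.parser.Parser at the root of a classpath entry. The sketch below uses only JDK calls; the class name ServiceFileCheck is hypothetical, and which class loader a one-jar package actually hands to Tika is an assumption that would need to be verified.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Hypothetical diagnostic class, not part of Tika or Jackrabbit.
class ServiceFileCheck {

    static final String SERVICE_FILE =
            "META-INF/services/org.apache.tika.parser.Parser";

    // Collects every copy of the service file that the given loader can see.
    static List<URL> findServiceFiles(ClassLoader loader) {
        List<URL> found = new ArrayList<>();
        try {
            Enumeration<URL> urls = loader.getResources(SERVICE_FILE);
            while (urls.hasMoreElements()) {
                found.add(urls.nextElement());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        // Prefer the context class loader; fall back to this class's loader.
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ServiceFileCheck.class.getClassLoader();
        }
        List<URL> files = findServiceFiles(loader);
        if (files.isEmpty()) {
            System.out.println("Not visible: " + SERVICE_FILE);
        }
        for (URL url : files) {
            System.out.println("Found: " + url);
        }
    }
}
```

If nothing is found when this runs inside the packaged application, the file is probably sitting under a different path in the jar (for example below a resources/ prefix) or in a jar that this class loader does not search.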

Thanks for any hints!
Paul
