jackrabbit-users mailing list archives

From Ross.Dy...@ipaustralia.gov.au
Subject Re: jackrabbit, lucene, tika ... and pdfbox [SEC=UNCLASSIFIED]
Date Wed, 09 Mar 2011 04:05:27 GMT
Wasn't it OK for you to add an indexing_configuration.xml and exclude the 
binary data from indexing?

I tried that and cut the re-indexing time to around 25% of what it was, 
presumably with much smaller index files too.  I left in the attributes that I 
may sometimes want to search on, e.g. unique keys, document title.
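
For anyone who hasn't used one, a minimal sketch of such an
indexing_configuration.xml follows. The "my" namespace, the my:document
node type and the my:key / my:title property names are hypothetical
stand-ins for the unique keys and document title mentioned above. The
sketch relies on the behaviour that, once a node matches an index-rule,
only the properties listed in that rule are indexed.

```xml
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
               xmlns:my="http://example.com/my/1.0">
  <!-- nt:resource nodes: index only the small metadata properties.
       jcr:data (the binary content) is not listed, so it is skipped. -->
  <index-rule nodeType="nt:resource">
    <property>jcr:mimeType</property>
    <property>jcr:lastModified</property>
  </index-rule>
  <!-- Hypothetical application node type: keep only the searchable attributes. -->
  <index-rule nodeType="my:document">
    <property>my:key</property>
    <property>my:title</property>
  </index-rule>
</configuration>
```

The file is picked up through an indexingConfiguration param on the
SearchIndex element (for example with the value
${wsp.home}/indexing_configuration.xml), and existing content has to be
re-indexed before the change takes effect.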

Ross.



From:   Kevin Jansz <kevin.jansz@exari.com>
To:     users@jackrabbit.apache.org
Date:   09/03/2011 02:52 PM
Subject:        jackrabbit, lucene, tika ... and pdfbox



It's been discussed on this list before but I'm summarising my latest
issues/findings ...

Our use of jackrabbit is for content storage without the built-in
search/querying mechanism. It's possible to leave out the
"SearchIndex" definition in the configuration but you're effectively
"breaking" the weak reference handling (used by user-management) -
non-critical and the repository *seems* to work without it despite
logging warnings. But I feel it's better to leave the SearchIndex,
and therefore querying, in ... so:

Weak-references
-> requires SearchIndex / querying
    -> requires lucene (for now, there's no simple alternative)
        -> requires tika (core)
            -> requires various other format handling libraries for
               different parser implementations

In jackrabbit 2.1.x if you want custom parsers - or in my case no
parsers and the associated overhead and library dependence - you can't
easily do this as the jackrabbit-core jar includes a tika-config.xml
and loads this explicitly (from
org\apache\jackrabbit\core\query\lucene\tika-config.xml). The only
work-around is to replace this file in the jar file - not ideal.

JIRA issue JCR-2642 (and then TIKA-317) raised the point that making (very
sensible) use of the jar file "Service Provider" mechanism could
simplify things. Drop a jar file that defines parsers into the
classpath and it gets used ... my reading of this was that to get no
parsers we'd simply leave out tika-parsers-0.8.jar from the classpath.
It also made sense that the jackrabbit-core may still include a
tika-config.xml to a) use DefaultParser b) explicitly disable zip and
image extraction. Unfortunately, on upgrading to 2.2.4 errors about
missing pdfbox libraries (when storing PDF content) led me to this in
tika-config.xml (in the jackrabbit-core jar file):
    <parser class="org.apache.jackrabbit.core.query.pdf.PDFParser">
      <!-- JCR-2838: Override the faulty PDF parser in Tika 0.8 -->
      <mime>application/pdf</mime>
    </parser>

Looking at JIRA issue JCR-2838 (and then TIKA-548) it's clear there's a
problem. I'm not entirely sure why the work-around is in
jackrabbit-core. I would have thought putting this in a
xxxxx-parsers-2.2.4.jar with a META-INF/services/... definition would
have been the correct way to handle this? To avoid issues of
parser/service-provider precedence? Perhaps a separate jar-build for
this issue would be overkill for a point release?

It's not a huge issue I guess as it seems with tika 0.9 (or 0.8.1?)
the PDF parser issue will be resolved in which case I expect the code
in org.apache.jackrabbit.core.query.pdf.* will disappear along with
reference to it from the tika-config.xml. In the meantime we're back
to having to replace
org\apache\jackrabbit\core\query\lucene\tika-config.xml in the
jackrabbit-core jar to avoid custom parsers (and errors about their
dependencies). I'm taking the time to mention it here in case it saves
someone time, and also to gauge whether our view of lucene, tika and the
parsers is incorrect, i.e. whether future releases of jackrabbit may
still include parsers other than DefaultParser and EmptyParser in its
tika-config.xml.
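
A replacement tika-config.xml along these lines might look like the
sketch below. It is untested. The <parser> and <mime> elements are as
quoted from the jackrabbit-core file above, while the enclosing
<properties> and <parsers> elements are an assumption based on the Tika
configuration format of that era. It omits DefaultParser entirely and
maps PDF to Tika's EmptyParser, so no parser library should be needed
at runtime.

```xml
<properties>
  <parsers>
    <!-- No DefaultParser entry: nothing is picked up from tika-parsers. -->
    <!-- PDF is handled by EmptyParser, which extracts no text and so
         needs no pdfbox classes. -->
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>
```

With a file like this in place the SearchIndex stays configured, but no
full-text content is extracted from binaries.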

Regards,
Kevin

--
Kevin Jansz
kevin.jansz@exari.com
Level 7, 10-16 Queen Street, Melbourne 3000 Australia
Tel +61 3 9621 2773 | Fax +61 3 9621 2776
Exari Systems
Boston | London | Melbourne | Munich
www.exari.com

Test drive our software online - www.exari.com/demo-trial.html
Read our blog on document assembly - blog.exari.com

