jackrabbit-users mailing list archives

From: Kevin Jansz <kevin.ja...@exari.com>
Subject: Re: jackrabbit, lucene, tika ... and pdfbox [SEC=UNCLASSIFIED]
Date: Wed, 09 Mar 2011 23:56:09 GMT

Setting indexing_configuration.xml will stop the extracted text from
being added to the index, but it won't stop the text extraction from
happening first.

At least, that is what I see in my testing. Looking at the call stack:
LazyTextExtractorField.<init>(Parser, InternalValue, Metadata, Executor, boolean, int) line: 78
NodeIndexer.createFulltextField(InternalValue, Metadata) line: 841
NodeIndexer.addBinaryValue(Document, String, InternalValue) line: 456
NodeIndexer.addValue(Document, InternalValue, Name) line: 324
NodeIndexer.createDoc() line: 258
SearchIndex.createDocument(NodeState, NamespaceMappings, IndexFormatVersion) line: 1077

This means that with the default tika-config.xml (with the PDF
extractor), and assuming you do have pdfbox on your classpath, you pay
the cost of PDF text extraction every time this binary content is added
to the repository. Setting indexing_configuration.xml seems only to stop
the extracted text being added to the index for querying purposes.
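
To make the ordering concrete, here is a small stand-alone Java model of
the behaviour described above. It is only a sketch: it does not use any
Jackrabbit or Tika classes, and the names ExtractionOrderSketch,
extractText and indexBinary are invented for illustration. It assumes,
as the call stack suggests, that the full-text field is built before the
indexing rule decides whether to keep it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class ExtractionOrderSketch {
    /** Counts how many times the simulated (expensive) extraction runs. */
    static final AtomicInteger extractions = new AtomicInteger();

    /** Hypothetical stand-in for a PDF text extractor; not a Tika class. */
    static String extractText(byte[] binary) {
        extractions.incrementAndGet();
        return "text from " + binary.length + " bytes";
    }

    /**
     * Models the observed ordering: extraction is triggered while the
     * full-text field is created, and the indexing rule only decides
     * whether the result is added to the index afterwards.
     */
    static List<String> indexBinary(byte[] binary, boolean ruleIncludesProperty) {
        List<String> indexedFields = new ArrayList<>();
        String fulltext = extractText(binary);   // cost is paid here regardless
        if (ruleIncludesProperty) {
            indexedFields.add(fulltext);          // the rule only gates this step
        }
        return indexedFields;
    }

    public static void main(String[] args) {
        // Property excluded by the (simulated) indexing configuration.
        List<String> fields = indexBinary(new byte[1024], false);
        System.out.println("indexed fields: " + fields.size());
        System.out.println("extractions run: " + extractions.get());
    }
}
```

In this model the excluded property contributes nothing to the index,
yet the extraction counter still goes up by one, which is the cost
described above.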


--
Kevin Jansz
kevin.jansz@exari.com
Level 7, 10-16 Queen Street, Melbourne 3000 Australia
Tel +61 3 9621 2773 | Fax +61 3 9621 2776
Exari Systems
Boston | London | Melbourne | Munich
www.exari.com

Test drive our software online - www.exari.com/demo-trial.html
Read our blog on document assembly - blog.exari.com




On 9 March 2011 15:05,  <Ross.Dyson@ipaustralia.gov.au> wrote:
> It wasn't OK for you to add an indexing_configuration.xml and exclude the
> indexing of the binary data?
>
> I tried that and decreased the re-indexing time to around 25%, and
> presumably much smaller index files too.  I left in the attributes that I
> may sometimes want to search on, eg unique keys, document title.
>
> Ross.
>
>
>
> From:        Kevin Jansz <kevin.jansz@exari.com>
> To:        users@jackrabbit.apache.org
> Date:        09/03/2011 02:52 PM
> Subject:        jackrabbit, lucene, tika ... and pdfbox
> ________________________________
>
>
> It's been discussed on this list before but I'm summarising my latest
> issues/findings ...
>
> Our use of jackrabbit is for content storage without the built-in
> search/querying mechanism. It's possible to leave out the
> "SearchIndex" definition in the configuration but you're effectively
> "breaking" the weak reference handling (used by user-management) -
> non-critical and the repository *seems* to work without it despite
> logging warnings. But I feel it's better to leave the SearchIndex,
> therefore querying in ... so:
>
> Weak-references
> -> requires SearchIndex / querying
>    -> requires lucene (for now, there's no simple alternative)
>        -> requires tika (core)
>            -> requires various other format handling libraries for
> different parser implementations
>
> In jackrabbit 2.1.x if you want custom parsers - or in my case no
> parsers and the associated overhead and library dependence - you can't
> easily do this as the jackrabbit-core jar includes a tika-config.xml
> and loads this explicitly (from
> org\apache\jackrabbit\core\query\lucene\tika-config.xml). The only
> work-around is to replace this file in the jar file - not ideal.
>
> It's raised in JIRA issues JCR-2642 (& then TIKA-317) that making (very
> sensible) use of the jar file "Service Provider" mechanism could
> simplify things. Drop a jar file into the classpath that defines
> parsers and this gets used ... my reading of this was that to get no
> parsers we'd simply leave out tika-parsers-0.8.jar from the classpath.
> It also made sense that the jackrabbit-core may still include a
> tika-config.xml to a) use DefaultParser b) explicitly disable zip and
> image extraction. Unfortunately, on upgrading to 2.2.4 errors about
> missing pdfbox libraries (when storing PDF content) led me to this in
> tika-config.xml (in the jackrabbit-core jar file):
>    <parser class="org.apache.jackrabbit.core.query.pdf.PDFParser">
>      <!-- JCR-2838: Override the faulty PDF parser in Tika 0.8 -->
>      <mime>application/pdf</mime>
>    </parser>
>
> Looking at JIRA issues JCR-2838 (& then TIKA-548) it's clear there's a
> problem. I'm not entirely sure why the workaround is in
> jackrabbit-core. I would have thought putting this in an
> xxxxx-parsers-2.2.4.jar with a META-INF/services/... definition would
> have been the correct way to handle this? To avoid issues of
> parser/service-provider precedence? Perhaps a separate jar-build for
> this issue would be overkill for a point release?
>
> It's not a huge issue I guess, as it seems that with tika 0.9 (or
> 0.8.1?) the PDF parser issue will be resolved, in which case I expect
> the code in org.apache.jackrabbit.core.query.pdf.* will disappear along
> with the reference to it from the tika-config.xml. In the meantime we're
> back to having to replace
> org\apache\jackrabbit\core\query\lucene\tika-config.xml in the
> jackrabbit-core jar to avoid custom parsers (and errors about their
> dependencies). I'm taking the time to mention it here in case it saves
> someone time and also to gauge if our view of lucene, tika and the
> parsers is incorrect - that future releases of jackrabbit may still
> include parsers other than DefaultParser and EmptyParser in its
> tika-config.xml.
>
> Regards,
> Kevin
>

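For readers unfamiliar with the jar file "Service Provider" mechanism
mentioned in the quoted message, the following is a minimal Java sketch
of how provider discovery works with java.util.ServiceLoader. The
TextParser interface and the ParserDiscoverySketch class are
hypothetical and are not Tika's actual parser interface; the sketch only
illustrates the general idea that providers are picked up from
META-INF/services/... entries in whatever jars are on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public class ParserDiscoverySketch {
    /** Hypothetical service interface standing in for a parser SPI. */
    public interface TextParser {
        String parse(byte[] content);
    }

    /**
     * Collects every implementation registered in a
     * META-INF/services/ file named after the interface's binary name.
     * Jars that do not ship such a file contribute nothing.
     */
    static List<TextParser> discover() {
        List<TextParser> found = new ArrayList<>();
        for (TextParser parser : ServiceLoader.load(TextParser.class)) {
            found.add(parser);
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println("parsers found: " + discover().size());
    }
}
```

With no provider jar on the classpath the returned list is empty, which
is the "leave the parsers jar out to get no parsers" behaviour the
quoted message was hoping for. An explicit entry in a configuration file
is a separate route and is not affected by this discovery step.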