commons-issues mailing list archives

From "Sebb (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (COMPRESS-285) checking of availability of XZ compression is expensive - result should be reused
Date Tue, 29 Jul 2014 06:34:42 GMT

    [ https://issues.apache.org/jira/browse/COMPRESS-285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077436#comment-14077436 ]

Sebb commented on COMPRESS-285:
-------------------------------

I think it would be possible to speed up the code in the case that the archive is not an XZ
archive.

1) Call XZCompressorInputStream.matches() first and only check whether XZ compression is
available if the stream has an XZ signature. This would require making a local copy of
XZ.HEADER_MAGIC; it is probably sensible to move matches() to XZUtils as well.

2) Move the check for XZ to the end of CompressorStreamFactory so the other 3 formats are
checked first.
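
As a rough illustration of suggestion 1, the sketch below compares the first six bytes of the
stream against a local copy of the XZ magic and only then runs the availability check.
XzSniffer, isReadableXz and the BooleanSupplier parameter are hypothetical names made up for
this example, not Commons Compress API; the supplier stands in for
XZUtils.isXZCompressionAvailable().

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of Commons Compress.
final class XzSniffer {
    // Local copy of the XZ stream header magic (FD '7' 'z' 'X' 'Z' 00),
    // so that matching a signature does not need any class from the XZ library.
    private static final byte[] XZ_MAGIC = {(byte) 0xFD, '7', 'z', 'X', 'Z', 0};

    private XzSniffer() {
    }

    // Returns true if the first bytes of the signature are the XZ magic.
    static boolean matches(byte[] signature, int length) {
        if (length < XZ_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < XZ_MAGIC.length; i++) {
            if (signature[i] != XZ_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Runs the (assumed expensive) availability check only for XZ-looking input.
    static boolean isReadableXz(byte[] signature, int length, BooleanSupplier availabilityCheck) {
        return matches(signature, length) && availabilityCheck.getAsBoolean();
    }
}
```

With this ordering, input in any other format never reaches the availability check, because
&& short-circuits once the signature comparison fails.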

Another possibility would be to add a second constructor that takes a Boolean holding the
result of isXZCompressionAvailable (null meaning unknown, so the check must still be done).
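
A minimal sketch of that idea follows, assuming a factory-like class with a tri-state Boolean
field. XzAwareFactory, its constructors and the BooleanSupplier probe are hypothetical and only
stand in for CompressorStreamFactory and XZUtils.isXZCompressionAvailable(); remembering the
answer after the first check is an extra assumption, taken from the issue title rather than
from the suggestion itself.

```java
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for a factory that needs to know whether XZ is usable.
final class XzAwareFactory {
    // null = unknown, TRUE/FALSE = answer supplied by the caller or already determined
    private Boolean xzAvailable;
    // stands in for the expensive XZUtils.isXZCompressionAvailable() call
    private final BooleanSupplier probe;

    // Existing behaviour: nothing is known up front.
    XzAwareFactory(BooleanSupplier probe) {
        this(null, probe);
    }

    // Second constructor: the caller may pass in a previously obtained result.
    XzAwareFactory(Boolean xzAvailable, BooleanSupplier probe) {
        this.xzAvailable = xzAvailable;
        this.probe = probe;
    }

    // Probes only while the state is unknown, then reuses the stored answer.
    synchronized boolean isXzAvailable() {
        if (xzAvailable == null) {
            xzAvailable = probe.getAsBoolean();
        }
        return xzAvailable;
    }
}
```

A caller such as Tika could then run the check once and hand the result to every factory it
creates afterwards.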

> checking of availability of XZ compression is expensive - result should be reused
> ---------------------------------------------------------------------------------
>
>                 Key: COMPRESS-285
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-285
>             Project: Commons Compress
>          Issue Type: Improvement
>          Components: Compressors
>    Affects Versions: 1.5, 1.6, 1.7, 1.8
>         Environment: linux 64-bit, java 7, glassfish, solr, tika
>            Reporter: Wojciech Łozowicki
>            Priority: Minor
>              Labels: performance
>
> I use solr with apache tika for indexing documents. Tika uses commons-compress to handle
> compressed files. Using sampler (jvisualvm) I have seen that quite a lot of time (5-7%) during
> my tests is spent in XZUtils.isXZCompressionAvailable because of unavailable XZ compression
> (I guess for each time classloaders spend some time looking for unavailable classes, then
> NoClassDefFoundError).
> I think the result of the first check should be stored and reused.
> Here is the stacktrace (just to show the way tika is using commons-compress):
> org.apache.commons.compress.compressors.xz.XZUtils.isXZCompressionAvailable(XZUtils.java:52)
> 	at org.apache.commons.compress.compressors.CompressorStreamFactory.createCompressorInputStream(CompressorStreamFactory.java:140)
> 	at org.apache.tika.parser.pkg.ZipContainerDetector.detectCompressorFormat(ZipContainerDetector.java:95)
> 	at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:81)
> 	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
