commons-dev mailing list archives

From Albretch Mueller <lbrt...@gmail.com>
Subject [compress] BZip2CompressorInputStream stops working without rhyme or reason ...
Date Tue, 13 Oct 2020 12:43:57 GMT
 As part of my corpus research I have to work with very large text
files. Wikipedia dumps are bzip2-compressed, so I have been working with:

 commons/compress/compressors/bzip2/BZip2CompressorInputStream.html

 and I consistently notice that it just stops processing without an
error of any kind.

 I checked the file at the offset where it stops and I also checked
the file with the Linux bzip2 utility, and nothing seems to be wrong in
any way. The source file I used is:

 enwiki-20141008-pages-articles.xml.bz2

 which you can get from:

 http://torrentz.pl/search?f=articles%20enwiki&safe=0

 I am using exactly the code example from your user guide:

 commons-compress/commons-compress_User Guide.html


    aBZ2IFl = IFl.getCanonicalPath();

    File OFl = new File(aOFlNm);
    aOFlNm = OFl.getCanonicalPath();
// __
    InputStream NwIS = Files.newInputStream(Paths.get(aBZ2IFl));
    BufferedInputStream BIS = new BufferedInputStream(NwIS);
    BZip2CompressorInputStream bz2IS = new BZip2CompressorInputStream(BIS);

    OutputStream NwOS = Files.newOutputStream(Paths.get(aOFlNm));
    int n = 0;
    while (-1 != (n = bz2IS.read(bArBfr))) {
      NwOS.write(bArBfr, 0, n);
      lTtlByts += n;
    }
    NwOS.close();
    bz2IS.close();

 but it stops abruptly:

// __ aOFlNm: |enwiki-20141008-pages-articles-multistream_20201012174009.440.xml|
// __ |2601| total bytes compressed into |12081280894| processed in
|2586| (ms), |1| (bytes/ms)

real	0m2.955s
user	0m2.996s
sys	0m0.176s
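
For reference, the read/write loop from the snippet above can be isolated into a small helper that returns the number of bytes it copied, which makes an early end of stream (such as the 2601 bytes reported here) easy to spot. This is only a sketch using the JDK alone: `StreamCopy` and `copy` are hypothetical names, and the in-memory stream in `main` merely stands in for the decompressor stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamCopy {

    /** Copies everything readable from in to out and returns the byte count. */
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        // read() returns -1 once the stream reports end of data
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in input; in the real run this would be the bzip2 input stream.
        byte[] data = new byte[20000];
        // try-with-resources closes both streams even if the copy fails
        try (InputStream in = new ByteArrayInputStream(data);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            long copied = copy(in, out);
            System.out.println(copied + " bytes copied");
        }
    }
}
```

The loop is the same as in the snippet; the only differences are the returned total and the try-with-resources block.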

~
_OFL="enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"

ls -l "${_OFL}"
wc -l "${_OFL}"

$ _OFL="enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"

$ ls -l "${_OFL}"
-r--r--r-- 1 lbrtchx lbrtchx 2601 Oct 12 17:40
enwiki-20141008-pages-articles-multistream_20201012174009.440.xml

$ wc -l "${_OFL}"
41 enwiki-20141008-pages-articles-multistream_20201012174009.440.xml

$ md5sum --text "${_OFL}"
75c87a6650433b5cea4fef0bdae1cc1f
enwiki-20141008-pages-articles-multistream_20201012174009.440.xml

$ sha1sum --text "${_OFL}"
2799934309372685af919c17798e78c1796637ef
enwiki-20141008-pages-articles-multistream_20201012174009.440.xml

$ file --brief  "${_OFL}"
ASCII text

$

// __ originally downloaded file checked and decompressed using Linux
bzip2 Version 1.0.6, 6-Sept-2010:

$ which bzip2
/bin/bzip2

_BZ2="bzip2_--version.txt"
bzip2 --version > "${_BZ2}" 2>&1
cat "${_BZ2}" | head -n 1
rm -f "${_BZ2}"

$ _BZ2="bzip2_--version.txt"
$ bzip2 --version > "${_BZ2}" 2>&1
$ cat "${_BZ2}" | head -n 1
bzip2, a block-sorting file compressor.  Version 1.0.6, 6-Sept-2010.
$ rm -f "${_BZ2}"

// __ "testing" bz2 file

$ _IFL="enwiki-20141008-pages-articles-multistream.xml.bz2"

$ time bzip2 --test --verbose "${_IFL}"
  enwiki-20141008-pages-articles-multistream.xml.bz2: ok

real    93m51.202s
user    92m31.600s
sys     0m35.188s

// __ decompressing bz2 file

$ time bzip2 --decompress --verbose --keep "${_IFL}"
  enwiki-20141008-pages-articles-multistream.xml.bz2: done

real    129m39.665s
user    108m15.368s
sys     7m18.684s
$

// __ decompressed file

_IFL="enwiki-20141008-pages-articles-multistream.xml"
ls -l "${_IFL}"
time wc -l "${_IFL}"
time md5sum --text "${_IFL}"
time sha1sum --text "${_IFL}"
file --brief  "${_IFL}"

$ _IFL="enwiki-20141008-pages-articles-multistream.xml"

$ ls -l "${_IFL}"
-r--r--r-- 1 lbrtchx lbrtchx 50151236957 Oct 22  2014
enwiki-20141008-pages-articles-multistream.xml

$ time wc -l "${_IFL}"
800855553 enwiki-20141008-pages-articles-multistream.xml

real    26m13.664s
user    1m3.308s
sys     1m30.616s

$ time md5sum --text "${_IFL}"
1cfabd688427728794e7ae75dc93e84c  enwiki-20141008-pages-articles-multistream.xml

real    27m39.208s
user    4m14.884s
sys     1m33.788s

$ time sha1sum --text "${_IFL}"
e337572c1957a5a4d7625e3180e16f20e77749b1
enwiki-20141008-pages-articles-multistream.xml

real    30m40.383s
user    8m39.852s
sys     1m32.864s

$ file --brief  "${_IFL}"
HTML document, UTF-8 Unicode text, with very long lines
$

// __ file decompressed using Commons Compress bz2 (decompressing worked fine!)

_IFL="enwiki-latest-pages-articles_20201013002000.103.xml"
ls -l "${_IFL}"
time wc -l "${_IFL}"
time md5sum --text "${_IFL}"
time sha1sum --text "${_IFL}"
file --brief  "${_IFL}"

$ _IFL="enwiki-latest-pages-articles_20201013002000.103.xml"

$ ls -l "${_IFL}"
-rw-r--r-- 1 lbrtchx lbrtchx 50151236957 Oct 13 03:35
enwiki-latest-pages-articles_20201013002000.103.xml

$ time wc -l "${_IFL}"
800855553 enwiki-latest-pages-articles_20201013002000.103.xml

real    14m44.535s
user    3m55.816s
sys     1m22.816s

$ time md5sum --text "${_IFL}"
1cfabd688427728794e7ae75dc93e84c
enwiki-latest-pages-articles_20201013002000.103.xml

real    16m14.680s
user    3m19.256s
sys     1m30.488s

$ time sha1sum --text "${_IFL}"
e337572c1957a5a4d7625e3180e16f20e77749b1
enwiki-latest-pages-articles_20201013002000.103.xml

real    17m45.103s
user    7m29.988s
sys     1m29.540s

$ file --brief  "${_IFL}"
HTML document, UTF-8 Unicode text, with very long lines

$


// __ file decompressed using Commons Compress bz2 (decompressing
somehow abruptly stopped)

_OFL="enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"

$ ls -l "enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
-r--r--r-- 1 lbrtchx lbrtchx 2601 Oct 12 17:40
enwiki-20141008-pages-articles-multistream_20201012174009.440.xml

$ wc -l "enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
41 enwiki-20141008-pages-articles-multistream_20201012174009.440.xml

$ cat "enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.9/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.9/
http://www.mediawiki.org/xml/export-0.9.xsd" version="0.9"
xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>http://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.25wmf1</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Special</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Talk</namespace>
      <namespace key="2" case="first-letter">User</namespace>
      <namespace key="3" case="first-letter">User talk</namespace>
      <namespace key="4" case="first-letter">Wikipedia</namespace>
      <namespace key="5" case="first-letter">Wikipedia talk</namespace>
      <namespace key="6" case="first-letter">File</namespace>
      <namespace key="7" case="first-letter">File talk</namespace>
      <namespace key="8" case="first-letter">MediaWiki</namespace>
      <namespace key="9" case="first-letter">MediaWiki talk</namespace>
      <namespace key="10" case="first-letter">Template</namespace>
      <namespace key="11" case="first-letter">Template talk</namespace>
      <namespace key="12" case="first-letter">Help</namespace>
      <namespace key="13" case="first-letter">Help talk</namespace>
      <namespace key="14" case="first-letter">Category</namespace>
      <namespace key="15" case="first-letter">Category talk</namespace>
      <namespace key="100" case="first-letter">Portal</namespace>
      <namespace key="101" case="first-letter">Portal talk</namespace>
      <namespace key="108" case="first-letter">Book</namespace>
      <namespace key="109" case="first-letter">Book talk</namespace>
      <namespace key="118" case="first-letter">Draft</namespace>
      <namespace key="119" case="first-letter">Draft talk</namespace>
      <namespace key="446" case="first-letter">Education Program</namespace>
      <namespace key="447" case="first-letter">Education Program
talk</namespace>
      <namespace key="710" case="first-letter">TimedText</namespace>
      <namespace key="711" case="first-letter">TimedText talk</namespace>
      <namespace key="828" case="first-letter">Module</namespace>
      <namespace key="829" case="first-letter">Module talk</namespace>
      <namespace key="2600" case="first-letter">Topic</namespace>
    </namespaces>
  </siteinfo>
$

// __ first 45 lines of decompressed file using Linux bzip2

_IFL="enwiki-20141008-pages-articles-multistream.xml"

head -n 45 "${_IFL}"

$ head -n 45 "${_IFL}"
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.9/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.9/
http://www.mediawiki.org/xml/export-0.9.xsd" version="0.9"
xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>http://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.25wmf1</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Special</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Talk</namespace>
      <namespace key="2" case="first-letter">User</namespace>
      <namespace key="3" case="first-letter">User talk</namespace>
      <namespace key="4" case="first-letter">Wikipedia</namespace>
      <namespace key="5" case="first-letter">Wikipedia talk</namespace>
      <namespace key="6" case="first-letter">File</namespace>
      <namespace key="7" case="first-letter">File talk</namespace>
      <namespace key="8" case="first-letter">MediaWiki</namespace>
      <namespace key="9" case="first-letter">MediaWiki talk</namespace>
      <namespace key="10" case="first-letter">Template</namespace>
      <namespace key="11" case="first-letter">Template talk</namespace>
      <namespace key="12" case="first-letter">Help</namespace>
      <namespace key="13" case="first-letter">Help talk</namespace>
      <namespace key="14" case="first-letter">Category</namespace>
      <namespace key="15" case="first-letter">Category talk</namespace>
      <namespace key="100" case="first-letter">Portal</namespace>
      <namespace key="101" case="first-letter">Portal talk</namespace>
      <namespace key="108" case="first-letter">Book</namespace>
      <namespace key="109" case="first-letter">Book talk</namespace>
      <namespace key="118" case="first-letter">Draft</namespace>
      <namespace key="119" case="first-letter">Draft talk</namespace>
      <namespace key="446" case="first-letter">Education Program</namespace>
      <namespace key="447" case="first-letter">Education Program
talk</namespace>
      <namespace key="710" case="first-letter">TimedText</namespace>
      <namespace key="711" case="first-letter">TimedText talk</namespace>
      <namespace key="828" case="first-letter">Module</namespace>
      <namespace key="829" case="first-letter">Module talk</namespace>
      <namespace key="2600" case="first-letter">Topic</namespace>
    </namespaces>
  </siteinfo>
  <page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
    <id>10</id>
$
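
A further check that was not part of the runs above: the truncated output ends exactly at `</siteinfo>`, and the file name contains "multistream". Assuming that means the dump is a concatenation of several independent bzip2 streams, the first bytes of the compressed file should contain more than one stream header. The sketch below counts such headers using the JDK alone. `Bzip2StreamScan` and `countStreamHeaders` are hypothetical names, and the match is a heuristic (the same 10 bytes could in principle occur inside compressed data).

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Bzip2StreamScan {

    // Block-header magic that follows "BZh" + level digit in a non-empty bzip2 stream.
    private static final int[] BLOCK_MAGIC = {0x31, 0x41, 0x59, 0x26, 0x53, 0x59};

    /** Counts byte-aligned bzip2 stream headers within the first maxBytes of in. */
    public static int countStreamHeaders(InputStream in, long maxBytes) throws IOException {
        int[] window = new int[10];   // the last 10 bytes read
        long seen = 0;
        int count = 0;
        int b;
        while (seen < maxBytes && (b = in.read()) != -1) {
            System.arraycopy(window, 1, window, 0, 9);  // slide the window by one byte
            window[9] = b;
            seen++;
            if (seen >= 10 && isHeader(window)) {
                count++;
            }
        }
        return count;
    }

    private static boolean isHeader(int[] w) {
        if (w[0] != 'B' || w[1] != 'Z' || w[2] != 'h') return false;
        if (w[3] < '1' || w[3] > '9') return false;    // block-size level digit
        for (int i = 0; i < BLOCK_MAGIC.length; i++) {
            if (w[4 + i] != BLOCK_MAGIC[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: Bzip2StreamScan <file.bz2>");
            return;
        }
        try (InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get(args[0])))) {
            // Scanning the first 16 MiB is an arbitrary limit chosen for this sketch.
            System.out.println(countStreamHeaders(in, 16L * 1024 * 1024) + " stream header(s) found");
        }
    }
}
```

If the scan reports more than one header, the one-argument constructor would be expected to stop at the end of the first stream. Commons Compress also offers `BZip2CompressorInputStream(InputStream, boolean)`, whose second argument, `decompressConcatenated`, asks it to continue through all concatenated streams. Whether that explains the behaviour seen here has not been tested.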


