As part of my corpora research I have to work with very large
text files. Wikipedia dumps are bzip2-compressed, so I have been working with:
commons/compress/compressors/bzip2/BZip2CompressorInputStream.html
and I consistently notice that it just stops processing without an
error of any kind.
I checked the file at the offset where it stops, and I also checked
the file with the Linux bzip2 utility; nothing seems to be wrong in
any way. The source file I used is:
enwiki-20141008-pages-articles.xml.bz2
which you can get from:
http://torrentz.pl/search?f=articles%20enwiki&safe=0
I am using exactly the code example from your user guide:
commons-compress/commons-compress_User Guide.html
aBZ2IFl = IFl.getCanonicalPath();
File OFl = new File(aOFlNm);
aOFlNm = OFl.getCanonicalPath();
// __
InputStream NwIS = Files.newInputStream(Paths.get(aBZ2IFl));
BufferedInputStream BIS = new BufferedInputStream(NwIS);
BZip2CompressorInputStream bz2IS = new BZip2CompressorInputStream(BIS);
OutputStream NwOS = Files.newOutputStream(Paths.get(aOFlNm));
int n = 0;
while (-1 != (n = bz2IS.read(bArBfr))) {
  NwOS.write(bArBfr, 0, n);
  lTtlByts += n;
}
NwOS.close();
bz2IS.close();
but it stops abruptly:
// __ aOFlNm: |enwiki-20141008-pages-articles-multistream_20201012174009.440.xml|
// __ |2601| total bytes compressed into |12081280894| processed in
|2586| (ms), |1| (bytes/ms)
real 0m2.955s
user 0m2.996s
sys 0m0.176s
~
_OFL="enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
ls -l "${_OFL}"
wc -l "${_OFL}"
$ _OFL="enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
$ ls -l "${_OFL}"
-r--r--r-- 1 lbrtchx lbrtchx 2601 Oct 12 17:40
enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
$ wc -l "${_OFL}"
41 enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
$ md5sum --text "${_OFL}"
75c87a6650433b5cea4fef0bdae1cc1f
enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
$ sha1sum --text "${_OFL}"
2799934309372685af919c17798e78c1796637ef
enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
$ file --brief "${_OFL}"
ASCII text
$
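For reference, here is a minimal sketch of the same copy loop written against plain java.io streams, so it can be exercised without the 12 GB dump. The class name StreamCopy and the sample input are only illustrative, not part of my actual program. The compressor stream would be passed in as the InputStream; try-with-resources closes both ends even if a read fails, and the long return value gives the byte count that my log line reports.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class StreamCopy {

    /** Copies everything readable from in to out and returns the byte count. */
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;          // long, since the real output is about 50 GB
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Illustrative in-memory input; a decompressing stream would go here instead.
        byte[] sample = "hello, wikipedia".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(sample);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            long copied = copy(in, out);
            System.out.println(copied + " bytes copied");
        }
    }
}
```

A loop like this ends normally as soon as read() returns -1, which is consistent with what I see: no exception, just a short output file.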
// __ originally downloaded file checked and decompressed using Linux
bzip2 Version 1.0.6, 6-Sept-2010:
$ which bzip2
/bin/bzip2
_BZ2="bzip2_--version.txt"
bzip2 --version > "${_BZ2}" 2>&1
cat "${_BZ2}" | head -n 1
rm -f "${_BZ2}"
$ _BZ2="bzip2_--version.txt"
$ bzip2 --version > "${_BZ2}" 2>&1
$ cat "${_BZ2}" | head -n 1
bzip2, a block-sorting file compressor. Version 1.0.6, 6-Sept-2010.
$ rm -f "${_BZ2}"
// __ "testing" bz2 file
$ _IFL="enwiki-20141008-pages-articles-multistream.xml.bz2"
$ time bzip2 --test --verbose "${_IFL}"
enwiki-20141008-pages-articles-multistream.xml.bz2: ok
real 93m51.202s
user 92m31.600s
sys 0m35.188s
// __ decompressing bz2 file
$ time bzip2 --decompress --verbose --keep "${_IFL}"
enwiki-20141008-pages-articles-multistream.xml.bz2: done
real 129m39.665s
user 108m15.368s
sys 7m18.684s
$
// __ decompressed file
_IFL="enwiki-20141008-pages-articles-multistream.xml"
ls -l "${_IFL}"
time wc -l "${_IFL}"
time md5sum --text "${_IFL}"
time sha1sum --text "${_IFL}"
file --brief "${_IFL}"
$ _IFL="enwiki-20141008-pages-articles-multistream.xml"
$ ls -l "${_IFL}"
-r--r--r-- 1 lbrtchx lbrtchx 50151236957 Oct 22 2014
enwiki-20141008-pages-articles-multistream.xml
$ time wc -l "${_IFL}"
800855553 enwiki-20141008-pages-articles-multistream.xml
real 26m13.664s
user 1m3.308s
sys 1m30.616s
$ time md5sum --text "${_IFL}"
1cfabd688427728794e7ae75dc93e84c enwiki-20141008-pages-articles-multistream.xml
real 27m39.208s
user 4m14.884s
sys 1m33.788s
$ time sha1sum --text "${_IFL}"
e337572c1957a5a4d7625e3180e16f20e77749b1
enwiki-20141008-pages-articles-multistream.xml
real 30m40.383s
user 8m39.852s
sys 1m32.864s
$ file --brief "${_IFL}"
HTML document, UTF-8 Unicode text, with very long lines
$
// __ file decompressed using Commons Compress bz2 (decompressing worked fine!)
_IFL="enwiki-latest-pages-articles_20201013002000.103.xml"
ls -l "${_IFL}"
time wc -l "${_IFL}"
time md5sum --text "${_IFL}"
time sha1sum --text "${_IFL}"
file --brief "${_IFL}"
$ _IFL="enwiki-latest-pages-articles_20201013002000.103.xml"
$ ls -l "${_IFL}"
-rw-r--r-- 1 lbrtchx lbrtchx 50151236957 Oct 13 03:35
enwiki-latest-pages-articles_20201013002000.103.xml
$ time wc -l "${_IFL}"
800855553 enwiki-latest-pages-articles_20201013002000.103.xml
real 14m44.535s
user 3m55.816s
sys 1m22.816s
$ time md5sum --text "${_IFL}"
1cfabd688427728794e7ae75dc93e84c
enwiki-latest-pages-articles_20201013002000.103.xml
real 16m14.680s
user 3m19.256s
sys 1m30.488s
$ time sha1sum --text "${_IFL}"
e337572c1957a5a4d7625e3180e16f20e77749b1
enwiki-latest-pages-articles_20201013002000.103.xml
real 17m45.103s
user 7m29.988s
sys 1m29.540s
$ file --brief "${_IFL}"
HTML document, UTF-8 Unicode text, with very long lines
$
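The md5sum/sha1sum comparison above can also be done from Java, which is handy when the check has to run inside the same program that does the decompression. This is only a sketch using the JDK's MessageDigest; the class name FileDigest and the temporary files in main are illustrative assumptions. It streams the file through the digest, so memory use stays constant even for a 50 GB file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileDigest {

    /** Streams the file through the named digest and returns lowercase hex. */
    public static String hexDigest(Path file, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        byte[] buffer = new byte[1 << 16];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                md.update(buffer, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Two small temporary files stand in for the two decompressed dumps.
        Path a = Files.createTempFile("digest-a", ".txt");
        Path b = Files.createTempFile("digest-b", ".txt");
        try {
            Files.write(a, "abc".getBytes(StandardCharsets.US_ASCII));
            Files.write(b, "abc".getBytes(StandardCharsets.US_ASCII));
            boolean same = hexDigest(a, "SHA-1").equals(hexDigest(b, "SHA-1"));
            System.out.println("digests match: " + same);
        } finally {
            Files.deleteIfExists(a);
            Files.deleteIfExists(b);
        }
    }
}
```

Calling hexDigest with "MD5" or "SHA-1" should give the same hex strings as md5sum and sha1sum for the same file.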
// __ file decompressed using Commons Compress bz2 (decompression
stopped abruptly, without any error)
_OFL="enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
$ ls -l "enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
-r--r--r-- 1 lbrtchx lbrtchx 2601 Oct 12 17:40
enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
$ wc -l "enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
41 enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
$ cat "enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.9/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.9/
http://www.mediawiki.org/xml/export-0.9.xsd" version="0.9"
xml:lang="en">
<siteinfo>
<sitename>Wikipedia</sitename>
<dbname>enwiki</dbname>
<base>http://en.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.25wmf1</generator>
<case>first-letter</case>
<namespaces>
<namespace key="-2" case="first-letter">Media</namespace>
<namespace key="-1" case="first-letter">Special</namespace>
<namespace key="0" case="first-letter" />
<namespace key="1" case="first-letter">Talk</namespace>
<namespace key="2" case="first-letter">User</namespace>
<namespace key="3" case="first-letter">User talk</namespace>
<namespace key="4" case="first-letter">Wikipedia</namespace>
<namespace key="5" case="first-letter">Wikipedia talk</namespace>
<namespace key="6" case="first-letter">File</namespace>
<namespace key="7" case="first-letter">File talk</namespace>
<namespace key="8" case="first-letter">MediaWiki</namespace>
<namespace key="9" case="first-letter">MediaWiki talk</namespace>
<namespace key="10" case="first-letter">Template</namespace>
<namespace key="11" case="first-letter">Template talk</namespace>
<namespace key="12" case="first-letter">Help</namespace>
<namespace key="13" case="first-letter">Help talk</namespace>
<namespace key="14" case="first-letter">Category</namespace>
<namespace key="15" case="first-letter">Category talk</namespace>
<namespace key="100" case="first-letter">Portal</namespace>
<namespace key="101" case="first-letter">Portal talk</namespace>
<namespace key="108" case="first-letter">Book</namespace>
<namespace key="109" case="first-letter">Book talk</namespace>
<namespace key="118" case="first-letter">Draft</namespace>
<namespace key="119" case="first-letter">Draft talk</namespace>
<namespace key="446" case="first-letter">Education Program</namespace>
<namespace key="447" case="first-letter">Education Program
talk</namespace>
<namespace key="710" case="first-letter">TimedText</namespace>
<namespace key="711" case="first-letter">TimedText talk</namespace>
<namespace key="828" case="first-letter">Module</namespace>
<namespace key="829" case="first-letter">Module talk</namespace>
<namespace key="2600" case="first-letter">Topic</namespace>
</namespaces>
</siteinfo>
$
// __ first 45 lines of decompressed file using Linux bzip2
_IFL="enwiki-20141008-pages-articles-multistream.xml"
head -n 45 "${_IFL}"
$ head -n 45 "${_IFL}"
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.9/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.9/
http://www.mediawiki.org/xml/export-0.9.xsd" version="0.9"
xml:lang="en">
<siteinfo>
<sitename>Wikipedia</sitename>
<dbname>enwiki</dbname>
<base>http://en.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.25wmf1</generator>
<case>first-letter</case>
<namespaces>
<namespace key="-2" case="first-letter">Media</namespace>
<namespace key="-1" case="first-letter">Special</namespace>
<namespace key="0" case="first-letter" />
<namespace key="1" case="first-letter">Talk</namespace>
<namespace key="2" case="first-letter">User</namespace>
<namespace key="3" case="first-letter">User talk</namespace>
<namespace key="4" case="first-letter">Wikipedia</namespace>
<namespace key="5" case="first-letter">Wikipedia talk</namespace>
<namespace key="6" case="first-letter">File</namespace>
<namespace key="7" case="first-letter">File talk</namespace>
<namespace key="8" case="first-letter">MediaWiki</namespace>
<namespace key="9" case="first-letter">MediaWiki talk</namespace>
<namespace key="10" case="first-letter">Template</namespace>
<namespace key="11" case="first-letter">Template talk</namespace>
<namespace key="12" case="first-letter">Help</namespace>
<namespace key="13" case="first-letter">Help talk</namespace>
<namespace key="14" case="first-letter">Category</namespace>
<namespace key="15" case="first-letter">Category talk</namespace>
<namespace key="100" case="first-letter">Portal</namespace>
<namespace key="101" case="first-letter">Portal talk</namespace>
<namespace key="108" case="first-letter">Book</namespace>
<namespace key="109" case="first-letter">Book talk</namespace>
<namespace key="118" case="first-letter">Draft</namespace>
<namespace key="119" case="first-letter">Draft talk</namespace>
<namespace key="446" case="first-letter">Education Program</namespace>
<namespace key="447" case="first-letter">Education Program
talk</namespace>
<namespace key="710" case="first-letter">TimedText</namespace>
<namespace key="711" case="first-letter">TimedText talk</namespace>
<namespace key="828" case="first-letter">Module</namespace>
<namespace key="829" case="first-letter">Module talk</namespace>
<namespace key="2600" case="first-letter">Topic</namespace>
</namespaces>
</siteinfo>
<page>
<title>AccessibleComputing</title>
<ns>0</ns>
<id>10</id>
$
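One thing worth checking, going by the "multistream" in the file name: the truncated output ends exactly at </siteinfo>, which is what one would expect if the file is several bzip2 streams concatenated and only the first one was read. As far as I know, the single-argument BZip2CompressorInputStream constructor stops after the first stream, while the two-argument form BZip2CompressorInputStream(InputStream, boolean decompressConcatenated) with true keeps going until the end of the input. The sketch below is a heuristic diagnostic, not part of Commons Compress. It uses only the JDK and counts byte-aligned occurrences of the bzip2 stream header ("BZh" plus a block-size digit) immediately followed by the block magic 0x314159265359. The class name Bzip2StreamCount is hypothetical, empty streams (which have no block) are not counted, and a chance match inside compressed data is possible in principle, though very unlikely.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Bzip2StreamCount {

    // The 6-byte magic that starts every bzip2 compressed block.
    private static final int[] BLOCK_MAGIC = {0x31, 0x41, 0x59, 0x26, 0x53, 0x59};

    /** Counts byte-aligned "BZh[1-9]" headers that are followed by a block magic. */
    public static long countStreamHeaders(InputStream in) throws IOException {
        long count = 0;
        int state = 0;  // number of signature bytes matched so far (0..9)
        int b;
        while ((b = in.read()) != -1) {
            boolean matches;
            switch (state) {
                case 0:  matches = b == 'B'; break;
                case 1:  matches = b == 'Z'; break;
                case 2:  matches = b == 'h'; break;
                case 3:  matches = b >= '1' && b <= '9'; break;
                default: matches = b == BLOCK_MAGIC[state - 4];
            }
            if (matches) {
                state++;
                if (state == 10) {   // full 10-byte signature seen
                    count++;
                    state = 0;
                }
            } else {
                // 'B' occurs only at the start of the signature, so a
                // mismatch can at most begin a new candidate.
                state = (b == 'B') ? 1 : 0;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Bzip2StreamCount <file.bz2>");
            return;
        }
        try (InputStream in = new BufferedInputStream(
                Files.newInputStream(Paths.get(args[0])), 1 << 16)) {
            System.out.println(countStreamHeaders(in) + " stream header(s) found");
        }
    }
}
```

A count greater than one for the .bz2 file would support the concatenated-streams explanation. In that case, passing true as the second constructor argument is the first thing I would try.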