nutch-user mailing list archives

From Sean Dean <seand...@link.enhancededge.com>
Subject Corrupt GZIP trailer
Date Fri, 13 May 2005 06:22:47 GMT
Hello Everyone,

I'm currently unable to create an index for anything with more than about 20000
records.

I get the following error output:

link# bin/nutch index segments/20050511025841
expr: syntax error
050513 021102 parsing file:/usr/local/nutch/conf/nutch-default.xml
050513 021102 parsing file:/usr/local/nutch/conf/nutch-site.xml
050513 021102 No FS indicated, using default:local
050513 021102 indexing segment: segments/20050511025841
050513 021103 * Opening segment 20050511025841
050513 021103 * Indexing segment 20050511025841
050513 021103 Plugins: looking in: /usr/local/nutch/plugins
050513 021103 parsing: /usr/local/nutch/plugins/query-site/plugin.xml
050513 021103 not including: /usr/local/nutch/plugins/parse-ext
050513 021103 not including: /usr/local/nutch/plugins/ontology
050513 021103 parsing: /usr/local/nutch/plugins/protocol-http/plugin.xml
050513 021103 not including: /usr/local/nutch/plugins/parse-pdf
050513 021103 parsing: /usr/local/nutch/plugins/index-basic/plugin.xml
050513 021103 parsing: /usr/local/nutch/plugins/parse-text/plugin.xml
050513 021103 parsing: /usr/local/nutch/plugins/query-url/plugin.xml
050513 021103 not including: /usr/local/nutch/plugins/clustering-carrot2
050513 021103 not including: /usr/local/nutch/plugins/parse-msword
050513 021103 not including: /usr/local/nutch/plugins/query-more
050513 021103 parsing: /usr/local/nutch/plugins/urlfilter-regex/plugin.xml
050513 021104 not including: /usr/local/nutch/plugins/urlfilter-prefix
050513 021104 not including: /usr/local/nutch/plugins/creativecommons
050513 021104 parsing: /usr/local/nutch/plugins/query-basic/plugin.xml
050513 021104 not including: /usr/local/nutch/plugins/language-identifier
050513 021104 parsing: /usr/local/nutch/plugins/parse-html/plugin.xml
050513 021104 found resource common-terms.utf8 at file:/usr/local/nutch/conf/common-terms.utf8
050513 021444  Processed 20000 records (90.39344 rec/s)
Exception in thread "main" java.io.IOException: Corrupt GZIP trailer
        at java.util.zip.GZIPInputStream.readTrailer(GZIPInputStream.java:174)
        at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:89)
        at org.apache.nutch.io.WritableUtils.readCompressedByteArray(WritableUtils.java:34)
        at org.apache.nutch.io.WritableUtils.readCompressedString(WritableUtils.java:64)
        at org.apache.nutch.parse.ParseText.readFields(ParseText.java:43)
        at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:278)
        at org.apache.nutch.io.MapFile$Reader.next(MapFile.java:335)
        at org.apache.nutch.io.ArrayFile$Reader.next(ArrayFile.java:61)
        at org.apache.nutch.segment.SegmentReader.next(SegmentReader.java:333)
        at org.apache.nutch.indexer.IndexSegment.indexPages(IndexSegment.java:130)
        at org.apache.nutch.indexer.IndexSegment.main(IndexSegment.java:254)


I have read up on this issue and even found a suggested fix
(http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4262583), but it didn't work
for me. Has anyone else come across this problem and found a solution?

I have tried running it on both FreeBSD 5.4 with JDK 1.4.2 and Windows (Cygwin)
with JDK 1.4.2 and 1.5, and I get the same error shown above in every case.

Thanks,

Sean
