lucene-java-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From mikemcc...@apache.org
Subject svn commit: r763416 - in /lucene/java/trunk/contrib/benchmark: README.enwiki build.xml lib/xerces-2.9.0.jar lib/xerces-2.9.1-patched-XERCESJ-1257.jar
Date Wed, 08 Apr 2009 21:50:34 GMT
Author: mikemccand
Date: Wed Apr  8 21:50:33 2009
New Revision: 763416

URL: http://svn.apache.org/viewvc?rev=763416&view=rev
Log:
LUCENE-1591: workaround bug in xerces so we can process Wikipedia's XML

Added:
    lucene/java/trunk/contrib/benchmark/lib/xerces-2.9.1-patched-XERCESJ-1257.jar   (with
props)
Removed:
    lucene/java/trunk/contrib/benchmark/lib/xerces-2.9.0.jar
Modified:
    lucene/java/trunk/contrib/benchmark/README.enwiki
    lucene/java/trunk/contrib/benchmark/build.xml

Modified: lucene/java/trunk/contrib/benchmark/README.enwiki
URL: http://svn.apache.org/viewvc/lucene/java/trunk/contrib/benchmark/README.enwiki?rev=763416&r1=763415&r2=763416&view=diff
==============================================================================
--- lucene/java/trunk/contrib/benchmark/README.enwiki (original)
+++ lucene/java/trunk/contrib/benchmark/README.enwiki Wed Apr  8 21:50:33 2009
@@ -20,3 +20,50 @@
 test. Ant targets get-enwiki, expand-enwiki, and extract-enwiki can
 also be used to download, decompress, and extract (to individual files
 in work/enwiki) the dataset, respectively.
+
+NOTE: This bug in Xerces:
+
+  https://issues.apache.org/jira/browse/XERCESJ-1257
+
+which is still present as of 2.9.1, causes an exception like this when
+processing Wikipedia's XML:
+
+Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte
UTF-8 sequence.
+	at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
+	at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
+	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
+	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
+	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
Source)
+	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
+	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
+	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
+	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
+	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
+	at org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker$Parser.run(EnwikiDocMaker.java:77)
+	... 1 more
+
+The original poster in the Xerces bug provided this patch:
+
+--- UTF8Reader.java	2006-11-23 00:36:53.000000000 +0100
++++ /home/rainman/lucene/xerces-2_9_0/src/org/apache/xerces/impl/io/UTF8Reader.java	2008-04-04
00:40:58.000000000 +0200
+@@ -534,6 +534,16 @@
+                     invalidByte(4, 4, b2);
+                 }
+ 
++                // check if output buffer is large enough to hold 2 surrogate chars
++                if( out + 1 >= offset + length ){
++                    fBuffer[0] = (byte)b0;
++                    fBuffer[1] = (byte)b1;
++                    fBuffer[2] = (byte)b2;
++                    fBuffer[3] = (byte)b3;
++                    fOffset = 4;
++                    return out - offset;
++		}
++
+                 // decode bytes into surrogate characters
+                 int uuuuu = ((b0 << 2) & 0x001C) | ((b1 >> 4) & 0x0003);
+                 if (uuuuu > 0x10) {
+
+which I've applied to Xerces 2.9.1 sources, and committed under
+lib/xerces-2.9.1-patched-XERCESJ-1257.jar.  Once XERCESJ-1257 is fixed
+we can upgrade to a standard Xerces release.

Modified: lucene/java/trunk/contrib/benchmark/build.xml
URL: http://svn.apache.org/viewvc/lucene/java/trunk/contrib/benchmark/build.xml?rev=763416&r1=763415&r2=763416&view=diff
==============================================================================
--- lucene/java/trunk/contrib/benchmark/build.xml (original)
+++ lucene/java/trunk/contrib/benchmark/build.xml Wed Apr  8 21:50:33 2009
@@ -104,7 +104,7 @@
     <property name="collections.jar" value="commons-collections-3.1.jar"/>
     <property name="logging.jar" value="commons-logging-1.0.4.jar"/>
     <property name="bean-utils.jar" value="commons-beanutils-1.7.0.jar"/>
-    <property name="xercesImpl.jar" value="xerces-2.9.0.jar"/>
+    <property name="xercesImpl.jar" value="xerces-2.9.1-patched-XERCESJ-1257.jar"/>
     <property name="xml-apis.jar" value="xml-apis-2.9.0.jar"/>
 
     <path id="classpath">

Added: lucene/java/trunk/contrib/benchmark/lib/xerces-2.9.1-patched-XERCESJ-1257.jar
URL: http://svn.apache.org/viewvc/lucene/java/trunk/contrib/benchmark/lib/xerces-2.9.1-patched-XERCESJ-1257.jar?rev=763416&view=auto
==============================================================================
Binary file - no diff available.

Propchange: lucene/java/trunk/contrib/benchmark/lib/xerces-2.9.1-patched-XERCESJ-1257.jar
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream



Mime
View raw message