lucene-dev mailing list archives

From "Uwe Schindler" <>
Subject RE: Benchmark: EnwikiDocMaker does not use fileIn (BufferedReader)
Date Fri, 10 Apr 2009 07:22:29 GMT
Hi Shai,


With XML parsers you should generally avoid Readers, unless you know for
certain that the actual encoding of the XML is the one the Reader was
created with. Readers should only be passed as parameters for sources that
are independent of the encoding (like Java Strings containing XML, and
without an encoding declaration!).


Good examples of correctly using a Reader are:

- new InputSource(new StringReader("<tag>..</tag>"));  // no xml declaration

- XML that was itself serialized from SAX/DOM to a Writer (so it carries no
encoding), e.g. stored in a Lucene stored String field.


But documents from an unknown source should always be handled as byte
streams. The XML parser must be able to switch the encoding according to the
declaration it finds in the XML header, and this is not possible with
Readers, because a Reader has already decoded the bytes into characters.
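
To illustrate the difference, here is a minimal sketch using only the JDK's
SAX API. The class name XmlEncodingDemo, the helper parseText, and the sample
document are made up for this example and are not part of the benchmark code.
The same ISO-8859-1 bytes are parsed once as a byte stream and once through a
Reader that was created with the wrong charset (UTF-8):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not taken from Lucene or the benchmark package.
class XmlEncodingDemo {

    // Parses the source and returns the concatenated character data.
    static String parseText(InputSource source) throws Exception {
        final StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample document (an assumption for this sketch): declared and
        // actually encoded as ISO-8859-1, with one non-ASCII character.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<city>M\u00fcnchen</city>";
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);

        // Byte stream: the parser reads the declaration and decodes the bytes itself.
        String fromBytes = parseText(new InputSource(new ByteArrayInputStream(bytes)));

        // Reader: the bytes are decoded as UTF-8 before the parser ever sees
        // them, so the declaration can no longer correct the mistake.
        Reader wrongReader =
                new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
        String fromReader = parseText(new InputSource(wrongReader));

        System.out.println("byte stream intact: " + fromBytes.equals("M\u00fcnchen"));
        System.out.println("reader intact:      " + fromReader.equals("M\u00fcnchen"));
    }
}
```

With the byte stream the text comes back unchanged. With the mismatched
Reader the non-ASCII character is already damaged when it reaches the
parser, and no exception points at the problem.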




Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen


From: Shai Erera [] 
Sent: Friday, April 10, 2009 8:47 AM
Subject: Benchmark: EnwikiDocMaker does not use fileIn (BufferedReader)


I started working on the patch for 1591, and noticed EnwikiDocMaker uses the
FileInputStream instance from LineDocMaker and not the BufferedReader. I
don't see any reason for this, as InputSource accepts a Reader. I can change
it as part of 1591, unless you think I'm missing something.
