lucene-dev mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: Benchmark: EnwikiDocMaker does not use fileIn (BufferedReader)
Date Sat, 11 Apr 2009 03:02:01 GMT
Thanks Uwe. Then I think we should at least wrap the InputStream in a
BufferedInputStream in EnwikiDocMaker (buffering is what I wanted to achieve
in the first place by reusing LDM's BufferedReader)?
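
A minimal sketch of that suggestion, not the actual EnwikiDocMaker code: the
class name BufferedXmlSource and the helper countElements are hypothetical,
and a plain SAX DefaultHandler stands in for the doc maker's real handler.
The point is only that the parser still receives a byte stream, so it can
honor the encoding declaration, while reads are buffered.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BufferedXmlSource {

    /** Parses the file as a buffered byte stream; returns the number of elements seen. */
    static int countElements(File xmlFile) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                count[0]++;
            }
        };
        // Buffer the raw bytes; do NOT wrap in a Reader, so the parser
        // decides the character encoding from the XML declaration itself.
        try (InputStream in = new BufferedInputStream(new FileInputStream(xmlFile))) {
            SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(in), handler);
        }
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample input written to a temp file for the demo.
        File tmp = File.createTempFile("sample", ".xml");
        try (OutputStream out = new FileOutputStream(tmp)) {
            out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?><a><b/><b/></a>"
                    .getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(countElements(tmp)); // prints 3
        tmp.delete();
    }
}
```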

On Fri, Apr 10, 2009 at 10:22 AM, Uwe Schindler <uwe@thetaphi.de> wrote:

>  Hi Shai,
>
>
>
> with XML parsers you should generally avoid using Readers, unless you know
> for certain that the actual encoding of the XML is the one given to the
> Reader. Readers should only be passed for sources that are independent of
> the encoding (like Java Strings containing XML, and without an encoding
> declaration!).
>
>
>
> Good examples of correctly using a Reader are:
>
> - new InputSource(new StringReader("<tag>….</tag>"));  // no XML declaration
>
> - XML serialized from SAX/DOM directly to a Writer (so it carries no
> encoding), e.g. stored in a Lucene stored String field.
>
>
>
> But documents from an unknown source should always be handled as byte
> streams. The XML parser must be able to switch the encoding according to the
> declaration it finds in the XML header; this is not possible with Readers.
>
>
>
> Uwe
>
>
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>   ------------------------------
>
> *From:* Shai Erera [mailto:serera@gmail.com]
> *Sent:* Friday, April 10, 2009 8:47 AM
> *To:* java-dev@lucene.apache.org
> *Subject:* Benchmark: EnwikiDocMaker does not use fileIn (BufferedReader)
>
>
>
> I started working on the patch for 1591, and noticed EnwikiDocMaker uses
> the FileInputStream instance from LineDocMaker and not the BufferedReader. I
> don't see any reason for this, as InputSource accepts a Reader. I can change
> it as part of 1591, unless you think I'm missing something.
>
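
The encoding point in the quoted reply can be illustrated with a small
sketch. The class EncodingDemo and its method names are hypothetical and the
sample document is made up; only the standard JAXP and org.xml.sax APIs are
used. The same UTF-8 bytes are parsed twice: once as a byte stream, where the
parser reads the declaration and decodes correctly, and once through a Reader
created with the wrong charset, where the parser reads the characters as
given and disregards the declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class EncodingDemo {

    // A document that declares UTF-8 and really is UTF-8; the text is "é" (U+00E9).
    static final byte[] XML_BYTES =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?><t>\u00e9</t>"
                    .getBytes(StandardCharsets.UTF_8);

    /** Parses the source and returns the text content of the root element. */
    static String textOf(InputSource source) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(source).getDocumentElement().getTextContent();
    }

    /** Byte stream: the parser picks the encoding from the XML declaration. */
    static String viaByteStream() throws Exception {
        return textOf(new InputSource(new ByteArrayInputStream(XML_BYTES)));
    }

    /** Reader with a mismatched charset: the bytes are decoded before the parser sees them. */
    static String viaMismatchedReader() throws Exception {
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(XML_BYTES), StandardCharsets.ISO_8859_1);
        return textOf(new InputSource(reader));
    }

    public static void main(String[] args) throws Exception {
        System.out.println("\u00e9".equals(viaByteStream()));       // prints true
        System.out.println("\u00e9".equals(viaMismatchedReader())); // prints false
    }
}
```

With the Reader, the two UTF-8 bytes of "é" arrive at the parser as two
separate Latin-1 characters, and nothing in the parser can correct that
afterwards.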
