xml-general mailing list archives

From Aleksander Slominski <as...@cs.indiana.edu>
Subject Re: Progressive parsing
Date Sun, 01 Sep 2002 22:25:02 GMT

a typical decoder (such as UTF-8) takes a byte stream as input and converts it
into a char reader, and it is _not_ bi-directional. so even if i know that the
current char is 'x', the decoder will not tell me the byte position(s) of that
character in the original stream.

you could try to overcome this by keeping a reference to the original
byte stream and reading the position directly from that stream. however,
this will not work either, as any decent decoder and parser will run
over buffered input (and will try to "fix" unbuffered input by wrapping it in
a buffer), or will even do its own buffering, as Xerces2 does and, i think,
SUN's built-in UTF-8 decoder too (AFAIR it made it very hard to write a Jabber
client, as the UTF-8 decoder tried to read too much and blocked ...). in such
cases buffering makes the byte stream position pretty much useless, as the
logical position does not necessarily correspond to the physical byte stream
position (the byte stream will typically be ahead of the character currently
available from the decoder). moreover, even if you manage to turn off
buffering, it will degrade the performance of the parser a lot - so you may be
able to read physical positions, but overall performance will be really bad ...
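
to make the read-ahead concrete, here is a minimal sketch (not Xerces code).
it wraps the input in a byte-counting stream and reads a single char through
java.io.InputStreamReader. the names ReadAheadDemo, CountingInputStream and
bytesConsumedForOneChar are made up for this illustration, and how far the
decoder reads ahead is an implementation detail of the JDK in use:

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ReadAheadDemo {

    /** Hypothetical helper: counts the bytes pulled from the wrapped stream. */
    static final class CountingInputStream extends FilterInputStream {
        long count = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) count++;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) count += n;
            return n;
        }
    }

    /** Reads exactly one char through a UTF-8 decoder; returns bytes taken from the stream. */
    static long bytesConsumedForOneChar(byte[] input) {
        CountingInputStream counting =
                new CountingInputStream(new ByteArrayInputStream(input));
        Reader reader = new InputStreamReader(counting, StandardCharsets.UTF_8);
        try {
            reader.read(); // the caller sees one char ...
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counting.count; // ... but the decoder may have pulled many more bytes
    }

    public static void main(String[] args) {
        byte[] xml = "<doc><a>some text</a><b>more text</b></doc>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes consumed for 1 char: " + bytesConsumedForOneChar(xml));
    }
}
```

on the JDKs i know of, the count is well above one byte for a single char,
which is exactly why the position of the underlying stream cannot be used as
the position of the current character.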

all of this is true when trying to find a general solution to the problem.
however, the situation is not that bad for simple encodings such as ISO-8859-1,
where you can always get the original byte stream position as there is a
one-to-one correspondence between bytes and characters. the main difficulty is
with variable-width encodings such as UTF-8 - so if you are willing to send
input as ISO-8859-1 or UTF-16 (two bytes per Java char) or any other encoding
with a fixed number of bytes per char you should be fine (and can easily
convert a logical position to a physical byte stream position ...)
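
as a rough sketch of that conversion (ByteOffsets, fixedWidthOffset and
utf8Offset are hypothetical names, not part of Xerces or the JDK): with a
fixed-width encoding the byte offset is a multiplication, plus the length of a
byte order mark if there is one, while for UTF-8 every character before the
position has to be inspected. the sketch assumes well-formed text and a char
index that does not fall in the middle of a surrogate pair:

```java
import java.nio.charset.StandardCharsets;

final class ByteOffsets {

    /** Fixed-width case: byte offset of the char at charIndex (index counted in Java chars). */
    static long fixedWidthOffset(long charIndex, int bytesPerChar, int bomLength) {
        return bomLength + charIndex * bytesPerChar;
    }

    /** UTF-8 case: walk all code points before charIndex and sum their encoded lengths. */
    static long utf8Offset(CharSequence text, int charIndex) {
        long bytes = 0;
        int i = 0;
        while (i < charIndex) {
            int cp = Character.codePointAt(text, i);
            if (cp < 0x80) {
                bytes += 1;          // ASCII
            } else if (cp < 0x800) {
                bytes += 2;
            } else if (cp < 0x10000) {
                bytes += 3;
            } else {
                bytes += 4;          // supplementary plane, two Java chars
            }
            i += Character.charCount(cp);
        }
        return bytes;
    }

    public static void main(String[] args) {
        String text = "a\u00e9\u20acb"; // 'a', e-acute, euro sign, 'b'
        int index = 3;                  // position of 'b' in chars

        System.out.println("ISO-8859-1: " + fixedWidthOffset(index, 1, 0));
        System.out.println("UTF-16 with 2-byte BOM: " + fixedWidthOffset(index, 2, 2));
        System.out.println("UTF-8: " + utf8Offset(text, index));

        // cross-check against the JDK encoder
        int actual = text.substring(0, index).getBytes(StandardCharsets.UTF_8).length;
        System.out.println("UTF-8 via getBytes: " + actual);
    }
}
```

the fixed-width formula costs nothing at parse time, while the UTF-8 version
is linear in the amount of text before the position, which is the cost that a
skip by character count would have to pay.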

finally, as the source code of Xerces2 is available, you can always change
the code to suit your needs (but it will take some time to get a good solution,
especially if you care about performance, which seems to be the case ...)

hope it helps,


Paul Libbrecht wrote:

> Well,
> Having a position according to an encoding is honestly, simply... bad.
> One of the goal applications was to be able to be a client of such an
> indexed-database over http/1.1. The latter protocol has a way to request
> only a row of segments of a file. But that can only happen in bytes of
> course.
> When doing it with files, one expects to use, say, the
> InputStream.skip() method which is, hopefully, efficiently implemented
> and skips the cursor in the file-reading underlying routines.
> Skipping x characters using an encoding is simply a killer: the encoding
> has to run through all the characters. For example, in UTF-8, skipping
> an escaped character means skipping three bytes (I think) whereas
> skipping an ASCII character means skipping one byte.
> So... I really meant: "Can I get the byte-position".
> Currently, the only way is to build this index using a
> "load-in-memory-then-rewrite-to-file"... I can live with this but I
> would have expected "fine parsers" to provide more.
> Paul
> On Tuesday, August 27, 2002, at 04:42 , Aleksander Slominski wrote:
> >> Finally... to xerces makers/users: how do I get the byte position of an
> >> element declaration I've just been handed to by the sax parser ?
> >
> > this is more complex, as the parser works on UTF-16 characters (char),
> > so obtaining the position in the original stream, if it was not UTF-16,
> > is very difficult. however, i think that for your case it is enough to get
> > the position of the start/end element in the character stream. the ability
> > to obtain a position is not currently part of xerces2, but you can take a
> > look at my patch that adds to XMLLocator the function
> > getCurrentEntityAbsoluteOffset(), which can be used to get the current
> > position of the parser. together with changes to
> > XMLDocumentFragmentScannerImpl, it makes it possible to get the start/end
> > position of every XML event in XNI. for details see:
> >
> > http://www.extreme.indiana.edu/xgws/xsoap/xpp/download/PullParser2/lib/xerces2_patched/
> ---------------------------------------------------------------------
> In case of troubles, e-mail:     webmaster@xml.apache.org
> To unsubscribe, e-mail:          general-unsubscribe@xml.apache.org
> For additional commands, e-mail: general-help@xml.apache.org
