From: Aleksander Slominski
To: general@xml.apache.org
Reply-To: general@xml.apache.org
Subject: Re: Progressive parsing
Date: Sun, 01 Sep 2002 18:25:02 -0400
Message-ID: <3D7293BE.F509506A@cs.indiana.edu>
References: <160C0797-BCE2-11D6-82AB-0003934D43BA@activemath.org>

hi,

A typical decoder (such as UTF-8) takes a byte stream as input and converts it into a char reader, and that conversion is _not_ bidirectional. So even if I know that I now have char 'x', the decoder will not tell me the byte position(s) in the stream that this character came from.

You could try to overcome this by keeping a reference to the original byte stream and reading the position directly from that stream. However, this will not work either, because any decent decoder and parser will run over buffered input (and will try to "fix" unbuffered input by wrapping it in a buffered one), or will even do its own buffering, as Xerces2 does and, I think, SUN's built-in UTF-8 decoder as well (AFAIR it made it very hard to write a Jabber client, as the UTF-8 decoder tried to read too much and blocked ...).

In such cases buffering makes the byte stream position pretty much useless, as the logical position does not necessarily correspond to the physical byte stream position (the byte stream will typically be ahead of the character currently available from the decoder).
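To illustrate the read-ahead effect described above, here is a minimal Java sketch. It is not taken from Xerces2 or from the thread: `ReadAheadDemo`, `CountingInputStream` and `bytesPulledForOneChar` are hypothetical names made up for this illustration. It wraps the raw stream in a byte counter, reads a single char through the standard `InputStreamReader`, and reports how many bytes the decoder had already taken from the stream at that point.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReadAheadDemo {

    // Hypothetical helper: counts every byte the decoder pulls from the raw stream.
    static final class CountingInputStream extends FilterInputStream {
        long count = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) count++;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) count += n;
            return n;
        }
    }

    // Decodes exactly one char as UTF-8 and returns how many bytes were
    // taken from the underlying stream by the time that char was available.
    static long bytesPulledForOneChar(byte[] data) {
        CountingInputStream counted = new CountingInputStream(new ByteArrayInputStream(data));
        try (Reader reader = new InputStreamReader(counted, StandardCharsets.UTF_8)) {
            reader.read(); // logical position is now one char
            return counted.count; // physical position of the byte stream
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "x followed by plenty of other text".getBytes(StandardCharsets.UTF_8);
        System.out.println("chars read: 1, bytes pulled: " + bytesPulledForOneChar(data));
    }
}
```

The exact count depends on the decoder implementation, so it should not be relied on; the point is only that it is generally larger than the number of bytes that encode the one char actually delivered.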
Moreover, even if you manage to turn off buffering, it will degrade the performance of the parser a lot, so you may be able to read physical positions but overall performance will be really bad ...

All of this is true when trying to find a general solution to the problem. However, the situation is not that bad for simple encodings such as ISO-8859-1: you can always get the original byte stream position because there is a one-to-one correspondence between characters and bytes. The only difficulty is with UTF-8, so if you are willing to send input as ISO-8859-1, or UTF-16 (two bytes per Java char), or any other encoding with a fixed number of bytes per character, you should be fine (and can easily convert a logical position to a physical byte stream position ...).

Finally, as the source code of Xerces2 is available, you can always change the code to suit your needs (but it will take some time to get a good solution, especially if you care about performance, and that seems to be the case ...).

hope it helps,

alek

Paul Libbrecht wrote:

> Well,
>
> Having a position according to an encoding is honestly, simply... bad.
>
> One of the goal applications was to be able to be a client of such an
> indexed-database over http/1.1. The latter protocol has a way to request
> only a row of segments of a file. But that can only happen in bytes of
> course.
>
> When doing it with files, one expects to use, say, the
> InputStream.skip() method which is, hopefully, efficiently implemented
> and skips the cursor in the file-reading underlying routines.
> Skipping x characters using an encoding is simply a killer: the encoding
> has to run through all the characters. For example, in UTF-8, skipping
> an escaped character means skipping three bytes (I think) whereas
> skipping an ASCII character means skipping one byte.
>
> So... I really meant: "Can I get the byte-position".
> Currently, the only way is to build the index using a
> "load-in-memory-then-rewrite-to-file"... I can live with this but I
> would have expected "fine parsers" to provide more.
>
> Paul
>
> On Tuesday, August 27, 2002, at 04:42, Aleksander Slominski wrote:
> >> Finally... to xerces makers/users: how do I get the byte position of an
> >> element declaration I've just been handed by the sax parser ?
> >
> > this is more complex as parser works on UTF-16 characters (char)
> > so obtaining position of original stream if it was not UTF-16 is very
> > difficult. however i think that for your cases it is enough to get
> > position of start/end element in character stream. ability to obtain
> > position is not currently part of xerces2 but you can take a look at my
> > patch that adds to XMLLocator function getCurrentEntityAbsoluteOffset()
> > that can be used to get current position of parser. together with
> > changes to XMLDocumentFragmentScannerImpl it is possible to get
> > start/end position of every XML event in XNI. for details see:
> >
> > http://www.extreme.indiana.edu/xgws/xsoap/xpp/download/PullParser2/lib/xerces2_patched/
>
> ---------------------------------------------------------------------
> In case of troubles, e-mail: webmaster@xml.apache.org
> To unsubscribe, e-mail: general-unsubscribe@xml.apache.org
> For additional commands, e-mail: general-help@xml.apache.org
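
As a small sketch of the character-to-byte mapping discussed in this thread (an illustration only, not code from Xerces2 or from either poster): for a fixed-width encoding the byte offset is a plain multiplication, while for UTF-8 every preceding character has to be inspected. `ByteOffsets`, `fixedWidthByteOffset` and `utf8ByteOffset` are hypothetical names, and the sketch assumes well-formed text held as Java chars (UTF-16 code units) with no byte-order mark in the stream.

```java
class ByteOffsets {

    // Fixed-width case: 1 byte per char for ISO-8859-1,
    // 2 bytes per Java char for UTF-16 (BOM not counted).
    static long fixedWidthByteOffset(long charOffset, int bytesPerChar) {
        return charOffset * bytesPerChar;
    }

    // UTF-8 case: walk all chars before charOffset and add up
    // the encoded length of each code point (1 to 4 bytes).
    static long utf8ByteOffset(CharSequence text, int charOffset) {
        long bytes = 0;
        int i = 0;
        while (i < charOffset) {
            int cp = Character.codePointAt(text, i);
            if (cp < 0x80) {
                bytes += 1;          // ASCII
            } else if (cp < 0x800) {
                bytes += 2;          // e.g. accented Latin letters
            } else if (cp < 0x10000) {
                bytes += 3;          // rest of the Basic Multilingual Plane
            } else {
                bytes += 4;          // supplementary planes (surrogate pair in Java)
            }
            i += Character.charCount(cp);
        }
        return bytes;
    }

    public static void main(String[] args) {
        String text = "a\u00e9\u20acx"; // 'a', e-acute, euro sign, 'x'
        System.out.println("UTF-8 byte offset of char 3: " + utf8ByteOffset(text, 3));
        System.out.println("ISO-8859-1 byte offset of char 3: " + fixedWidthByteOffset(3, 1));
    }
}
```

The UTF-8 variant is linear in the number of preceding characters, which is the cost of skipping by characters rather than bytes that the thread complains about.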