xml-general mailing list archives

From: Andy Clark <an...@apache.org>
Subject: Re: HTML Parser Update Available
Date: Fri, 12 Apr 2002 09:19:48 GMT

Harald Hett wrote:
> > 2) A property, for example:
> >
> >   "http://cyberneko.org/html/names/modify"  { "upper", "lower",
> > "default" }
> >
> > [These are just examples. I might want to modify the names.]
> >
> A property would be great, but with "no" instead of "default".

This weekend I'll be working on adding some minor features to
NekoHTML. During that time, I'll add some properties to allow
the application to control how NekoHTML handles element and
attribute names from the source document. I'm currently
thinking of the following properties:


each with the following allowed values:

  { "upper", "lower", "default" }

Since I changed the property names, your request to change 
the "default" value to "no" doesn't apply anymore. So I'm
still using "default". Does this make more sense or should
it be changed to something else entirely, like "nochange" or
"specified" or ...?

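To make the three values concrete, here is a minimal sketch of what
they could mean for a single name. The class and method names are
hypothetical and purely illustrative; this is not NekoHTML code, and
the real property handling may end up looking quite different.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only; not part of NekoHTML.
final class NameModifier {

    // Applies one of the proposed values to an element or attribute name.
    static String modify(String name, String mode) {
        switch (mode) {
            case "upper":
                // Locale.ROOT keeps the result independent of the platform locale.
                return name.toUpperCase(Locale.ROOT);
            case "lower":
                return name.toLowerCase(Locale.ROOT);
            case "default":
                // Leave the name exactly as it appeared in the source document.
                return name;
            default:
                throw new IllegalArgumentException("Unknown value: " + mode);
        }
    }

    public static void main(String[] args) {
        System.out.println(modify("Body", "upper"));
        System.out.println(modify("Body", "lower"));
        System.out.println(modify("Body", "default"));
    }
}
```

Rejecting unknown values up front is only one possible choice; silently
falling back to "default" would be another.
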
In addition, I'll be adding code to allow the application
to set which encoding to use by default. Right now the default
is Cp1252, which is the standard Windows code page (on English
machines). But I'm running into the situation where I'm
parsing Japanese HTML pages that do not have an http-equiv
directive specifying the encoding. This is probably because
they falsely assume that only people with Japanese systems 
are visiting their web site. (It's not just English-speaking
programmers that ignore globalization, people! ;)

In this case, I need to add some kind of intelligence to my
app so the parser uses a different default encoding. For
example, if the domain ends with ".jp" then assume "EUC-JP" 
or "Shift_JIS" encoding. <tangent>It would be awesome if
there were an "AutoDetect" Japanese decoder for Java. But
until then I'll just have to pick one.</tangent> Anyway, 
this would be set through a property as well.
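
The kind of application-side guess described above might look roughly
like the sketch below. Everything in it is an assumption made for
illustration: the class and method names are hypothetical, and picking
"Shift_JIS" over "EUC-JP" is arbitrary. The result would then be handed
to the parser through whatever property ends up being defined.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical application-side helper; not part of NekoHTML.
final class DefaultEncodingGuess {

    // Returns the encoding to assume when the page does not declare one.
    static String guess(String url, String fallback) {
        String host = URI.create(url).getHost();
        if (host != null && host.toLowerCase(Locale.ROOT).endsWith(".jp")) {
            // Arbitrary pick between the two common Japanese encodings.
            return "Shift_JIS";
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(guess("http://www.example.co.jp/index.html", "Cp1252"));
        System.out.println(guess("http://www.example.com/index.html", "Cp1252"));
    }
}
```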

I'm also thinking of adding code to allow the tag balancer
to pass infoset augmentations along the pipeline. Specifically,
information regarding whether the event info is specified in
the document or "synthesized" by the tag balancer. This would
allow people at the end of the pipeline to tell exactly what
was really in the source document. (For performance reasons,
though, this feature would be "off" by default.)
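
As a rough illustration of the idea (not the actual tag balancer, and
not the XNI augmentations API), the sketch below tags each event with a
flag saying whether it came from the source or was made up while
balancing. All names are hypothetical. Input tokens are simplified to
"name" for a start tag and "/name" for an end tag.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical toy balancer; it only shows the specified/synthesized flag.
final class BalancerSketch {

    static final class Event {
        final String kind;          // "start" or "end"
        final String name;
        final boolean synthesized;  // true if not present in the source

        Event(String kind, String name, boolean synthesized) {
            this.kind = kind;
            this.name = name;
            this.synthesized = synthesized;
        }
    }

    static List<Event> balance(List<String> tokens) {
        List<Event> out = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>();
        for (String token : tokens) {
            if (!token.startsWith("/")) {
                open.push(token);
                out.add(new Event("start", token, false));
                continue;
            }
            String name = token.substring(1);
            if (!open.contains(name)) {
                continue; // stray end tag with no matching start: dropped
            }
            // Close any elements left open inside the one being ended.
            while (!open.peek().equals(name)) {
                out.add(new Event("end", open.pop(), true));
            }
            out.add(new Event("end", open.pop(), false));
        }
        // Close whatever is still open at the end of the document.
        while (!open.isEmpty()) {
            out.add(new Event("end", open.pop(), true));
        }
        return out;
    }
}
```

Given the tokens html, body, p, /body, this emits six events; the end
events for p and html are flagged as synthesized, while the end event
for body is not.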

There'll be enough minor changes to warrant boosting the
version number to "0.4.0" instead of "0.3.4". Just a heads
up, in case anyone cares.

> > > Is it planned to include NekoHTML into the Xerces release?
> >
> [...]
> Unfortunately the link to CyberNeko is not well known in the public. I
> only got notice of it by reading your recent postings in
> general@xml.apache.org. I think it should be either included in the
> Xerces distribution or made accessible from the xerces homepage.

We are having a discussion in a separate thread regarding this 
very topic. Depending on the result of that discussion, NekoHTML
will either be rolled into Xerces OR become a separate project
of its own. In the latter case, it's not clear whether separate
projects are included in the Xerces codebase (but kept separate)
or hosted elsewhere.

I have no problem with NekoHTML remaining separate but it would
be nice to have links to related projects from the Xerces page,
as you suggest.

Andy Clark * andyc@apache.org

