abdera-dev mailing list archives

From James M Snell <jasn...@gmail.com>
Subject Abdera and IRIs
Date Thu, 21 Sep 2006 20:52:35 GMT
Ok, so I've been looking into what is needed to allow Abdera to truly
support IRIs as called for by the Atom spec.  A week ago, the only
viable option was to introduce a dependency on ICU, which gives us
Unicode and IDNA support but doesn't actually provide an IRI
implementation.  For that, we would have had to introduce yet another
dependency on something like the Jena project's IRI implementation
(which uses ICU).

Now, ICU is a very nice package and is pretty much THE standard for
handling Unicode in Java.  The problem is that it's a very large package
and includes a whole lot more than we actually need (e.g. we don't
need the calendar, collation, Unicode compression, etc.).

So over the last week I've been working on some code to see how small
an implementation of the basic IRI/IDNA/Unicode stuff we could get and
still claim compliance.  While more testing is needed, I've got a jar
that weighs in at a relatively lightweight 326.5 KB and provides support
for IRI, IDNA, Punycode, Unicode Normalization, supplementary
characters, etc.

Working with an IRI is almost identical to working with a java.net.URI.

  IRI iri = new IRI("http://www.詹姆斯.com/feed");

  System.out.println(iri.toString());
  System.out.println(iri.toASCIIString());

  > http://www.詹姆斯.com/feed
  > http://www.xn--8ws00zhy3a.com/feed

  System.out.println(iri.getHost());
  System.out.println(iri.getASCIIHost());

  > www.詹姆斯.com
  > www.xn--8ws00zhy3a.com

  IRI iri1 = new IRI("http://www.詹姆斯.com/feed");
  IRI iri2 = new IRI("http://www.xn--8ws00zhy3a.com/feed");

  System.out.println(iri1.equals(iri2));
  System.out.println(iri1.equivalent(iri2));

  > false
  > true
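
For comparison, the same equals-versus-equivalent distinction can be
sketched with the JDK alone.  This is not the IRI implementation
described in this message; it assumes java.net.IDN (available in later
JDKs) for the Punycode mapping, and hostsEquivalent is a hypothetical
helper name chosen for illustration.

```java
import java.net.IDN;

public class HostEquivalenceSketch {

    // Hypothetical helper: two hosts are treated as equivalent when
    // their ASCII (Punycode) forms match, ignoring case.
    static boolean hostsEquivalent(String a, String b) {
        return IDN.toASCII(a).equalsIgnoreCase(IDN.toASCII(b));
    }

    public static void main(String[] args) {
        // Unicode escapes spell the host used in the examples above,
        // so the source file does not depend on its encoding.
        String unicodeHost = "www.\u8A79\u59C6\u65AF.com";
        String asciiHost = IDN.toASCII(unicodeHost);

        System.out.println(asciiHost);
        // Plain string comparison sees two different hosts.
        System.out.println(unicodeHost.equals(asciiHost));
        // Comparing the ASCII forms treats them as the same host.
        System.out.println(hostsEquivalent(unicodeHost, asciiHost));
    }
}
```

The sketch only compares hosts; a full IRI comparison would also have
to deal with the path, query and fragment.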

The implementation also provides things that Java's URI implementation
doesn't, such as scheme-specific equivalence checking.
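
As a rough illustration of what scheme-specific equivalence can mean,
here is a minimal sketch built on java.net.URI.  It assumes only one
rule, that an omitted port equals the scheme's default port for http
and https; the class and method names are hypothetical and the actual
rules in the IRI code may differ.

```java
import java.net.URI;
import java.util.Objects;

public class SchemeEquivalenceSketch {

    // Returns the explicit port, or the well-known default for
    // http/https when the port is omitted (-1 if unknown).
    static int effectivePort(URI uri) {
        if (uri.getPort() != -1) {
            return uri.getPort();
        }
        String scheme = uri.getScheme();
        if ("http".equalsIgnoreCase(scheme)) return 80;
        if ("https".equalsIgnoreCase(scheme)) return 443;
        return -1;
    }

    // Hypothetical equivalence check: scheme and host ignore case,
    // ports are compared after applying scheme defaults.
    static boolean equivalent(URI a, URI b) {
        if (a.getScheme() == null || b.getScheme() == null
                || a.getHost() == null || b.getHost() == null) {
            return a.equals(b); // fall back for relative or opaque URIs
        }
        return a.getScheme().equalsIgnoreCase(b.getScheme())
                && a.getHost().equalsIgnoreCase(b.getHost())
                && effectivePort(a) == effectivePort(b)
                && Objects.equals(a.getRawPath(), b.getRawPath())
                && Objects.equals(a.getRawQuery(), b.getRawQuery());
    }

    public static void main(String[] args) {
        URI a = URI.create("http://example.com/feed");
        URI b = URI.create("http://example.com:80/feed");
        System.out.println(a.equals(b));      // URI.equals compares ports literally
        System.out.println(equivalent(a, b)); // default port treated as equal
    }
}
```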

There are even test cases already that, while not 100% comprehensive,
provide fairly decent coverage based on examples given in the various
RFCs implemented.

That said...

Right now, the IRI implementation depends on my Unicode implementation,
which hasn't, of course, had anywhere near the level of testing ICU has
had.  It would be possible, however, for me to change the IRI
implementation so that it can use either ICU or my Unicode stuff
depending on whether ICU is in the classpath.  If ICU is present, I can
use its Unicode and IDNA implementation instead of mine.  It makes
things a bit more complicated, but it's definitely something I can do.
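
One way the classpath check could look is sketched below.  It assumes
that probing for ICU4J's com.ibm.icu.text.IDNA class is a good enough
signal that ICU is available; the class name UnicodeProviderSketch and
the "icu"/"builtin" labels are made up for illustration and are not
part of any checked-in code.

```java
public class UnicodeProviderSketch {

    // Returns true if the named class can be located on the classpath.
    // The class is not initialized, only looked up.
    static boolean isClassPresent(String className) {
        try {
            Class.forName(className, false,
                    UnicodeProviderSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Picks ICU when its IDNA class is found, otherwise falls back
    // to the bundled implementation (labels are placeholders).
    static String chooseProvider() {
        return isClassPresent("com.ibm.icu.text.IDNA") ? "icu" : "builtin";
    }

    public static void main(String[] args) {
        System.out.println("Using Unicode/IDNA provider: " + chooseProvider());
    }
}
```

Doing the lookup once and caching the result would keep the check off
the hot path.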

What I'm proposing is that I check in my IRI/IDNA/Unicode implementation
and that we use it as the default impl.  The code would become part of
the parser module.  After checking the code in and updating Abdera to
use it, I'll work on enabling the automatic ICU switch.

or...

I create a branch of the trunk and integrate my implementation into the
branch.  We kick the tires around on it, see if it works, work on
enabling the ICU switch and when we get both working and we're all
comfortable with it, we merge back into the trunk.

- James
