abdera-dev mailing list archives

From "Garrett Rooney" <roo...@electricjellyfish.net>
Subject Re: Async Parsing?
Date Thu, 13 Jul 2006 01:06:40 GMT
On 7/12/06, James M Snell <jasnell@gmail.com> wrote:
> I'm not sure I could reasonably envision any use of nonblocking i/o
> operations in an xml parser.  I'm not sure if I've ever seen anyone do
> it before.

Well, I wouldn't expect you to put the nonblocking IO inside the
parser.  It would be more like using nonblocking IO to pull data off
the wire and then handing each chunk to a SAX-style parser as it
arrives.
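
To make that concrete, here is a rough sketch of the IO half using
java.nio.  It is only an illustration: ChunkHandler is a hypothetical
interface standing in for a push-style parser (it is not part of Abdera
or the JDK), and NonblockingFeeder and pump are names made up for this
example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

public class NonblockingFeeder {

    /** Hypothetical stand-in for a push-style (SAX-like) parser fed in pieces. */
    public interface ChunkHandler {
        void onData(ByteBuffer chunk);
        void onEnd();
    }

    /** Drains a nonblocking channel, handing each chunk to the handler as it arrives. */
    public static <C extends SelectableChannel & ReadableByteChannel> void pump(
            C source, ChunkHandler handler) throws IOException {
        source.configureBlocking(false);
        try (Selector selector = Selector.open()) {
            source.register(selector, SelectionKey.OP_READ);
            ByteBuffer buf = ByteBuffer.allocate(4096);
            boolean open = true;
            while (open) {
                selector.select();                // wait until the channel is readable
                selector.selectedKeys().clear();  // only one channel is registered here
                int n;
                while ((n = source.read(buf)) > 0) {
                    buf.flip();
                    handler.onData(buf);          // handler must consume before returning
                    buf.clear();
                }
                if (n < 0) {                      // end of stream
                    handler.onEnd();
                    open = false;
                }
            }
        }
    }
}
```

A real crawler would register many channels with one selector and keep
a handler per connection; this sketch registers a single channel to
keep the loop short.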

> In any case, I figured you might find this entertaining:
> http://www.snellspace.com/wp/?p=381
> http://danga.com:8081/atom-stream.xml is a never-ending xml stream.
>     URL url = new URL("http://danga.com:8081/atom-stream.xml");
>     // we only care about the feed title and alternate link,
>     // we'll ignore everything else
>     ParseFilter filter = new WhiteListParseFilter();
>     filter.add(new QName("atomStream"));
>     filter.add(Constants.FEED);
>     filter.add(Constants.TITLE);
>     filter.add(Constants.LINK);
>     ParserOptions options = Parser.INSTANCE.getDefaultParserOptions();
>     options.setParseFilter(filter);
>     Document doc = Parser.INSTANCE.parse(
>       url.openStream(),(URI)null,options);
>     Element el = doc.getRoot();
>     // get the first feed in the stream, then continue to iterate
>     // from there, printing the title and alt link to the console
>     Feed feed = el.getFirstChild(Constants.FEED);
>     while (feed != null) {
>       System.out.println(
>         feed.getTitle() + "\t" + feed.getAlternateLink().getHref());
>       Feed next = feed.getNextSibling(Constants.FEED);
>       feed.discard();
>       feed = next;
>     }
> There are some memory-creep issues so I wouldn't recommend keeping this
> running forever :-)

This is neat, but it's not really what I'm thinking of.  The use case
I was more concerned about is a crawler that's trying to pull down a
scary amount of data, but doesn't want to devote a thread to each
download, so it hands each chunk of data off to a parser as it
arrives.  Now, practically speaking, it's debatable whether you'd
actually want to do this.  It might make more sense to spool the data
off someplace and then parse the feed after you've got it all, unless
of course you're talking about a never-ending atom feed ;-)
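
For what it's worth, the spool-then-parse alternative might look
something like the sketch below.  It uses only the JDK's SAX parser,
not Abdera, and it spools into memory where a real crawler would
probably spool to disk.  SpoolThenParse, append and finish are names
made up for this example, and pulling out the title elements is just a
placeholder for whatever you'd actually do with the feed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SpoolThenParse {
    private final ByteArrayOutputStream spool = new ByteArrayOutputStream();

    /** Called by the IO layer whenever more bytes arrive for this feed. */
    public void append(byte[] chunk, int offset, int length) {
        spool.write(chunk, offset, length);
    }

    /** Called once the download is complete; returns the text of every title element. */
    public List<String> finish() throws Exception {
        final List<String> titles = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside a title element

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("title".equals(qName)) text = new StringBuilder();
            }

            @Override
            public void characters(char[] ch, int start, int len) {
                if (text != null) text.append(ch, start, len);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("title".equals(qName) && text != null) {
                    titles.add(text.toString());
                    text = null;
                }
            }
        };
        // Parse the whole spooled document in one blocking call.
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(spool.toByteArray()), handler);
        return titles;
    }
}
```

The parse only runs after the last chunk is in, so chunk boundaries can
fall anywhere, even in the middle of a tag, without the parser noticing.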

