commons-dev mailing list archives

From Brad Neuberg <b...@columbia.edu>
Subject [feedparser] Large number of bug fixes for Jakarta Feed Parser
Date Thu, 03 Mar 2005 00:46:23 GMT
Over the last few weeks I have been working on improving Rojo's feed 
subscription system, which uses the Jakarta Feed Parser.  To do so I built 
an automated testing framework that scripted our UI and ran a large amount 
of data through it, fixing subscription bugs as I encountered them.  I was 
able to trace a number of these bugs into the Jakarta Feed Parser and fix 
them there.  The patch linked at the end of this message contains these 
bug fixes.  Here are the bugs and the changes I made to fix them:

* Xanga feeds were not working.  Xanga does not support the autodiscovery 
standard and doesn't have anything in its HTML that allows us to use HTML 
link probing to find an RSS or Atom feed.  Instead, we have to do 
aggressive probing at well-known locations.  Unfortunately, some blogging 
services handle HTTP redirects incorrectly and also return 200 OK for 
files that don't exist, giving false positives.  So when the ProbeLocator 
is probing for RSS files at well-known locations, for some blogging 
services we have to follow an HTTP redirect and for others we have to 
ignore any redirects.  I've added a new method named followRedirects() to 
the base BlogService class; each individual blogging service now returns 
true or false to say whether to follow redirects returned when probing 
that remote service.  It turns out that we must ignore redirects for every 
service we currently deal with except Xanga.  The Xanga BlogService class 
returns true for followRedirects(), making it possible to work with these 
blogs now.
* The FeedLocator class does three different kinds of RSS discovery: 
autodiscovery probing, HTML source analysis, and aggressive probing.  When 
each of these stages added discovered links to our list of found RSS 
feeds, it did not first check whether we had already found that 
particular link through a different discovery mechanism.  This led to 
duplicate feeds in the list, which broke downstream systems that use the 
Jakarta Feed Parser.  This has been fixed.
* Exact subscriptions to Craigslist feeds, such as 
"http://www.craigslist.org/w4m/index.xml", were not working.  A 
FeedReference can either be absolute, such as 
"http://rss.groups.yahoo.com/group/talkinaboutarchitecture/rss", or 
relative, such as "/atom.xml".  When the ProbeLocator runs, we first 
discover the BlogService we are dealing with and then "ask" that blog 
service for its usual list of feed locations, which are returned as 
FeedReferences.  I added a FeedReference.isRelative() method so that the 
ProbeLocator correctly builds the URL to probe for a particular 
FeedReference based on whether it is a relative or absolute path.
* Yahoo Groups were not working.  I modified the YahooGroups BlogService 
object to work correctly.
* When subscribing to some feeds a NullPointerException was thrown in the 
EntityDecoder; fixed this.
* AOL LiveJournal feeds were not working.  I modified the AOLJournal 
BlogService object to work correctly.
* Some feed services, such as AOL LiveJournal, are case-sensitive when 
retrieving feeds.  We were incorrectly lowercasing all feed URLs in 
DiscoveryLocator and BlogServiceDiscovery; we now keep the case that is 
discovered through autodiscovery.
* I rewrote parts of ResourceExpander; a large number of feeds weren't 
being subscribed due to bugs in how we were expanding URIs.
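
For readers without the patch at hand, here is a minimal Java sketch of 
three of the fixes above: the per-service followRedirects() switch, 
duplicate-free merging of discovered links, and building a probe URL from 
a relative or absolute FeedReference.  It is not taken from the patch.  
Only the names BlogService, FeedReference, followRedirects() and 
isRelative() come from the message; the class ProbeSketch, the helper 
methods, the "false" default in the base class, and the example.com URLs 
are assumptions made for illustration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ProbeSketch {

    /** Hypothetical stand-in for FeedReference; only isRelative() is named in the message. */
    static final class FeedReference {
        final String location;
        FeedReference(String location) { this.location = location; }

        /** Relative when the location has no URI scheme, e.g. "/atom.xml". */
        boolean isRelative() {
            return !URI.create(location).isAbsolute();
        }
    }

    /** Hypothetical stand-in for the base BlogService class. */
    static abstract class BlogService {
        /** Assumed default: ignore redirects, as the message says most services require. */
        boolean followRedirects() { return false; }
    }

    /** Xanga is the one service said to need redirects followed. */
    static final class Xanga extends BlogService {
        @Override boolean followRedirects() { return true; }
    }

    /** Absolute references are probed as-is; relative ones are resolved against the blog URL. */
    static String buildProbeUrl(String blogUrl, FeedReference ref) {
        if (!ref.isRelative()) {
            return ref.location;
        }
        return URI.create(blogUrl).resolve(ref.location).toString();
    }

    /** Opens (but does not connect) a probe request honoring the service's redirect policy. */
    static HttpURLConnection openProbe(BlogService service, String probeUrl) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(probeUrl).toURL().openConnection();
        conn.setInstanceFollowRedirects(service.followRedirects());
        return conn;
    }

    /**
     * Merges links from several discovery stages, keeping first-seen order and
     * dropping exact duplicates. The comparison is case-sensitive on purpose,
     * since some services serve different feeds for differently cased URLs.
     */
    static List<String> mergeDiscovered(List<List<String>> stages) {
        Set<String> seen = new LinkedHashSet<>();
        for (List<String> stage : stages) {
            seen.addAll(stage);
        }
        return new ArrayList<>(seen);
    }
}
```

A LinkedHashSet is used here so the merged list keeps the order in which 
the stages found the links; the real FeedLocator may track duplicates 
differently.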

I also discovered a serious bug in our LinkLocator process that I wasn't 
able to fix.  We scan the document looking for certain kinds of links to 
see if they are RSS links; however, we ignore an A HREF tag if it has an 
image inside.  This is extremely dangerous, though, since most pages have 
the orange XML icon on their page, hyperlinked to their feed!  If this is 
fixed I suspect we will be able to find many more feeds through the 
LinkLocator in the future.
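
One possible direction for such a fix, sketched in Java under stated 
assumptions: look only at the href of each anchor's opening tag and never 
inspect what the anchor wraps, so a link around an image is treated like 
any other.  This is not the LinkLocator code.  The class 
AnchorScanSketch, the regular expression, and the file-extension test in 
looksLikeFeed() are all hypothetical, and a real implementation would 
more likely use an HTML parser than a regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnchorScanSketch {

    // Matches the opening tag of an anchor and captures its quoted href value.
    // The anchor body (text, <img>, anything else) is deliberately not examined.
    private static final Pattern ANCHOR_HREF = Pattern.compile(
        "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    /** Assumed heuristic: treat common feed file extensions as feed links. */
    static boolean looksLikeFeed(String href) {
        // Lowercased only for the comparison; the href itself is returned unchanged.
        String h = href.toLowerCase(Locale.ROOT);
        return h.endsWith(".xml") || h.endsWith(".rss") || h.endsWith(".rdf");
    }

    /** Returns feed-like hrefs in document order, with their original case preserved. */
    static List<String> candidateFeedLinks(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = ANCHOR_HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1);
            if (looksLikeFeed(href)) {
                found.add(href);
            }
        }
        return found;
    }
}
```

With this approach an anchor such as 
<a href="/index.xml"><img src="xml.gif"></a> yields "/index.xml" as a 
candidate, the same as a plain text link would.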

The patch is too big to place here.  I have put it on my web server at 
http://codinginparadise.org/feedparser/feed_refactor_patch_01_02_2005.txt

Best,
Brad Neuberg, bkn3@columbia.edu
Senior Software Engineer, Rojo Networks
Weblog: http://www.codinginparadise.org
