lucene-dev mailing list archives

From Peter Becker <pbec...@dstc.edu.au>
Subject Re: Lucene crawler plan
Date Wed, 02 Jul 2003 01:10:02 GMT
Erik Hatcher wrote:

[...some Ant related things I should look at...]

>> What are the issues with JTidy?
>
>
> The version number!  It's ancient.  It does a decent job with even 
> mangled HTML though - I just suspect something better surely is out 
> there by now.

My colleague had the same thought, but I don't think it is a problem. 
The HTML 4.01 recommendation dates from Christmas 1999, so I don't really 
see any reason why they should have changed JTidy once it worked well 
enough. Of course website programmers might have come up with other forms 
of weirdness in their code by now, but I can easily imagine that this is 
not a problem if the original parser was robust enough.

  Peter


