tika-dev mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Re: HTML <meta> tags
Date Wed, 24 Sep 2008 21:04:13 GMT
Hi,

On Wed, Sep 24, 2008 at 10:17 PM, Brian Levay <brian.levay@gmail.com> wrote:
> I'll submit the updates when I'm done (along with unit tests).  I'm having a
> problem though.  I sync'ed my tika baseline this morning and the Matcher
> stopped matching the <meta> tags.  Any idea what may be causing this?

Most likely the TIKA-140 fix that I committed recently. You may want
to try reverting this change:

--- incubator/tika/trunk/src/main/java/org/apache/tika/parser/html/HtmlParser.java	2008/09/22 22:57:10	698027
+++ incubator/tika/trunk/src/main/java/org/apache/tika/parser/html/HtmlParser.java	2008/09/22 23:00:27	698028
@@ -102,7 +102,7 @@
         // Parse the HTML document
         xhtml.startDocument();
         SAXParser parser = new SAXParser();
-        parser.setContentHandler(handler);
+        parser.setContentHandler(new XHTMLDowngradeHandler(handler));
         parser.parse(new InputSource(Utils.getUTF8Reader(stream, metadata)));
         xhtml.endDocument();
     }

> The <meta> handler isn't being called.  If I use /HTML/HEAD//node() the
> handler will get called for the <head> and <title> tags but it will skip
> right past the <meta> tags.  I know the tika code is seeing the META tags
> because I see the tags trying to be matched in the startElement method of
> MatchingContentHandler.  Any ideas?

The XHTMLDowngradeHandler wrapper will uppercase all element names and
drop all namespaces and namespaced attributes, but as far as I can
tell your code should still match the META tags. But there might be
some bug in the XHTMLDowngradeHandler code that breaks things.
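
To make the behaviour described above concrete, here is a rough sketch of
what such a wrapper does to SAX events. It is not the actual
XHTMLDowngradeHandler source: the class name DowngradeSketch, the demo()
helper and the sample attributes are made up for illustration, and only
the standard org.xml.sax classes from the JDK are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a "downgrade" wrapper: uppercases element
// names and drops namespaces and namespaced attributes before delegating.
public class DowngradeSketch extends DefaultHandler {

    private final ContentHandler delegate;

    public DowngradeSketch(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        String upper = localName.toUpperCase(Locale.ENGLISH);
        AttributesImpl plain = new AttributesImpl();
        for (int i = 0; i < atts.getLength(); i++) {
            // Keep only attributes that are not in a namespace
            if (atts.getURI(i).isEmpty()) {
                String name = atts.getLocalName(i);
                plain.addAttribute("", name, name,
                        atts.getType(i), atts.getValue(i));
            }
        }
        // The delegate sees an empty namespace URI and an uppercase name
        delegate.startElement("", upper, upper, plain);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        String upper = localName.toUpperCase(Locale.ENGLISH);
        delegate.endElement("", upper, upper);
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        delegate.characters(ch, start, length);
    }

    // Sends one namespaced <meta> start tag through the wrapper and
    // records what the delegate receives.
    public static List<String> demo() {
        final List<String> seen = new ArrayList<>();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                seen.add("[" + uri + "]" + localName
                        + " attrs=" + atts.getLength());
            }
        };
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "name", "name", "CDATA", "author");
        atts.addAttribute("http://www.w3.org/XML/1998/namespace",
                "lang", "xml:lang", "CDATA", "en");
        try {
            new DowngradeSketch(recorder).startElement(
                    "http://www.w3.org/1999/xhtml", "meta", "meta", atts);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [[]META attrs=1]
    }
}
```

If your matcher is set up with the XHTML namespace URI or with lowercase
element names, it would stop matching once the events pass through a
wrapper like this, so that is one thing worth checking on your side.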

BR,

Jukka Zitting
