tika-dev mailing list archives

From "Brian Levay" <brian.le...@gmail.com>
Subject RE: HTML <meta> tags
Date Thu, 25 Sep 2008 02:01:10 GMT
Dave,

I thought it might have been my configuration at work so I tried again at
home on a clean machine tonight and I get the same error.  My files are
identical to your diff.  For me it won't enter the startElement() method for
the meta handler.

I attached the three files (HTMLParser, HTMLParserTester, testHTML.html).

I'm sure this is something simple.  Maybe some kind of configuration
difference?  I've tried Java 5 and 6 with no change.  Do you have a zip of
all your dependent tika .jar files I can try to use?  That is my only guess
now.

--Brian

-----Original Message-----
From: Dave Meikle [mailto:loompa@gmail.com] 
Sent: Wednesday, September 24, 2008 5:04 PM
To: tika-dev@incubator.apache.org
Subject: Re: HTML <meta> tags

Hi

2008/9/24 Brian Levay <brian.levay@gmail.com>

> I'll submit the updates when I'm done (along with unit tests).  I'm having a
> problem though.  I sync'ed my tika baseline this morning and the Matcher
> stopped matching the <meta> tags.  Any idea what may be causing this?  I've
> tried many variations of the xpath expressions to match the <meta> tags.
> Right now my code in HTMLParser looks like this:
>
>        Matcher body = xpath.parse("/HTML/BODY//node()");
>        Matcher title = xpath.parse("/HTML/HEAD/TITLE//node()");
>        Matcher meta = xpath.parse("/HTML/HEAD/META//node()");
>        handler = new TeeContentHandler(
>                new MatchingContentHandler(getBodyHandler(xhtml), body),
>                new MatchingContentHandler(getTitleHandler(metadata), title),
>                new MatchingContentHandler(getMetaHandler(metadata), meta));
>
> The <meta> handler isn't being called.  If I use /HTML/HEAD//node() the
> handler will get called for the <head> and <title> tags but it will skip
> right past the <meta> tags.  I know the tika code is seeing the META tags
> because I see the tags trying to be matched in the startElement method of
> MatchingContentHandler.  Any ideas?
>
> --Brian
>

I am using effectively the same thing in a local copy and have just rebased
it against HEAD (shown in the diff below), and it appears to be working fine
for me.

What is your test XML like?

Cheers,
Dave


Index: src/main/java/org/apache/tika/parser/html/HtmlParser.java
===================================================================
--- src/main/java/org/apache/tika/parser/html/HtmlParser.java    (revision 698705)
+++ src/main/java/org/apache/tika/parser/html/HtmlParser.java    (working copy)
@@ -95,9 +95,11 @@
         XPathParser xpath = new XPathParser(null, "");
         Matcher body = xpath.parse("/HTML/BODY//node()");
         Matcher title = xpath.parse("/HTML/HEAD/TITLE//node()");
+        Matcher meta = xpath.parse("/HTML/HEAD/META//node()");
         handler = new TeeContentHandler(
                 new MatchingContentHandler(getBodyHandler(xhtml), body),
-                new MatchingContentHandler(getTitleHandler(metadata), title));
+                new MatchingContentHandler(getTitleHandler(metadata), title),
+                new MatchingContentHandler(getMetaHandler(metadata), meta));

         // Parse the HTML document
         xhtml.startDocument();
@@ -116,6 +118,17 @@
         };
     }

+    private ContentHandler getMetaHandler(final Metadata metadata) {
+        return new WriteOutContentHandler() {
+            @Override
+            public void startElement(
+                    String uri, String local, String name, Attributes atts)
+                    throws SAXException {
+                    metadata.set(atts.getValue(0), atts.getValue(1));
+            }
+        };
+    }
+
     private ContentHandler getBodyHandler(final XHTMLContentHandler xhtml) {
         return new TextContentHandler(xhtml) {
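
A side note on the getMetaHandler() in the diff above: it reads the key and value by attribute position (getValue(0), getValue(1)), so the result depends on the order in which the attributes appear in the source tag. The following is only a minimal sketch of looking the attributes up by name instead. It assumes plain JDK SAX rather than the Tika classes, well-formed XHTML input, and a default (non-namespace-aware) parser; MetaByName, MetaCollector and collect() are hypothetical names used for illustration and are not part of Tika.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class MetaByName {

    /** Hypothetical handler: collects meta key/content pairs from SAX events. */
    static class MetaCollector extends DefaultHandler {
        final Map<String, String> found = new LinkedHashMap<String, String>();

        @Override
        public void startElement(
                String uri, String local, String qName, Attributes atts) {
            // Accept META or meta, since the element-name case depends on the parser.
            if (!"meta".equalsIgnoreCase(qName)) {
                return;
            }
            // Look attributes up by name so their order in the tag does not matter.
            String content = atts.getValue("content");
            String key = atts.getValue("name");
            if (key == null) {
                key = atts.getValue("http-equiv");
            }
            if (key != null && content != null) {
                found.put(key, content);
            }
        }
    }

    /** Parses a well-formed XHTML string and returns the collected pairs. */
    static Map<String, String> collect(String xhtml) throws Exception {
        MetaCollector collector = new MetaCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), collector);
        return collector.found;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html><head><title>t</title>"
                + "<meta content=\"Brian\" name=\"author\"/>"
                + "<meta http-equiv=\"Content-Language\" content=\"en\"/>"
                + "</head><body/></html>";
        System.out.println(collect(doc));
    }
}
```

The first meta tag in the sample deliberately puts content before name; a positional lookup would store the pair reversed there, while the lookup by name does not.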
