tika-dev mailing list archives

From "Brian Levay" <brian.le...@gmail.com>
Subject Re: HTML <meta> tags
Date Wed, 24 Sep 2008 20:17:25 GMT
I'll submit the updates when I'm done (along with unit tests).  I'm having a
problem, though.  I synced my Tika baseline this morning and the Matcher
stopped matching the <meta> tags.  Any idea what may be causing this?  I've
tried many variations of the XPath expressions to match the <meta> tags.
Right now my code in HTMLParser looks like this:

        Matcher body = xpath.parse("/HTML/BODY//node()");
        Matcher title = xpath.parse("/HTML/HEAD/TITLE//node()");
        Matcher meta = xpath.parse("/HTML/HEAD/META//node()");
        handler = new TeeContentHandler(
                new MatchingContentHandler(getBodyHandler(xhtml), body),
                new MatchingContentHandler(getTitleHandler(metadata), title),
                new MatchingContentHandler(getMetaHandler(metadata), meta));

The <meta> handler isn't being called.  If I use /HTML/HEAD//node() the
handler gets called for the <head> and <title> tags, but it skips right
past the <meta> tags.  I know the Tika code is seeing the META tags,
because I can see them being tested for a match in the startElement method
of MatchingContentHandler.  Any ideas?
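
For comparison, here is a minimal standalone sketch of what a <meta>
handler has to read once it does receive the events.  It is not Tika code
and not a diagnosis of the Matcher problem: it uses the plain JAXP SAX
parser on well-formed XHTML, and the names MetaCollector and collect are
made up for illustration.  It only shows that the name/content values
arrive as attributes on the META startElement call itself, since the
element has no child nodes.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: collects <meta name="..." content="...">
// pairs from well-formed XHTML using the standard SAX API.
public class MetaCollector extends DefaultHandler {
    private final Map<String, String> meta = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // <meta> is an empty element, so everything of interest is in atts.
        if ("meta".equalsIgnoreCase(qName)) {
            String name = atts.getValue("name");
            String content = atts.getValue("content");
            if (name != null && content != null) {
                meta.put(name, content);
            }
        }
    }

    // Parses the given XHTML string and returns the collected name/content pairs.
    public static Map<String, String> collect(String xhtml) throws Exception {
        MetaCollector handler = new MetaCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.meta;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html><head><title>t</title>"
                + "<meta name=\"author\" content=\"Brian\"/>"
                + "</head><body><p>hi</p></body></html>";
        System.out.println(collect(doc));
    }
}
```

Whatever path expression ends up matching, the handler passed to
MatchingContentHandler would presumably need to pull the values out of the
Attributes argument in the same way.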

--Brian

On Tue, Sep 23, 2008 at 6:04 PM, Thorsten Scherler <thorsten@apache.org> wrote:

> On Sat, 2008-09-20 at 22:41 +0200, Jukka Zitting wrote:
> > Hi,
> >
> > On Fri, Sep 19, 2008 at 7:16 PM, Brian Levay <brian.levay@gmail.com> wrote:
> > > I need to enhance the functionality of HTMLParser to return the HTML <meta>
> > > tags found in the document in the Metadata object.  Is overriding HTMLParser
> > > (or installing a custom HTMLParser) the best way to do this?
> >
> > We would be happy to receive a patch that adds this feature directly
> > in Tika. :-)
>
> +1
>
> salu2
> >
> > BR,
> >
> > Jukka Zitting
> --
> Thorsten Scherler                                 thorsten.at.apache.org
> Open Source Java                      consulting, training and solutions
>
>
