abdera-dev mailing list archives

From James M Snell <jasn...@gmail.com>
Subject Filtering
Date Mon, 12 Jun 2006 20:58:35 GMT
I've modified the element filtering mechanisms to provide significantly
greater control over the filtering of elements, attributes and text
content during the parse process.  This change also modifies the "turbo"
optimization that I posted about last week. The API is fairly
rudimentary right now and could likely use some tweaking.

===============================================
Text Filtering
===============================================
For example, to filter text content (e.g. if you have escaped html in an
atom:content element and want to filter out unwanted/unsafe tags, or if
you want to replace all instances of the word "Foo" with "Bar" within
elements in the "urn:bar" namespace), you would provide an instance of
org.apache.abdera.filter.TextFilter via ParserOptions, like so:

    TextFilter filter = new TextFilter() {
      public String filterText(
        String text,
        Element parent) {
          QName qname = ((ExtensionElement)parent).getQName();
          if (qname.getNamespaceURI().equals("urn:bar")) {
            text = text.replaceAll("Foo", "Bar");
          }
          return text;
      }

      public String filterAttributeText(
        String text,
        QName attribute,
        Element parent) {
          return text;
      }
    };

    URL url = Test.class.getResource("/test.xml");
    ParserOptions options = Parser.INSTANCE.getDefaultParserOptions();
    options.setTextFilter(filter);
    Document<Feed> doc = Parser.INSTANCE.parse(
      url.openStream(), url.toURI(), options);

The parser applies the text filter during the parse.  Passing in the
following XML,

  <?xml version='1.0' ?>
  <feed xmlns="http://www.w3.org/2005/Atom">
    <author><name>Test</name></author>
    <entry>
      <author><name>Foo</name></author>
      <content type="application/xml"><a:a xmlns:a="urn:foo"><a:b><c
xmlns="urn:bar" a:b="a">Foo<d/>Bar<d/>Foo</c></a:b></a:a></content>
    </entry>
  </feed>

What you'd actually see via the Feed Object Model interfaces and in the
reserialized XML is:

  <?xml version='1.0' ?>
  <feed xmlns="http://www.w3.org/2005/Atom">
    <author><name>Test</name></author>
    <entry>
      <author><name>Foo</name></author>
      <content type="application/xml"><a:a xmlns:a="urn:foo"><a:b><c
xmlns="urn:bar" a:b="a">Bar<d/>Bar<d/>Bar</c></a:b></a:a></content>
    </entry>
  </feed>
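
One side note on the filter body above: String.replaceAll treats its
first argument as a regular expression, so a word containing characters
such as "." or "(" would need quoting. The following is a minimal,
standalone sketch of the same namespace check using the literal
String.replace instead. The class name TextReplaceSketch and the helper
signature (taking the parent's QName directly rather than an Element)
are assumptions made for illustration and are not part of the Abdera API.

```java
import javax.xml.namespace.QName;

// Hypothetical standalone helper; it only mirrors the logic of the
// filterText method shown earlier and does not use any Abdera classes.
public class TextReplaceSketch {

    // Replace "Foo" with "Bar" only when the parent element is in urn:bar.
    static String filterText(String text, QName parent) {
        if ("urn:bar".equals(parent.getNamespaceURI())) {
            // String.replace matches literally, unlike replaceAll (regex).
            return text.replace("Foo", "Bar");
        }
        return text;
    }

    public static void main(String[] args) {
        QName c = new QName("urn:bar", "c");
        QName name = new QName("http://www.w3.org/2005/Atom", "name");
        System.out.println(filterText("Foo", c));    // parent in urn:bar
        System.out.println(filterText("Foo", name)); // parent in Atom namespace
    }
}
```

This matches the sample above: the text inside <c xmlns="urn:bar"> is
rewritten, while the atom:name text "Foo" is left alone.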


===============================================
Tag Filtering
===============================================

To filter out unwanted tags (e.g. for a speed-optimized parse), you
simply set a ParseFilter on the ParserOptions.  There are currently two
ParseFilter implementations available: WhiteListParseFilter and
BlackListParseFilter.

In the WhiteListParseFilter, only the QNames explicitly listed (and, by
default, their attributes) will be parsed.  There is an option to
require that acceptable attributes be listed explicitly.

In the BlackListParseFilter, all QNames NOT listed will be parsed.
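
To make the difference between the two concrete, here is a minimal
sketch of the acceptance rule each one applies to an element QName. It
is independent of Abdera: the class ListFilterSketch and its method
names are hypothetical and only model the behavior described above, not
the actual ParseFilter implementations.

```java
import java.util.HashSet;
import java.util.Set;
import javax.xml.namespace.QName;

// Hypothetical model of a list-based filter; not Abdera's classes.
public class ListFilterSketch {
    private final Set<QName> listed = new HashSet<>();
    private final boolean whiteList;

    public ListFilterSketch(boolean whiteList) {
        this.whiteList = whiteList;
    }

    public ListFilterSketch add(QName qname) {
        listed.add(qname);
        return this;
    }

    // White list: accept only listed names.
    // Black list: accept every name that is NOT listed.
    public boolean acceptable(QName qname) {
        return listed.contains(qname) == whiteList;
    }

    public static void main(String[] args) {
        QName feed = new QName("http://www.w3.org/2005/Atom", "feed");
        QName name = new QName("http://www.w3.org/2005/Atom", "name");

        ListFilterSketch white = new ListFilterSketch(true).add(feed);
        ListFilterSketch black = new ListFilterSketch(false).add(feed);

        System.out.println("white accepts feed: " + white.acceptable(feed));
        System.out.println("white accepts name: " + white.acceptable(name));
        System.out.println("black accepts feed: " + black.acceptable(feed));
        System.out.println("black accepts name: " + black.acceptable(name));
    }
}
```

The same list of names therefore produces exactly opposite results
depending on which kind of filter holds it.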

For example, using the above example feed,

    ParseFilter filter = new WhiteListParseFilter();
    filter.add(Constants.FEED);
    filter.add(Constants.ENTRY);
    filter.add(Constants.AUTHOR);

    URL url = Test.class.getResource("/test.xml");
    ParserOptions options = Parser.INSTANCE.getDefaultParserOptions();
    options.setParseFilter(filter);
    Document<Feed> doc = Parser.INSTANCE.parse(
      url.openStream(), url.toURI(), options);
    System.out.println(doc.getRoot());

Outputs:

  <feed xmlns="http://www.w3.org/2005/Atom">
    <author />
    <entry>
      <author />

    </entry>
  </feed>

While the following (black-list):

    ParseFilter filter = new BlackListParseFilter();
    filter.add(new QName("urn:foo", "b"));

    URL url = Test.class.getResource("/test.xml");
    ParserOptions options = Parser.INSTANCE.getDefaultParserOptions();
    options.setParseFilter(filter);
    Document<Feed> doc = Parser.INSTANCE.parse(
      url.openStream(), url.toURI(), options);
    System.out.println(doc.getRoot());

Outputs:

  <feed xmlns="http://www.w3.org/2005/Atom">
    <author><name>Test</name></author>
    <entry>
      <author><name>Foo</name></author>
      <content type="application/xml"><a:a xmlns:a="urn:foo" /></content>
    </entry>
  </feed>

===============================================
Attribute Filtering
===============================================

The ParseFilter can also handle attribute filtering on an
element-by-element basis.  For instance, suppose we want to filter out
the "b" attribute from the "urn:foo" namespace on elements named
{urn:bar}c:

  ParseFilter filter = new BlackListParseFilter();
  filter.addAttribute(
    new QName("urn:bar", "c"),
    new QName("urn:foo", "b"));

The sample feed above would come out as follows (note that the a:b
attribute on the <c xmlns="urn:bar" /> element is missing):

  <?xml version='1.0' ?>
  <feed xmlns="http://www.w3.org/2005/Atom">
    <author><name>Test</name></author>
    <entry>
      <author><name>Foo</name></author>
      <content type="application/xml"><a:a xmlns:a="urn:foo"><a:b><c
xmlns="urn:bar">Foo<d/>Bar<d/>Foo</c></a:b></a:a></content>
    </entry>
  </feed>
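
The per-element rule can be modeled as a map from element QName to the
set of attribute QNames blocked on it. The following is only a sketch
under that assumption: AttributeBlackListSketch and acceptableAttribute
are hypothetical names, and addAttribute merely mirrors the call shown
above rather than reproducing Abdera's implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import javax.xml.namespace.QName;

// Hypothetical model of black-list attribute filtering; not Abdera code.
public class AttributeBlackListSketch {
    // element name -> attribute names to drop on that element
    private final Map<QName, Set<QName>> blocked = new HashMap<>();

    public void addAttribute(QName element, QName attribute) {
        blocked.computeIfAbsent(element, k -> new HashSet<>()).add(attribute);
    }

    // An attribute is kept unless it is listed for this specific element.
    public boolean acceptableAttribute(QName element, QName attribute) {
        Set<QName> attrs = blocked.get(element);
        return attrs == null || !attrs.contains(attribute);
    }

    public static void main(String[] args) {
        AttributeBlackListSketch filter = new AttributeBlackListSketch();
        QName c = new QName("urn:bar", "c");
        QName d = new QName("urn:bar", "d");
        QName ab = new QName("urn:foo", "b");
        filter.addAttribute(c, ab);

        System.out.println("a:b kept on c: " + filter.acceptableAttribute(c, ab));
        System.out.println("a:b kept on d: " + filter.acceptableAttribute(d, ab));
    }
}
```

Because the rule is keyed on the element, the same a:b attribute would
still be kept on any element other than {urn:bar}c.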

Thoughts? Concerns? Good change? Bad change?

- James

