directory-commits mailing list archives

Subject svn commit: rev 20535 - incubator/directory/snickers/trunk/xdocs/ber-codec
Date Fri, 28 May 2004 07:19:36 GMT
Author: akarasulu
Date: Fri May 28 00:19:35 2004
New Revision: 20535

Commit changes ...

 o added the digester design document - still need to touch this up

Added: incubator/directory/snickers/trunk/xdocs/ber-codec/BERDigesterDesign.xml
--- (empty file)
+++ incubator/directory/snickers/trunk/xdocs/ber-codec/BERDigesterDesign.xml	Fri May 28 00:19:35
@@ -0,0 +1,326 @@
+<?xml version="1.0" encoding="UTF-8"?>
+  <properties>
+    <author email="">Alex Karasulu</author>
+    <title>BER Digester Design</title>
+  </properties>
+  <body>
+    <section name="Introduction">
+      <subsection name="What is it?">
+        <p>
+          The BERDigester is a high level decoder that builds object
+          containment trees from a train of TLV Tuple objects it receives via
+          BERDecoder callback events.  It builds these containment trees
+          through the action of rules that are triggered by TLV Tag nesting
+          patterns.
+        </p>
+      </subsection>
+      <subsection name="Why is it called a digester?">
+        <p>
+          The idea for a higher level decoder that builds object containment
+          trees through the action of rules came when I was looking at the
+          Jakarta Commons (SAX event) Digester.  SAX events are very similar
+          to the TLV callback events that are emitted by the BERDecoder.
+          The Commons Digester consumed these events.  While processing these
+          events it tracked the nesting of XML elements and fired registered
+          rules when nesting patterns were matched.  These rules performed
+          operations on the Digester's stack.  Objects were instantiated and
+          pushed, or populated and then popped to build the containment tree.
+        </p>
+        <p>
+          The BERDigester applies the same pattern as the Commons
+          Digester on a similar underlying event stream.  Although BER Tuple
+          events carry completely different pieces of information, like SAX
+          event streams BER TLV event streams signal the nesting pattern of
+          underlying constructs within the stream.  This reveals the
+          structure of an ASN.1 data type or, in the case of SAX, the structure
+          of an XML document.  Both digesters consume events emanating from
+          some event source to build object containment trees based on the
+          nesting pattern revealed by those events.  The containment trees are
+          built through the action of rules triggered by matching nesting
+          patterns.  Just like the Commons Digester, the BERDigester uses
+          stacks to share objects across rule firings.
+        </p>
+        <p>
+          It's safe to say that there is such a thing as a digester pattern.
+          What better way exists to refer to the pattern, and those pieces of
+          software implementing it, than to just call them digesters?  I could
+          not think of any and that's why we call it the BERDigester.  The BER
+          simply distinguishes it from the Commons Digester so there is no
+          confusion. (Note: BERDigester may be renamed to TupleDigester)
+        </p>
+      </subsection>
+      <subsection name="Rules and Stacks Are Good">
+        <p>
+          Our ultimate goal is to build a stub compiler that generates
+          encoder/decoder pairs for the various ASN.1 encodings: BER, DER, XER,
+          and PER.  The compiler also generates the classes used by a digester
+          to build the containment tree of the ASN.1 data structures
+          transmitted.  Generation of the stubs is not that hard.  However,
+          building a decoder that assembles containment trees using
+          those stubs and an encoded stream is not easy.
+        </p>
+        <p>
+          Stacks help out by reducing the complexity of the problem.  A stack
+          can be used first of all to share objects across rule firings.
+          Secondly it's a perfect data structure for tracking nesting patterns
+          or managing operations on object containment trees.  State and scope
+          are maintained easily using a stack.
+        </p>
+        <p>
+          If simple stack based instructions are devised to instantiate stub
+          objects, push them and pop them, along with primitives and other
+          classes, then the complexity involved with generating decoders
+          diminishes greatly.  Furthermore, if those instructions were
+          declarative rules triggered based on nesting patterns, then decoder
+          generation becomes a cakewalk.  All a decoder needs to do is
+          register rules based on tag nesting patterns associated with the
+          ASN.1 modules with a BERDigester.  A generic set of rules may be
+          created or rules can be generated to be registered.  The generated
+          decoder would hence amount to setting up the BERDigester by
+          registering rules with it for tag patterns.  Code would also be added to funnel
+          data into the underlying BERDecoder as well as code to respond to
+          the callbacks generated by the top level PDU decoder.
+        </p>
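As a rough sketch of this idea, a generated decoder amounts to little more than registering rules for tag patterns with a digester. The class and method names below, and the tag int values, are my own illustrative assumptions, not the actual Snickers API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a generated decoder is mostly digester setup.
// addRule and the tag constants are illustrative assumptions.
class GeneratedDecoderSketch {
    static final int SEQUENCE_TAG = 0x10000000; // made-up canonical tag ints
    static final int INTEGER_TAG  = 0x02000000;

    static final List<int[]> patterns = new ArrayList<>();
    static final List<Runnable> rules = new ArrayList<>();

    // associate a rule with a tag nesting pattern
    static void addRule(int[] tagPattern, Runnable rule) {
        patterns.add(tagPattern);
        rules.add(rule);
    }

    // the "generated" part: pure rule registration
    static int setup() {
        rules.clear();
        patterns.clear();
        addRule(new int[] { SEQUENCE_TAG, INTEGER_TAG },
                () -> System.out.println("seq/int rule fired"));
        return rules.size();
    }

    public static void main(String[] args) {
        System.out.println("rules registered: " + setup());
    }
}
```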
+      </subsection>
+    </section>
+    <section name="BERDigester Design">
+      <subsection name="Nesting Stack">
+        <p>
+          The design documentation for the BERDecoder discusses an encoding
+          for Tags using the first four octets of a Tag field.  These four
+          octets are stored within a primitive Java int.  These ints represent
+          the raw Tag of a TLV tuple.  We could push these raw Tag ints onto
+          a stack: let's for the time being call this primitive Java int stack
+          the tagStack.  Every time a new TLV Tuple is delivered, the digester
+          pushes the Tag int onto the tagStack.  When processing of the TLV
+          Tuple is finished, the tagStack is popped removing the Tag int for
+          the Tuple.  The sequence of Tag ints within the tagStack represents
+          the reverse nesting order of TLV Tuples.  Rules are registered with
+          the digester using int[]s as patterns; these int[]s are compared
+          against the sequence of ints within the tagStack, hence the tag
+          nesting pattern.  All rules registered with a matching pattern of
+          Tag ints are fired in the order they were registered.
+        </p>
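The push/pop bookkeeping and the comparison it enables can be sketched roughly as follows. A boxed Deque&lt;Integer&gt; stands in here for the digester's primitive int stack, and the tag int values are made up for illustration:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Rough sketch of the tagStack bookkeeping described above.
class TagStackSketch {
    static final Deque<Integer> tagStack = new ArrayDeque<>();

    static void tlvArrived(int tag)  { tagStack.push(tag); }
    static void tlvFinished()        { tagStack.pop(); }

    // true when the pattern equals the current nesting, outermost first
    static boolean matches(int[] pattern) {
        if (pattern.length != tagStack.size()) return false;
        int i = pattern.length - 1;
        for (int tag : tagStack) {          // iterates top (innermost) first
            if (pattern[i--] != tag) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        tlvArrived(0x10);                   // outer constructed Tuple arrives
        tlvArrived(0x02);                   // nested Tuple arrives
        System.out.println(matches(new int[] {0x10, 0x02}));
        tlvFinished();                      // nested Tuple processed, popped
        System.out.println(matches(new int[] {0x10}));
    }
}
```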
+        <p>
+          This sounds pretty simple and straightforward at first glance but
+          there are complications.  The BER encoding provides for two ways in
+          which string types can be encoded: as a primitive TLV or as a
+          constructed TLV.  By string types, we refer to all character based
+          string types like IA5String as well as bit strings and octet strings.
+          String types can be encoded using a single TLV as a primitive type.
+          This way the entire string is kept within the value field.  The
+          designers of BER were smart enough to realize that value chunking
+          would need to occur for large strings to enable efficient decoding
+          and encoding.  Hence these string types can optionally be encoded as
+          constructed TLV tuples with smaller nested tuples carrying fragments
+          of the string.  This way encoders would not need to calculate the
+          length of a large string in advance or keep the entire string in
+          memory.  The value can be fragmented and chunked explicitly by the
+          encoding.  This is great, however the Tag int for a primitive TLV of
+          a string type will not equal the Tag int for the constructed TLV of
+          the same string type.  Pattern matching cannot occur under these
+          circumstances.
+        </p>
+        <p>
+          This complication forced me to rethink the use of raw tag integers.
+          Instead of pushing the tag int as is onto the tagStack we could
+          always reduce the raw tag to some canonical form then push that int
+          instead.  Furthermore, all patterns supplied to register rules must
+          also have their ints scrubbed within the int[] pattern, so pattern
+          matching will occur using only the canonical representation of Tag
+          ints rather than actual Tag integers.
+        </p>
+        <p>
+          The primitive/constructed flag exists on a Tag's first octet as a
+          single bit.  We want to prevent variations in this bit from throwing
+          off our pattern matching.  Choosing a canonical form is like picking
+          whether to lowercase or uppercase letters before
+          attempting a comparison.  In our case :-), we have chosen the
+          canonical form to be the tag with this primitive/constructed bit
+          turned off.  So before any comparison occurs all tag integers will
+          be guaranteed to have this bit flag turned off.  In fact all TagEnum
+          int values prefabricated by the compiler will have the Tag int
+          scrubbed, meaning the primitive/constructed bit will be in the off
+          position and hence in the canonical form.
+        </p>
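Concretely, for an OCTET STRING the first tag octet is 0x04 in the primitive encoding and 0x24 in the constructed encoding; they differ only in the 0x20 primitive/constructed bit. Assuming the first tag octet occupies the most significant byte of the tag int (an assumption of this sketch, not necessarily the packing used in the code), scrubbing is a single mask:

```java
// Sketch of canonicalizing tag ints by clearing the primitive/constructed
// bit.  BER puts that flag in bit 0x20 of the first tag octet; the mask
// below assumes that octet sits in the int's most significant byte.
class TagScrub {
    static final int PC_BIT = 0x20 << 24;   // primitive/constructed flag

    static int scrub(int rawTag) {
        return rawTag & ~PC_BIT;
    }

    public static void main(String[] args) {
        int primitiveOctetString   = 0x04 << 24;  // first tag octet 0x04
        int constructedOctetString = 0x24 << 24;  // first tag octet 0x24
        // raw tags differ, canonical forms agree
        System.out.println(primitiveOctetString == constructedOctetString);
        System.out.println(scrub(primitiveOctetString)
                           == scrub(constructedOctetString));
    }
}
```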
+      </subsection>
+      <subsection name="Pattern Matching">
+        <p>
+          Without going into too much detail, matching a rule pattern to the
+          nesting of tag ints within the tagStack seems pretty simple.
+          Compare each int value at a position within both int sequences and
+          see if at any point there are differences.  If there are differences
+          the pattern does not match the tag nesting pattern.  If there are no
+          differences then it's a match.  The issues with this algorithm start
+          popping up when we consider performance and the number of rules that
+          must be compared within the digester's rule base (set of registered
+          rules).  To calculate the matching rules for a nesting pattern we
+          would have to perform N comparisons where N is the number of
+          registered rules.  These calculations would have to occur every time
+          a tag int is pushed onto the stack.  If you ask me this brute force
+          approach is unacceptable.
+        </p>
+        <p>
+          To reduce the need to compute N comparisons we have built a special
+          data structure used to store, and rapidly search for all matching
+          rules.  The data structure is a tree of TagNodes.  Every TagNode
+          in the tree stores the canonical representation of a tag int and a
+          set of rules.  TagNodes may have child TagNodes keyed by their tag
+          int value.  A depth first tree walk is conducted to search for the
+          set of rules that are matched by the current nesting pattern.  The
+          tree walk starts at the root of the tree using the Tag int value to
+          determine the child to lookup.  The walk then continues with the
+          child and the next Tag int used to lookup the next child and so on.
+          If the walk consumes all tag ints within the stack and lands on a
+          TagNode then all the rules at that node have matched the nesting
+          pattern.
+        </p>
+        <p>
+          This tree data structure is built as rules are registered with the
+          digester.  Each rule registered either adds a new path of TagNodes
+          to the tree, or adds itself to an existing node within the tree.
+          Once built the structure can be searched to determine the rules
+          matched by a nesting pattern.  The data structure is small, cheap
+          and turns an O(nm) search operation, where n is the number of
+          rules and m is the current tag nesting depth, into an O(m) search
+          operation which is much better.
+        </p>
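The registration and walk described above can be sketched as follows; the TagNode, register and match names are mine, not the actual Snickers classes, and rules are reduced to strings for brevity:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the TagNode tree: each node keys children by tag int and
// holds the rules matched when a walk ends on it.
class TagNode {
    final Map<Integer, TagNode> children = new HashMap<>();
    final List<String> rules = new ArrayList<>();

    TagNode child(int tag) {
        return children.computeIfAbsent(tag, t -> new TagNode());
    }

    // register a rule along a path of canonical tag ints
    static void register(TagNode root, int[] pattern, String rule) {
        TagNode node = root;
        for (int tag : pattern) node = node.child(tag);
        node.rules.add(rule);
    }

    // walk the tree with the nesting pattern; empty list if no match
    static List<String> match(TagNode root, int[] nesting) {
        TagNode node = root;
        for (int tag : nesting) {
            node = node.children.get(tag);
            if (node == null) return List.of();
        }
        return node.rules;
    }

    public static void main(String[] args) {
        TagNode root = new TagNode();
        register(root, new int[] {0x10, 0x02}, "r1");
        System.out.println(match(root, new int[] {0x10, 0x02}));
        System.out.println(match(root, new int[] {0x10, 0x04}));
    }
}
```

The search cost is one map lookup per tag int on the stack, which is where the O(m) behavior comes from.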
+        <p>
+          The only problem I have at this point is how to implement the
+          pattern wildcards used when registering rules.  I have some ideas
+          on how this can be done though.  First of all a special reserved tag
+          int value must be selected so it does not conflict with any valid
+          tags we might encounter.  I think any of the UNIVERSAL tags above
+          32 can be used for this but I would have to check.  Presuming this
+          were known already there are four kinds of wildcard use cases which
+          I have outlined below:
+        </p>
+        <ol>
+          <li>wildcard at the front of the pattern</li>
+          <li>wildcard at the end of the pattern</li>
+          <li>wildcard in the middle of the pattern</li>
+          <li>two or more wildcards in the middle of the pattern</li>
+        </ol>
+        <p>
+          There is no way we would accommodate all these use cases.  First off
+          some of them will be rarely if ever used.  Secondly their
+          implementation would either be impossible or too costly
+          for the return in expressivity.  At this point I plan to
+          implement any of these use cases on an as needed basis.  However
+          the two noteworthy use cases that stick out are use case #1 and #2.
+          When handling fragmented strings encoded using constructed tuples
+          #2 will come in handy if we want a rule to fire off for each
+          fragment.  Likewise #1 will be useful as well.  Sometimes a pattern
+          may occur multiple times in different places and you want to detect
+          them all using a single rule.  In this case, a pattern with a
+          wildcard at the front (case #1) does the job.  Use cases #3 and #4
+          are very difficult to implement not to mention expensive in terms
+          of processing.  They just don't seem worth it.
+        </p>
+        <p>
+          Now the question is how do we search with a wildcard in the front or
+          at the tail end of the pattern.  Perhaps the easiest of the two to
+          handle would be a wildcard in the tail end of the pattern.  A search
+          could be conducted using everything but the wildcard as if it were a
+          regular search without wildcards.  If this lands on a TagNode then
+          all the rules of that node and that node's children have been
+          matched by the pattern.  The other use case, #1, is not as easy to
+          implement.  For this another special TagTree could be built, however
+          it would be assembled in reverse order using only the tails of those
+          patterns starting with a wildcard.  If for example *-4-8-9 and
+          *-6-9-1 patterns are used to register rules r1, and r2 respectively
+          then the root node would contain two child nodes one for a tag int
+          equal to 9 and another equal to 1.  These two rule patterns would
+          result in two tree branches.  Again when adding new patterns we reuse
+          what already exists before branching.  To search these trees for a
+          match the reverse nesting pattern is used to walk the tree.  Take
+          for example the pattern 1-8-9.  This would start off at the root,
+          select the child node with a tag int equal to 9, then select the
+          child under node 9 with a tag int equal to 8.  At this point we can
+          go no further, so a check is performed to see if the node
+          we are stuck at is a leaf node.  If it is a leaf node then we have a
+          match for the rules at that node, if not then no rules were matched.
+          For 1-8-9 no rules are matched.  Now use 1-6-4-8-9 as the nesting
+          pattern to test.  Using the same algorithm we find ourselves stuck
+          walking the tree at node 4 which is a leaf node.  In this case the
+          rules in node 4 have been matched by the nesting pattern.  Keep in
+          mind that the search against the reverse TagTree is conducted in
+          addition to the search against the forward TagTree.  It's easy to see
+          that this use case is far more painful to implement.
+        </p>
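The reverse-tree walk can be sketched using the *-4-8-9 (rule r1) and *-6-9-1 (rule r2) example from above. The class shape is my own illustration of the idea, not an actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the reverse TagTree used for patterns beginning with a
// wildcard: pattern tails are stored in reverse, and a nesting pattern
// matches only when its reversed walk gets stuck on a leaf node.
class ReverseTagTree {
    final Map<Integer, ReverseTagTree> children = new HashMap<>();
    String rule;   // rule stored at the leaf ending a pattern tail

    // register the tail of a *-prefixed pattern, in reverse order
    void register(int[] tail, String rule) {
        ReverseTagTree node = this;
        for (int i = tail.length - 1; i >= 0; i--) {
            node = node.children.computeIfAbsent(tail[i],
                                                 t -> new ReverseTagTree());
        }
        node.rule = rule;
    }

    // walk the reversed nesting pattern; null when no rule matches
    String match(int[] nesting) {
        ReverseTagTree node = this;
        for (int i = nesting.length - 1; i >= 0; i--) {
            ReverseTagTree next = node.children.get(nesting[i]);
            if (next == null) break;
            node = next;
        }
        return node.children.isEmpty() ? node.rule : null;
    }

    public static void main(String[] args) {
        ReverseTagTree root = new ReverseTagTree();
        root.register(new int[] {4, 8, 9}, "r1");
        root.register(new int[] {6, 9, 1}, "r2");
        System.out.println(root.match(new int[] {1, 8, 9}));       // stuck mid-tree
        System.out.println(root.match(new int[] {1, 6, 4, 8, 9})); // lands on leaf
    }
}
```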
+      </subsection>
+      <subsection name="Composition Stacks">
+        <p>
+          When rules fire they do something either to a stack or to an object
+          on a stack.  Either way rules pop, push or peek onto digester stacks
+          to do their bidding.  These stacks are used to build the final
+          product, which is a containment tree, and hence are referred to as
+          composition stacks.
+        </p>
+        <p>
+          The digester contains several stacks.  It contains a primitive Java
+          type stack for every Java primitive type.  There is a stack for
+          booleans, bytes, chars, shorts, ints, longs, floats, and doubles.
+          In addition to these stacks there is another Object stack.  Multiple
+          primitive stacks were used to avoid having to wrap primitive types
+          in their object wrappers just to push them onto an Object stack.
+          Some might think this is excessive, and others may think it
+          increases complexity.  If the use of a single Object stack is
+          thought through then it becomes apparent that neither is the case.
+        </p>
+        <p>
+          If a single Object stack were to be used then rules dealing with
+          primitive types would need to wrap them with their respective Object
+          wrappers before pushing them onto the stack.  When popping Objects
+          from the stack rules must know what type the popped object is
+          expected to be.  If multiple stacks were used it would be much
+          easier to just get the primitive from another stack rather than
+          popping the Object stack and casting the popped Object to the Object
+          wrapper just to extract the primitive value.  Having a primitive
+          stack for each Java primitive type is a PITA for the developer, but
+          it simplifies things for the user.  Plus there's the advantage of not
+          having to wrap and extract primitive types just to pass them around
+          in the stack.
+        </p>
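A minimal sketch of one such per-type primitive stack, assuming nothing beyond what the text describes (this is an illustration, not the actual Snickers implementation):

```java
import java.util.Arrays;

// Minimal int stack backed by a growable array; rules read and write
// primitives directly with no Integer boxing or casting.
class IntStack {
    private int[] values = new int[16];
    private int size;

    void push(int value) {
        if (size == values.length) {
            values = Arrays.copyOf(values, size * 2);
        }
        values[size++] = value;
    }

    int pop()  { return values[--size]; }
    int peek() { return values[size - 1]; }
    int size() { return size; }

    public static void main(String[] args) {
        IntStack ints = new IntStack();
        ints.push(7);
        ints.push(42);
        // a rule gets the int back directly, no cast or unboxing required
        System.out.println(ints.pop());
        System.out.println(ints.peek());
    }
}
```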
+      </subsection>
+      <subsection name="Decoder Chaining/Stacking">
+        <p>
+          If one looks closely the digester contains a BERDecoder member.  This
+          decoder is the source of all TLV Tuple events.  The digester is
+          designed as a decoder itself that feeds incoming byte buffers to this
+          underlying BERDecoder.  An inner class, DigesterCallback, is used to
+          implement a BERDecoderCallback.  This inner class bridges BERDecoder
+          TLV Tuple events, delivering them to the digester.  The digester in
+          turn updates its tag nesting stack, and processes the new TLV
+          announced with the event.  The digester detects the end of a
+          Protocol Data Unit (PDU) when it pops the last TLV tuple off of the
+          tagStack.  When this happens the constructed object which should
+          have been popped off of the Object stack is returned using a
+          callback as the root of the containment tree.
+        </p>
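The chaining idea, with the digester owning an inner decoder and bridging its callbacks, can be sketched as follows. The interface and class names are loose stand-ins for BERDecoder/BERDecoderCallback, not the real API, and the "decoding" is trivialized:

```java
// Sketch of decoder chaining: the outer decoder funnels input to an
// inner decoder and receives its events through a callback bridge.
interface TupleCallback {
    void tupleDecoded(int tag);
}

class InnerDecoder {
    private TupleCallback callback;

    void setCallback(TupleCallback cb) { callback = cb; }

    // pretend each value fed in decodes straight to one TLV tag
    void decode(int tagByte) { callback.tupleDecoded(tagByte); }
}

class DigesterSketch {
    private final InnerDecoder decoder = new InnerDecoder();
    int tuplesSeen;

    DigesterSketch() {
        // the callback bridge: decoder events update digester state
        decoder.setCallback(tag -> tuplesSeen++);
    }

    // the digester is itself a decoder; it just funnels data downward
    void decode(int tagByte) { decoder.decode(tagByte); }

    public static void main(String[] args) {
        DigesterSketch digester = new DigesterSketch();
        digester.decode(0x10);
        digester.decode(0x02);
        System.out.println("tuples: " + digester.tuplesSeen);
    }
}
```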
+        <p>
+          The digester is a great example of how decoders can be chained or
+          stacked together in almost a linear combination.  It's like stacking
+          filters on top of one another.
+        </p>
+      </subsection>
+    </section>
+  </body>

Modified: incubator/directory/snickers/trunk/xdocs/ber-codec/index.xml
--- incubator/directory/snickers/trunk/xdocs/ber-codec/index.xml	(original)
+++ incubator/directory/snickers/trunk/xdocs/ber-codec/index.xml	Fri May 28 00:19:35 2004
@@ -57,6 +57,11 @@
+          <td><a href="./BERDigesterDesign.html">BER Digester Design</a></td>
+          <td>Explains how and why the BERDigester was designed</td>
+        </tr>
+        <tr>
           <td><a href="./BEREncoderUserGuide.html">BER Encoder User Guide</a></td>
           <td>Describes how to use the BEREncoder to generate a TLV stream</td>
