directory-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Directory Wiki] Update of "TLVPageInfo" by CKoppelt
Date Fri, 16 Feb 2007 00:14:29 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Directory Wiki" for change notification.

The following page has been changed by CKoppelt:
http://wiki.apache.org/directory/TLVPageInfo

The comment on the change is:
moved to cwiki

------------------------------------------------------------------------------
+ deleted
- ##language:en
- ##master-page:Asn1Home
- ##master-date:
- == TLV Information ==
- === What are TLVs ? ===
- The acronym TLV stands for '''Tag''', '''Length''' and '''Value'''. It's a way to encode
a piece of information with a type, a length followed by the information itself. Two points
must be known :
-  * The '''Value''' part may contents other '''TLVs'''. One can see '''TLVs''' as C structures,
that can contain sub-structures.
-  * The '''Length''' part may not give the '''Value''' length : it is called an indefinite
'''Length'''. Whatever, in this - not so frequent - case, the '''Value''' must end with a
specific terminator.
  
- === A quick sample ===
- Let's begin with a simple example, without too many explanations. This is the '''PDU'''
('''P'''acket '''D'''ata '''U'''nit) of a '''BindRequest''' :
- 
- attachment:TLVs.png
- 
- We can see in this picture that you have what I called a first level TLV. It encapsulates
other TLVs. It's basically a stream of bytes.
- 
- ==== Tag ====
- Each '''Tag''' contains information about the '''Value''' part of the '''TLV'''. It tells
if the '''Value''' is a primitive or a constructed one, which type of '''primitive''' is the
value, gives some contextual information. A '''Tag''' can coded on more than one byte. The
first 3 bits give some contextual information about the tag, and the 5 following bits are
either a label or the beginning of a multi-bytes label.
- 
- Labels are numbers used to identify elements in a '''SET''' (see Asn.1 grammar), for instance.
Generally, we don't have to deal with label above 30, which can be encoded in 5 bits (so this
kind of '''Tag''' will be 1 byte long), and never above 1024. In the LDAP ASN.1 grammar, no
label exceed 19 (in ''LdapMessage'', the ''ExtendedResponse'' label is 19), so we can focus
on 1 byte tags. Whatever, it could be interesting to accept longer labels to be able to support
any LDAP evolution (or other protocols, as this '''Tag''' decoder is not specifically written
for LDAP)
- 
- ''In the current setup we use a Java primitive int to store a maximum of 4 Tag field bytes.
 The Tags within a protocol can be pre-fabricated.  An enumeration constant can be reused.
 This way all we need is a 4 byte reference to the constant to represent the tag value.  This
works well especially for using the enumeration type's value for switch statments.  Today
here's how we pack Tag bytes into a Java primitive int.''
- 
- [http://incubator.apache.org/directory/subprojects/asn1/images/tag-integer-encoding.png]
- 
- ''Looking at the automaton below it occurred to me that we can create two different Tag
decoders.  A stateless one for protocols that only requires a single byte Tag.  You get whole
bytes delivered in buffers so no need to keep state right?  The stateless Tag really simplifies
the automaton ;).  For multibyte protocols we do collect bytes in a TagOctetCollector which
is just a wrapper around a Java int.  Do you think this needs to be revamped?  In the case
of a protocol where the number of Tag octets can be greater than 1 we can use a less efficient
stateful Tag decoder which needs the full logic of the automaton.  I know this is just some
extra if-then-else logic but that adds up for every byte.  So we can swap out the Tag decoder
used based on the protocol.  Most hopefully will take the simple tag decoder and not pay too
much of a penalty.''
- 
- ''(Having two decoders will lead, I think, to more complexity than necessary. The idea,
basically, is that we don't need to have a TagDecoder, a Length Decoder and a Value Decoder.
All those decoder could perfectly be wrapped in a TLV decoder - I'm working on that, actually
-, which will be really more efficient, even if the statefull decoder is much more complicated
(ok, it's not that complex ;-). In a way, to discriminate between a single byte and a multibyte
just need an 'if' test, wich is really fast. I perfectly agree that we could switch from a
single/multi byte decoder depending on the protocol we are decoding, so why don't we keep
it beside for further optimisations, if needed?)''
- 
- ''(Note : the Tag decoder implanting this state automaton is somehow much fatser than the
one which is used : 5 times. But this number has to be check again to be sure that it does
not break the base code)''
- 
- ''BTW note that if we have a protocol with more that 30 tags we only need one multibyte
tag decoder per stream.  This is a per connection setup cost paid once and not for every Tag
that comes through.''
- 
- Decoding a Tag has to follow the finite state automaton showed on this picture :
- 
- attachment:TagStateAutomaton.png
- 
- (Thanks to Poseidon [http://www.gentleware.com/] or Argo UML [http://argouml.tigris.org/])
- 
- In this diagram, ''bb'' stands for ''ByteBuffer''. It contains the stream of bytes to be
decoded.
- 
- Other interesting information that we need to grab from a '''Tag''' are stored in the two
first bits (bit 7 and 6), and in the third bit (bit 5). The first two describe the class,
the third tells if the '''TLV''' is a ''primitive'' (b5 = 0) or a ''constructed'' '''TLV'''
(b5 = 1).
- 
- As we can see, we have to deal with the special case where the stream does not contain enough
bytes to decode a multi-byte '''Tag'''. In this case, the automaton will exit with a state
''TAG_PENDING''. So the state automaton has two different start state : ''TAG_START'' and
''TAG_PENDING''. While the ''TAG_DONE'' is not reached, we have to keep '''Tag''' data somewhere.
There are many ways to fulfill this requirement.
-  1. the '''Tag''' encoder can be instanciated each time a new '''Tag''' is to be decoded,
and it will store the current state
-  2. a session can be stored within the decoder, and will be returned back to the caller
if a ''STATE_PENDING'' state is reached. The caller will have to give back this session to
the decoder in order to finish the decoding.
-  3. the caller may have to create a container and pass it as a parameter to the decoder.
The decoder will store the current state in this container.
- 
- The second option is of no help in this simple case. It's too complicated, and will be much
slower than any of the two others options. We have to keep in mind that 99% of the '''Tag'''
will be contained in one byte, and the probability that the stream stops just in the middle
of a '''Tag''', even if not equal to zero, is very low. So we have to keep the decoding process
simple (KISS : http://digital-web.com/articles/keep_it_simple_stupid). 
- 
- I don't like the idea of instanciating new decoders when a new '''Tag''' arrives. We have
to separate action and data. 
- 
- So it leads to the third solution : calling the unique decoder with a container. It's quite
easy to implement.
- 
- ==== Length ====
- 
- '''Length''' gives the number of bytes of the '''Value''', and nothing else. So the total
length of a '''TLV''' will be :
- 
- ''TLV length = Tag length + Length length + Value length'', where ''Value length'' is stored
in the '''Length''' element.
- 
- The '''Length''' may be 0, which means that there is no value following.
- 
- How is length encoded? A '''Value''' may be from 0 to N bytes long, with N < (256 ^ 126)
- 1. This limit is purely hypothetic, of course. If we have to deal with huge objects like
pictures or movies, their length will not exceed a few MBytes or a few GBytes. Just keep in
mind that they will use the network bandwith for seconds or minutes while being transferred,
so longer values could produce a total network exhaustion.
- 
- Typically, we will find five kind of '''Values''' length : 
-  * zero length values;
-  * some of those '''Values''' have a length which is less than 128 bytes long;
-  * other could have a length between 128 and 256 ^ 4 bytes long (an ''int'' will be able
to hold 4 bytes);
-  * values above this 256 ^ 4 bytes long
-  * and values which length is not defined by the '''Length''' element.
- 
- The last type of '''Length''' could occurs if the sender does not know the length of the
value while it is sending it. '''LDAP''' protocol does not allow those kind of values, which
are dangerous because you need to read the full '''Value''' to know its length (you are not
anymore able to implements strategy that discard any '''Length''' above a certain number,
for instance)
- 
- The fourth type could also be ignored (4 GBytes is quite a huge size for an LDAP element
...), so we can decide that we won't accept those Values. It seems reasonable.
- 
- What about the decoding? We can also represent the decoding as a finite state automaton
:
- 
- attachment:LengthStateAutomaton.png
- 
- Note that an error occurs when ''expectedLength > 4 or expectedLength == 0x7F'', which
is quite redundant. In fact, those two conditions lead to different errors : 0X7F is a reserved
value, while 4 is an arbitrary length limit. You will get two different exceptions : '''LengthOverflowException'''
and '''ReservedExtensionException'''.
- 
- There is nothing special about this automaton, it's quite an obvious one. We will only store
an '''int''' in the '''Length''' container, as it contains 4 bytes. Optionnaly, we could extend
the automaton to accept more than 4 bytes length values, but its quite unlucky to be ever
necessary (we aren't storing DVD movies in a LDAP server, are we? ;-)
- 
- ==== Value ====
- 
- '''Value''' is the last element to be decoded. It's the one that contains valuable informations.
There are two kind of '''Value''' : ''Primitive Values'' and ''Constructed Values''.
- 
- ==== Primitive Value ====
- Those '''Value''' are terminal. They can't contain another TLV. In the sample, we have 4
primitive '''Values''' :
-  * 02 01 01 which is an ''Integer'' representing the Message ID #1
-  * 0A 01 00 which is an ''ENUMERATED'' which value is 0 (success for a BindResponse LDAP
Message)
-  * the last two 04 00 are OCTET STRINGs with no value, as the ''Length'' is 0.
- 
- We can see that even if we can read a '''TLV''', the semantic is carried by the upper layer.
'''TLV''' are just structured containers, no more.
- 
- A primitive value may be quite big : even if  we have limited ourself to a 4 bytes long
length, the value part could still contain 2 ^ 32 - 1 bytes (4 294 967 295 bytes ...), so
we must take care of those big values. It can't be an option to keep the full value in memory,
we must find a way to store them on disk, for instance. Of course, it's not an option either
to flush to disk all the values.... 
- 
- The last point is that a primitive value has a fixed size, given by the '''Lengh''' part
of the '''TLV'''. It will be used for ''constructed values''.
- 
- ==== Constructed Value ====
- Constructed values are '''TLV''' with inner '''TLVs''' in its '''Value'''. It has a '''Length'''
which is the sum of all its inner '''TLVs'''. A bit is set in the '''Tag''' to distinguish
a ''Constructed value'' from a ''Primitive value'' : the 5th bit of the first tag's byte.
In the sample, the first '''TLV''' tag is ''30'', which can be read 00__'''1'''__1-0000 binary.
The 5th bit (bold) is set : it's a ''constructed value''.
- 
- === Decoding TLVs ===
- For any further information, one should read [http://www.itu.int/ITU-T/studygroups/com17/languages/X.690-0207.pdf]
which explain BER encoding, but be aware that you also need to read [http://www.itu.int/ITU-T/studygroups/com17/languages/X.680-0207.pdf].
They are available for free, which is quite cheap compared to sleeping pills !
- 
- Decoding a ''Primitive Value'' is somehow easy. Decoding a ''Constructed Value'' is a recursive
process. We can't simply write a function that decode a '''TLV''' which call itself recursivly,
it's too expensive and does not allow to treat deeply recursive '''TLV''' in a constrained
memory environment. 
- 
- We will use a stack which will store the current '''TLV''' until they are totally decoded
(a totally decoded '''TLV''' is either a ''Primitive'', or a ''Constructed'' which value has
been read).
- 
- The '''TLV''' we exposed on top of this page could be seen as a tree. To decode it, we need
to walk this tree. The next picture show the sample as a tree :
- 
- attachment:TlvTree.png
- 
- Each '''TLV''' in this picture has two length : the inner length (''l'', blue) and the outer
length (''L'', red). Inner length is the '''Length''' part of a '''TLV'''. Outer length equals
length('''Tag''') + length('''Length''') + inner length. We can see that the inner length
of a constructed '''TLV''' equals the sum of all the outer length of each of its children.
- 
- This drive us to the fact that a constructed '''TLV''' is totally decoded when this sum
equals its inner length.
- 
- The next picture show the stack trace created while decoding the sample '''TLV''' :
- 
- attachment:TlvStack.png
- 
- while (buffer)
- decode(buffer, tlv);
- 
- __if__ tlv is 
- while ((token = decode(buffer)) != null)
- __if__ token is primitive
-   __then__ 
- === Encoding TLVs ===
- 

Mime
View raw message