Key concepts of creating protocol parsers in dpkt

by Oscar Ibatullin [https://github.com/obormot] a contributor/maintainer of dpkt.

Parser class definition

Let’s look at the IPv4 parser, defined in dpkt/ip.py, as an example.

class IP(dpkt.Packet):
    """Internet Protocol."""

    __hdr__ = (
        ('_v_hl', 'B', (4 << 4) | (20 >> 2)),
        ('tos', 'B', 0),
        ('len', 'H', 20),
        ('id', 'H', 0),
        ('_flags_offset', 'H', 0),
        ('ttl', 'B', 64),
        ('p', 'B', 0),
        ('sum', 'H', 0),
        ('src', '4s', b'\x00' * 4),
        ('dst', '4s', b'\x00' * 4)
    )
    __bit_fields__ = {
        '_v_hl': (
            ('v', 4),   # version, 4 bits
            ('hl', 4),  # header len, 4 bits
        ),
        '_flags_offset': (
            ('rf', 1),  # reserved bit
            ('df', 1),  # don't fragment
            ('mf', 1),  # more fragments
            ('offset', 13),  # fragment offset, 13 bits
        )
    }
    __pprint_funcs__ = {
        'dst': inet_to_str,
        'src': inet_to_str,
        'p': get_ip_proto_name
    }

A lot is going on in the header, before we even got to __init__! Here is the breakdown:

Note the main class IP inherits from dpkt.Packet
__hdr__ defines a list of fields in the protocol header as 3-item tuples: (field name, python struct format, default value). The fields are arranged in the order they appear on the wire.

Field names generally follow the protocol definitions (e.g. RFC), but there are some rules to naming the fields that affect dpkt processing:
- a name that doesn’t start with an underscore represents a regular public protocol field. Examples: tos, len, id
- a name that starts with an underscore and contains NO more underscores is considered private and gets hidden in __repr__ and pprint() outputs; this is useful for hiding fields reserved for future use, or fields that should be decoded according to some custom rules. Example: _reserved
- a name that starts with an underscore and DOES contain more underscores is similarly considered private and hidden, but gets processed as a collection of multiple protocol fields, separated by underscore. Each field name may contain up to 1 underscore as well. These fields are only created when the class definition contains matching property definitions, which could be defined explicitly or created automagically via __bit_fields__ (more on this later). Examples:
  - _foo_bar_m_flag will map to fields named foo, bar, m_flag, when the class contains properties with these names (note foo_bar_m will be ignored since it contains two underscores).
  - in the IP class the _v_hl field itself is hidden in the output of __repr__ and pprint(), and is decoded into v and hl fields that are displayed instead.
The second component of the tuple specifies the format of the protocol field, as it corresponds to Python’s native struct module. 'B' means the field will decode to an unsigned byte, 'H' - to an unsigned word, etc. The default byte order is big endian (network order). Endianness can be changed to little endian by specifying __byte_order__ = '<' in the class definition.
Next, __bit_fields__ is an optional dict that helps decode compound protocol fields, such as _v_hl or _flags_offset in the IP class. Each field name (as it appears in __hdr__) maps to a list (technically a tuple) of tuples, defining the bit fields in the network order (from high to low). Each tuple is (bit field name, size in bits).

The total sum of bit sizes must match the overall size of the placeholder field. For example, _v_hl is decoded to 1 byte ('B'), or 8 bits; v (the IP version) occupies the high 4 bits and hl (IP header length) occupies the lower 4 bits.

_flags_offset that is 2 bytes long ('H') is decoded into 3 1-bit flags followed by a 13-bit offset, total of 16 bytes.

Similarly to the naming rules of __hdr__, a bit field name starting with an underscore is made invisible in the output.

When dpkt processes __bit_fields__ it auto-creates class properties that enable interfacing with the bit fields directly, specifically: get the value (ip.v), modify the value (ip.v = 6), and reset the value back to its default (del ip.v).

In certain cases, auto-properties can’t be applied; they still can be created explicitly. Look at class SMB inside dpkt/smb.py in how it decodes the pid protocol field.
Next, __pprint_funcs__ is an optional dict that does not control protocol decoding, but helps with pretty printing of the decoded packet using the pprint() method. Each key in this map is a name of the protocol field, and each value is a callable that will be run with a single argument of the protocol field value.

For example, it’s nice to see human readable IP addresses for src and dst fields by passing the raw bytes to inet_to_str function.

Standard methods

Let’s look at the standard methods of the Packet class and how they contribute to parsing (aka unpacking or deserializing) and constructing (aka packing or serializing) the packet.

class IP(dpkt.Packet):
    ...
    def __init__(self, *args, **kwargs):
        super(IP, self).__init__(*args, **kwargs)
        ...

    def __len__(self):
        return self.__hdr_len__ + len(self.opts) + len(self.data)

    def __bytes__(self):
        # calculate IP checksum
        if self.sum == 0:
            self.sum = dpkt.in_cksum(self.pack_hdr() + bytes(self.opts))
        ...
        return self.pack_hdr() + bytes(self.opts) + bytes(self.data)

    def unpack(self, buf):
        dpkt.Packet.unpack(self, buf)
        ...
        self.opts = ...  # add IP options
        ...
        self.data = ...  # bytes that remain after unpacking

    def pack_hdr(self):
        buf = dpkt.Packet.pack_hdr(self)
        ...
        return buf

Instantiating the class with a bytes buffer (ip = dpkt.ip.IP(buf)) will trigger the unpacking sequence as follows:

__init__(buf) calls self.unpack(buf)
Packet.unpack() creates protocol fields given in __hdr__ as class attributes, and sets self.data to the remaining unparsed bytes in the buffer.

Child classes typically extend the Packet.unpack() method to create additional custom attributes, that are not given in the __hdr__ (such as opts for IP options below).

Packing is the opposite of unpacking of course; given an instance of a parsed packet, packing will return serialized packet as a bytes object (bytes(ip) => buf). It goes as follows:

Calling bytes(obj) invokes self.__bytes__(obj)
Packet.__bytes()__ calls self.pack_hdr() and returns its result with appended bytes(self.data). The latter recursively triggers serialization of self.data, which could be another packet class, e.g. Ethernet(.., data=IP(.., data=TCP(...))), so everything gets serialized.
Packet.pack_hdr() iterates over the protocol fields given in __hdr__, calls struct.pack() on them and returns the resulting bytes.

Child classes typically extend the Packet.__bytes__() method to process custom attributes, that are not given in the __hdr__, or to override some values before pack_hdr() turns them into bytes. See how the IP parser overrides __bytes__ to calculate the IP checksum prior to packing, and insert bytes(self.opts) between the packed header and data.

len

__len__() returns the size of the serialized packet and is typically invoked when calling len(obj). Note how in the IP class, this method calls other functions to calculate size, then sums the lengths together, and it does not perform serialization. It may be tempting to implement __len__ by serializing the packet into bytes and returning the size of the resulting buffer (return len(bytes(self))). While this works and is acceptable in some cases, dpkt views this as an anti-pattern that should be avoided.

repr and pprint()

These methods are provided by dpkt.Packet and are typically not overridden in the child class. However they are important to understand when developing protocol parsers. Both repr() and pprint() are responsible for the output, and both produce valid interpretable Python, but there are some differences:

__repr__ returns a short one-liner printable string, while pprint() actually prints and returns nothing
__repr__ does not include protocol fields if their value is default, i.e. it will only display a field when it differs from the default. Example: in IPv4 the version always equals 4 so normally field v is not included.
pprint() is verbose; its output is one field per line, indented, outdented and commented, and contrary to __repr__ it includes all protocol fields, even when their value IS default.
__repr__ does not use the __pprint_funcs__ and returns raw values. See below how src and dst IP addresses get human readable interpretation with pprint(), but not with __repr__.

# repr()
>>> ip
IP(len=34, p=17, sum=29376, src=b'\x01\x02\x03\x04', dst=b'\x01\x02\x03\x04', opts=b'', data=UDP(sport=111, dport=222, ulen=14, sum=48949, data=b'foobar'))

# IP version field is default and is not returned by repr()
>>> ip.v
4

>>> ip.pprint()
IP(
  v=4,
  hl=5,
  tos=0,
  len=34,
  id=0,
  rf=0,
  df=0,
  mf=0,
  offset=0,
  ttl=64,
  p=17,  # UDP
  sum=29376,
  src=b'\x01\x02\x03\x04',  # 1.2.3.4
  dst=b'\x01\x02\x03\x04',  # 1.2.3.4
  opts=b'',
  data=UDP(
    sport=111,
    dport=222,
    ulen=14,
    sum=48949,
    data=b'foobar'
  )  # UDP
)  # IP

Key concepts of creating protocol parsers in dpkt

Parser class definition

Standard methods

__len__

__repr__ and pprint()

len

repr and pprint()