nftables

nftables is a component in the Linux Netfilter subsystem that provides the most modern framework for defining packet classification and filtering functionality. As such, it is a successor to the xtables kernel component and its associated iptables / ip6tables / arptables / ebtables userspace utilities.

This documentation uses the term nftables when referring to the Linux kernel component and nft when referring to the userspace utility. Please note that the Ubuntu package that provides the userspace utility is also called nftables and this documentation will refer to it as the nftables Ubuntu package.

This page gradually builds an example nftables configuration file that is adequate for a Linux host which offers network services (such as SSH and HTTP) over IPv4 and IPv6 connectivity to a Local Area Network (LAN).

Advantages

There are several advantages to using nftables over the older alternatives:

  • The expressions forming the packet classification rules are compiled in userspace to bytecode and executed by the kernel using a purpose-built virtual machine; this allows for far more flexibility.

  • High performance can be achieved through maps and concatenations: instead of linear rule processing (O(n)), lookups can run in constant time (O(1)).

  • The syntax used by the userspace nft utility is declarative, instead of the procedural format required for ip/ip6/arp/ebtables, simplifying management of firewall configuration.

  • Tables and chains are not predefined and the structure allows registering an arbitrary number of them: this facilitates the independent management of rules by multiple applications.

  • Packet forwarding can be accelerated by using the flowtables functionality, which also integrates with selected hardware.

  • Common rules for IPv4 and IPv6 can be defined, unlike with the older iptables and ip6tables.

  • Tracing of the rules’ evaluation for specific packets can be easily enabled.
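As a minimal sketch of the maps and concatenations mentioned in the list above, the following table replaces three separate rules with a single verdict map lookup keyed on the transport protocol and destination port. The table name, chain name and choice of ports are hypothetical and are not part of the configuration built on this page:

```nft
table inet vmap-example {
    chain input {
        # Illustrative only: loading this on a remote host would drop any
        # traffic not listed in the map below.
        type filter hook input priority filter; policy drop;

        # One lookup on the concatenation (protocol . destination port)
        # instead of one rule per service.
        meta l4proto . th dport vmap { tcp . 22 : accept, tcp . 80 : accept, udp . 53 : accept }
    }
}
```

Adding a service then means adding an element to the map, so the number of rules evaluated per packet stays the same.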

Compatibility

In general, most of the rules that can be defined using iptables, ip6tables, arptables and ebtables can also be defined using nftables, but not the other way around. Use only one of the two approaches to manage firewall rules. Otherwise, the interaction between the different rules or the services that set them up may be unexpected and lead either to insecure configurations or to the blocking of traffic that is meant to be allowed. Furthermore, certain applications, such as container orchestration systems or VPN utilities, may configure firewall rules themselves, resulting in unexpected rule interactions.

There are still certain xtables rules which cannot be defined using nftables, as documented in the feature compatibility nftables wiki page. The gaps have been reduced over recent nft and Linux kernel releases, which means that older Ubuntu versions might have more limited support. Additionally, it should be noted that some of the functionality available via the nftables Netlink interface may not be supported by the userspace nft utility yet (e.g. support for rules invoking eBPF programs).

Starting with Ubuntu 16.04 Xenial Xerus, the iptables package has provided versions of the iptables, ip6tables, arptables and ebtables tools that work with the nftables API and provide a compatible interface to the legacy implementation. The nftables backend, used by iptables-nft / ip6tables-nft / arptables-nft / ebtables-nft utilities, has been the default since Ubuntu 20.10 Groovy Gorilla. These are managed through the alternatives system and the current configuration can be displayed with the following commands:

update-alternatives --display iptables
update-alternatives --display ip6tables
update-alternatives --display arptables
update-alternatives --display ebtables

The iptables-nft / ip6tables-nft / arptables-nft / ebtables-nft utilities assume that no other application manages nftables rules natively and hence should not be combined with other approaches to Netfilter firewall rule management.

ufw works by invoking the legacy iptables and ip6tables utilities. As such, it should not be used concurrently with native nftables firewall rules.

Usage

nftables rules can be configured by using the userspace nft utility, which is provided by the nftables Ubuntu package. Communication with Netfilter is done over AF_NETLINK sockets, allowing applications to alternatively use this low-level interface. This documentation will only cover the use of the nft utility, with a focus on the configuration file format.

Starting with Ubuntu 15.04 Vivid Vervet, the nftables package provides a systemd service unit file that is disabled by default. If enabled, the service unit file will automatically load nftables configuration from the /etc/nftables.conf file (a mock file that does not perform any filtering is provided in the nftables package). You can enable this and load the configuration using the following commands:

sudo systemctl enable nftables.service
sudo systemctl start nftables.service

Command-line usage

The nft utility can accept one or more commands as arguments to manage any of the objects (tables, rules, sets, etc.) supported. For example, the following command will list all of the firewall rules:

sudo nft list ruleset

All operations are performed atomically: the processing of a packet will either see the firewall rules defined prior to the invocation of the utility or the firewall rules with all of the requested changes applied. The following command will create two tables that process both IPv4 and IPv6 packets:

sudo nft "add table inet foo; add table inet bar"

Changes to the nftables rulesets performed via the utility are ephemeral and will be lost upon a reboot (or, more specifically, upon the destruction of the network namespace to which they are associated). Persistence can be achieved through the aforementioned systemd service unit or similar mechanisms.

As an alternative invocation, a filename can be passed to the nft utility as an argument using the -f flag. The file can contain both commands, as well as object definitions using a declarative syntax, which are implied to be created. As with the command-line usage, all of the operations are performed atomically. The default /etc/nftables.conf file contains a command to delete all of the configured rules (flush ruleset) and a declarative definition of a table named filter that processes both IPv4 and IPv6 packets in three empty chains:

/etc/nftables.conf
#!/usr/sbin/nft -f

flush ruleset

table inet filter {
    chain input {
        type filter hook input priority filter;
    }
    chain forward {
        type filter hook forward priority filter;
    }
    chain output {
        type filter hook output priority filter;
    }
}

The -f option allows the nft utility to be used as an interpreter, as demonstrated in the shebang line above. Given that the file is marked as executable by default, the rules can be atomically reloaded by simply executing the file:

sudo /etc/nftables.conf

Alternatively, the systemd unit file supports the reload command to achieve the same objective:

sudo systemctl reload nftables.service

Configuration file format

The configuration file is line-oriented. Multiple commands can be combined on the same line by separating them with semicolons (;). Comments start with the hash sign (#) and span until the end of the line. Commands can be split across multiple lines by escaping the end-of-line with a backslash (\). A line containing only a comment is skipped over and does not count as the continuation line, whereas an empty line does count as the continuation line and therefore ends the rule. Whitespace (and hence, indentation) does not matter.
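The points above can be illustrated with a short fragment. This is only a sketch; the table and chain names are hypothetical:

```nft
# A comment runs from the hash sign to the end of the line.

# Two commands on the same line, separated by a semicolon.
add table inet syntax-example; add chain inet syntax-example services

# One command split across two lines with a trailing backslash.
add rule inet syntax-example services \
    tcp dport 22 accept    # a comment can also follow a command
```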

Even though the declarative syntax uses braces ({ and }) to define blocks containing an object’s definition, the line-oriented processing is still enforced and must be taken into account (e.g. the opening brace ({) must be on the same line as the object type and name). The following example establishes a base for a host firewall configuration file, which will be expanded upon throughout this documentation:

/etc/nftables.conf
#!/usr/sbin/nft -f

# This empty definition is needed to allow the flush command to work if the
# table is not already defined.
table inet host-firewall; flush table inet host-firewall

# Note that the flush command does not destroy the table or the objects
# contained within, only clearing the rules within all of the chains. Use the
# following instead, if the object definitions need to be changed, chains
# completely destroyed or sets/maps cleared.
#destroy table inet host-firewall

table inet host-firewall {
    chain firewall-input {
        # Process packets destined for this host.
        type filter hook input priority filter;
        # Use a default-deny policy for packets.
        policy drop;
    }
}

When using configuration files to load firewall rules, it is imperative to clear the prior configuration. The declarative syntax will not replace chain rules, but rather append them at the end of the previously defined chain. Deciding which command to use to clear the prior configuration depends on several considerations:

  • flush ruleset will clear the entire nftables configuration, including all tables, chains and rules, sets or maps, stateful objects and their contents, and flowtables. While appropriate for a single central definition of the firewall rules, it may lead to unexpected results if elements in sets or maps are managed externally or if tables are managed by any other application.

  • destroy table will delete a table and all objects associated with it, including chains and rules, sets or maps, stateful objects, and flowtables. This is particularly useful when settings associated with the objects can change across versions of the configuration file (such as the priority associated with a base chain) or when elements in sets or maps should be deleted and recreated.

  • flush table will clear the rules within a table, but will not delete the chains, sets and maps or their elements, stateful objects or flowtables. This is appropriate when elements in sets or maps are managed externally or when state kept by stateful objects should not be reset.
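As a sketch of the flush table case, assume a hypothetical set named blocked_hosts whose elements are added at runtime by an external tool. Reloading the following file clears and re-adds the rules, but should leave the elements of the set in place:

```nft
# Make sure the table exists, then clear only its rules.
table inet host-firewall; flush table inet host-firewall

table inet host-firewall {
    # Hypothetical set; elements are added externally, for example with:
    #   nft add element inet host-firewall blocked_hosts { 192.0.2.10 }
    set blocked_hosts {
        type ipv4_addr
    }

    chain firewall-input {
        type filter hook input priority filter;
        policy drop;

        # Rules are re-created on every reload; set elements are not touched.
        ip saddr @blocked_hosts drop
    }
}
```

Replacing the first line with a destroy table command would instead discard the set together with its elements on every reload.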

The include directive

Files can be included by using the include directive. These are interpreted in the context in which the directive is used. For example, the following highlighted lines allow drop-in files to add rules to the defined firewall-input chain from /etc/nftables/input-rules.d/ and any other tables to be defined in files under /etc/nftables/tables.d/ (if wildcards are used, the files need not exist):

/etc/nftables.conf
#!/usr/sbin/nft -f

# This empty definition is needed to allow the flush command to work if the
# table is not already defined.
table inet host-firewall; flush table inet host-firewall

# Note that the flush command does not destroy the table or the objects
# contained within, only clearing the rules within all of the chains. Use the
# following instead, if the object definitions need to be changed, chains
# completely destroyed or sets/maps cleared.
#destroy table inet host-firewall

table inet host-firewall {
    chain firewall-input {
        # Process packets destined for this host.
        type filter hook input priority filter;
        # Use a default-deny policy for packets.
        policy drop;

        # Drop-in files can add rules here.
        include "/etc/nftables/input-rules.d/*.conf"
    }
}

include "/etc/nftables/tables.d/*.conf"
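For illustration, a drop-in file for the firewall-input chain contains bare rules, because it is interpreted inside the chain block, whereas a file under tables.d is interpreted at the top level and has to define complete objects. The listing below shows two separate files; their names and contents are hypothetical:

```nft
# File: /etc/nftables/input-rules.d/ssh.conf (hypothetical)
# Interpreted inside "chain firewall-input { ... }", so it holds only rules.
tcp dport 22 accept

# File: /etc/nftables/tables.d/counters.conf (hypothetical)
# Interpreted at the top level of /etc/nftables.conf, so it defines a table.
table inet traffic-counters {
    chain count-input {
        type filter hook input priority filter - 10; policy accept;
        counter
    }
}
```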

Symbolic variables

Symbolic variables increase the maintainability of the firewall rules by associating names to arbitrary expressions, which can then be reused throughout the configuration. Associating the name IF_LOOPBACK to the interface name lo (the standard Linux loopback interface) allows defining a rule that references it, as the following highlighted lines show:

/etc/nftables.conf
#!/usr/sbin/nft -f

define IF_LOOPBACK = lo

# This empty definition is needed to allow the flush command to work if the
# table is not already defined.
table inet host-firewall; flush table inet host-firewall

# Note that the flush command does not destroy the table or the objects
# contained within, only clearing the rules within all of the chains. Use the
# following instead, if the object definitions need to be changed, chains
# completely destroyed or sets/maps cleared.
#destroy table inet host-firewall

table inet host-firewall {
    chain firewall-input {
        # Process packets destined for this host.
        type filter hook input priority filter;
        # Use a default-deny policy for packets.
        policy drop;

        # Allow traffic on the loopback interface(s).
        meta iif $IF_LOOPBACK accept

        # Drop-in files can add rules here.
        include "/etc/nftables/input-rules.d/*.conf"
    }
}

include "/etc/nftables/tables.d/*.conf"

If, at a later date, a new loopback interface is created, the set notation can be taken advantage of to only modify the symbolic variable:

define IF_LOOPBACK = { lo, lo1 }

The scope of the symbolic variables is the file interpreted by the nft utility (and any included files), but restricted to the block in which it is defined and all inner blocks, in order to reduce clashes. The symbolic variable is only interpreted in userspace. Any other configuration file passed to nft would not be able to reference it. Similarly, retrieving the ruleset installed in nftables (such as by using the nft list ruleset command) would reconstruct the rules, but without any references to symbolic variables.
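The scoping described above can be sketched as follows. All names and values here are hypothetical: SERVICE_PORTS is defined at the top level and is visible everywhere in the file, while ADMIN_NET is defined inside the table block and is only visible there:

```nft
# File scope: usable in any block below.
define SERVICE_PORTS = { 22, 80 }

table inet variables-example {
    # Table scope: only usable inside this table block.
    define ADMIN_NET = 192.0.2.0/24

    chain input {
        type filter hook input priority filter; policy accept;

        # nft substitutes the values in userspace before sending the rules
        # to the kernel, so a later "nft list ruleset" shows the literals.
        ip saddr $ADMIN_NET tcp dport $SERVICE_PORTS accept
    }
}
```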

Debugging

nftables provides several means to debug firewall rules:

  • Using the log statement, which can be associated with any rule and will result in packet information being logged, either to the kernel log (which can be read via dmesg) or to a userspace application. This is described in more detail in the nftables documentation and in the manual page, under LOG STATEMENT.

  • Setting the nftrace flag on a packet, which allows tracing all of the rules which a packet matches, within all chains and all tables, also identifying any actions taken. This is described in more detail in the nftables documentation.

Log statement

The highlighted lines in the following example demonstrate the use of the log statement to send any packets coming in on the loopback interface to the kernel log, before accepting them:

/etc/nftables.conf
#!/usr/sbin/nft -f

define IF_LOOPBACK = lo

# This empty definition is needed to allow the flush command to work if the
# table is not already defined.
table inet host-firewall; flush table inet host-firewall

# Note that the flush command does not destroy the table or the objects
# contained within, only clearing the rules within all of the chains. Use the
# following instead, if the object definitions need to be changed, chains
# completely destroyed or sets/maps cleared.
#destroy table inet host-firewall

table inet host-firewall {
    chain firewall-input {
        # Process packets destined for this host.
        type filter hook input priority filter;
        # Use a default-deny policy for packets.
        policy drop;

        # Allow traffic on the loopback interface(s).
        meta iif $IF_LOOPBACK \
            # Log the packets...
            log prefix "loopback packet: " \
            # ...and accept them.
            accept

        # Drop-in files can add rules here.
        include "/etc/nftables/input-rules.d/*.conf"
    }
}

include "/etc/nftables/tables.d/*.conf"

Checking dmesg would show messages such as the following (assuming packets are actually flowing through the loopback interface):

[694077.575927] loopback packet: IN=lo OUT= MAC=00:00:00:00:00:00:00:00:00:00:00:00:08:00 SRC=127.0.0.1 DST=127.0.0.53 LEN=73 TOS=0x00 PREC=0x00 TTL=64 ID=24453 DF PROTO=UDP SPT=37969 DPT=53 LEN=53

Rule tracing

The nftrace flag enables tracing of a packet’s flow through nftables rules across chains and tables, from the moment the flag is set to the moment the packet processing is completed or the flag is cleared. This functionality allows complex debugging of nftables firewall rules. The packet information, along with references to the rules traversed, is sent to a userspace application through the netlink interface. The nft monitor trace command can be used to receive this information.

The meta nftrace set 1 statement can be combined with a match expression to set the flag, while meta nftrace set 0 will clear it. If all the rules traversed are to be identified, the flag should be set as early as possible. The highlighted lines in the following example create two chains attached to the prerouting and output hooks, running as early as feasible (even before other chains registered at the raw priority):

/etc/nftables.conf
#!/usr/sbin/nft -f

define IF_LOOPBACK = lo

# This empty definition is needed to allow the flush command to work if the
# table is not already defined.
table inet host-firewall; flush table inet host-firewall

# Note that the flush command does not destroy the table or the objects
# contained within, only clearing the rules within all of the chains. Use the
# following instead, if the object definitions need to be changed, chains
# completely destroyed or sets/maps cleared.
#destroy table inet host-firewall

table inet host-firewall {
    chain trace-inbound {
        # Process after reassembly, but before conntrack lookup and other
        # potential raw chains.
        type filter hook prerouting priority raw - 10; policy accept;

        meta l4proto udp meta nftrace set 1
    }

    chain trace-outbound {
        # Process before conntrack lookup and other potential raw chains.
        type filter hook output priority raw - 10; policy accept;

        meta l4proto udp meta nftrace set 1
    }

    chain firewall-input {
        # Process packets destined for this host.
        type filter hook input priority filter;
        # Use a default-deny policy for packets.
        policy drop;

        # Allow traffic on the loopback interface(s).
        meta iif $IF_LOOPBACK accept

        # Drop-in files can add rules here.
        include "/etc/nftables/input-rules.d/*.conf"
    }
}

include "/etc/nftables/tables.d/*.conf"

The two rules will only match UDP datagrams, but irrespective of whether they’re transported by IPv4 or IPv6 (meta l4proto udp) and then activate rule tracing for those packets (meta nftrace set 1). Running the nft monitor trace command will produce messages such as:

trace id 78653943 inet host-firewall trace-inbound packet: iif "lo" @ll,0,112 0x800 ip saddr 127.0.0.53 ip daddr 127.0.0.1 ip dscp cs0 ip ecn not-ect ip ttl 1 ip id 64669 ip protocol udp ip length 168 udp sport 53 udp dport 36520 udp length 148 @th,64,96 0x2e4881800001000100000004
trace id 78653943 inet host-firewall trace-inbound rule meta l4proto udp meta nftrace set 1 (verdict continue)
trace id 78653943 inet host-firewall trace-inbound policy accept
trace id 78653943 inet host-firewall firewall-input packet: iif "lo" @ll,0,112 0x800 ip saddr 127.0.0.53 ip daddr 127.0.0.1 ip dscp cs0 ip ecn not-ect ip ttl 1 ip id 64669 ip protocol udp ip length 168 udp sport 53 udp dport 36520 udp length 148 @th,64,96 0x2e4881800001000100000004
trace id 78653943 inet host-firewall firewall-input rule ct state established,related accept (verdict accept)

The trace id will be the same for the same packet across different tables and chains, allowing correlation between different output lines. Whenever a packet starts being handled by a chain, a packet line is output with information about the contents of the packet.

It should be noted that the tracing notifications received by the nft monitor trace utility only contain identifier references to the tables, chains and rules. nft monitor trace reads all of the rules when it is first started. The table and chain names and actual rule content are reconstructed from that initial read for every logged packet. This means that if the rules are changed after the nft monitor trace utility is started, the output will either be incomplete or inaccurate (because rule identifiers (handles), in particular, can be reused), so a printed rule may not be the actual rule that a packet matched.
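Tracing every UDP datagram can produce a large amount of output on a busy host. A narrower match keeps the trace readable. The following sketch, which uses a hypothetical table name and a source address from the documentation range, only traces SSH connection attempts from a single IPv4 host:

```nft
table inet trace-example {
    chain trace-ssh {
        # Same early priority as the trace-inbound chain shown earlier.
        type filter hook prerouting priority raw - 10; policy accept;

        # Set the flag only for TCP port 22 traffic from one source address.
        ip saddr 192.0.2.10 tcp dport 22 meta nftrace set 1
    }
}
```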

Bytecode inspection

The bytecode interpreted by the Linux kernel can be observed by using the --debug=netlink argument to nft. This works both for commands that modify rules (e.g. adding a new rule), as well as for those that retrieve rules. The bytecode is printed alongside the rule handle (rule identifier). The --handle option can also be useful to print the handles associated with each rule.

For example, listing the trace-inbound chain created above:

sudo nft --handle --debug=netlink list chain inet host-firewall trace-inbound

Produces the following output:

inet host-firewall trace-inbound 13
  [ meta load l4proto => reg 1 ]
  [ cmp eq reg 1 0x00000011 ]
  [ immediate reg 1 0x00000001 ]
  [ meta set nftrace with reg 1 ]

table inet host-firewall {
    chain trace-inbound { # handle 1
        type filter hook prerouting priority raw - 10; policy accept;
        meta l4proto udp meta nftrace set 1 # handle 13
    }
}

Netfilter integration

The nftables component is integrated into the existing Netfilter subsystem and uses the same hooks, stateful processing for connection tracking or Network Address Translation (NAT), and functionality for userspace packet queueing and processing as the xtables subsystem.

A high-level understanding of the Netfilter framework is important for managing firewall rules. This section provides the necessary information and references additional documentation.

Packet flow

A packet starts being handled by the Linux networking subsystem (and, by extension, by Netfilter) through one of three options:

  • it is received by a network interface driver (whether for a physical Network Interface Controller (NIC) or a virtual one);

  • it is generated by an application process on the system (via a socket);

  • it is generated by the kernel.

Netfilter is integrated into the wider Linux network subsystem. Packet processing will go through multiple decision points, potentially modifying the packet, such as:

  • fragment reassembly;

  • connection tracking;

  • routing decisions;

  • source and destination NAT (including port translation).

Netfilter provides hooks that allow Netfilter components to process a packet at various stages. These are used by both nftables and xtables to execute user-defined rules. In particular, the names of the predefined chains in the legacy iptables / ip6tables / ebtables / arptables utilities are derived from names of the Netfilter hooks:

  • ingress (only available for nftables)

  • prerouting (for bridge and IP)

  • input (for ARP, bridge and IP)

  • forward (for bridge and IP)

  • postrouting (for bridge and IP)

  • output (for ARP, bridge and IP)

  • egress (only available for nftables)

Packets will not traverse all hook points, depending on some of the decisions made during the processing. This is represented graphically in the diagram on the Netfilter hooks nftables wiki page. In particular, the use of bridges will result in a different packet flow, but one which partially overlaps with the flow taken by non-bridged packets.

It should be noted that some of the standard packet processing is performed at some of the hook points (fragment reassembly, connection tracking lookup, NAT), while others are in-between hook points (routing decision). At each hook point, the order of operations is defined by a priority. For example, these are some of the standard operations executed at the IP layer prerouting hook:

Netfilter priority value    Operation
-400                        fragment reassembly
-200                        connection tracking lookup and association
-100                        destination NAT

The list of functions, including nftables chains and standard Netfilter processing, that have been registered can be listed with the following command (note that the numerical values are the priorities and that they are listed in decimal format):

sudo nft list hooks

If you register rules to be executed at a priority value lower than -400 (e.g. -500), these will be executed before IP datagram fragments are reassembled. As such the rules may see IP datagram fragments for which the transport header may not be available, because they are not the first fragment. On the other hand, rules registered at priority value higher than -400 (e.g. -300) would not be able to make decisions based on fragmentation information (the packet would look as if the entire IP datagram was received).

The priority values themselves do not hold any intrinsic meaning, other than the fact that some standard operations are executed at well-known priority values. For example, in absence of other context, registering rules at priority 1000 is no different from using priority 1500. The nftables documentation lists the well-known priority values.
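The effect of the priority on what a rule can see may be sketched with two base chains registered at the same hook. The table and chain names are hypothetical, and the example assumes that fragment reassembly is active on the system (it is typically registered together with connection tracking):

```nft
table ip priority-example {
    chain before-reassembly {
        # -500 is lower than -400, so this runs before fragment reassembly.
        type filter hook prerouting priority -500; policy accept;

        # Count non-first fragments (non-zero fragment offset). Their
        # transport header is not available at this point.
        ip frag-off & 0x1fff != 0 counter
    }

    chain after-reassembly {
        # -300 is higher than -400, so this sees whole IP datagrams.
        type filter hook prerouting priority -300; policy accept;

        # The transport header can be matched reliably here.
        udp dport 53 counter
    }
}
```

Running sudo nft list hooks after loading such a table should show both chains next to the standard operations, ordered by priority.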

A packet stops being handled by the Linux networking subsystem, implying that no more Netfilter hooks would be invoked, when one of these conditions occurs:

  • the packet is dropped, either through a firewall rule or some other condition in the standard processing (e.g. blackhole route);

  • the packet is passed to an application process on the system (via a socket);

  • the packet is handled by the kernel (e.g. ICMP echo request - a ping);

  • the packet is sent out a network interface (whether for a physical NIC or a virtual one).

It should be noted that a particular packet can traverse the Netfilter hooks several times, in conditions such as the following:

  • the packet is sent out a virtual interface that loops the packet back to the same Linux kernel (e.g. veth interfaces), although the list of hooks is not going to overlap completely; FIXME: does this even make sense? Of course a packet sent out a veth is going to come back in on the pair… nfmark is not maintained - for all intents and purposes, this is a new packet.

  • Virtual Routing and Forwarding (VRF) is in use - a packet will traverse the L3 prerouting hook twice, once with the input interface set to the L3 interface and once with the input interface set to the VRF interface.

  • the packet is processed and reinjected by the kernel into the networking stack (e.g. after IPsec encryption/decryption and ESP encapsulation/decapsulation in tunnel mode - although the packet is admittedly different, some of the state is maintained across this operation, such as the Netfilter mark). FIXME: technically, this is a different packet - does it even make sense? Same applies to other L3 encapsulations, e.g. vxlan.

The Netfilter hooks and, hence, the nftables rules are managed independently per network namespace. As such, different firewall rules are configured in each network namespace, facilitating functionality such as containers. This also means that if the two ends of a veth pair are associated with different namespaces, they will be processed by independent firewall rules.

Structure

nftables structures objects for managing the firewall in a hierarchy. The primary terminology used is:

  • Rulesets: this refers to all of the objects defined in nftables; the command nft list ruleset will output everything defined in nftables (within a particular network namespace), while nft flush ruleset will destroy all of the objects: tables, sets, maps, etc. This includes elements defined in sets and maps, or the contents of stateful objects (e.g. counter values). As such, a command such as the following is effectively a no-op (although the state may change between the moment it is read and the moment it is overwritten): (echo "flush ruleset"; nft list ruleset) | nft -f -.

  • Tables: unlike xtables, any number of tables can be defined in nftables. These are collections of chains, sets, maps and stateful objects (e.g. counters). The table name does not hold any intrinsic meaning and can be named by system administrators or applications as desired. Tables are associated with an address family, dictating limitations on chains and determining what Netfilter hooks the chains will be associated with. The address families are documented in the manual page. These are:

    • ip: for IPv4 packets, as what the legacy iptables utility would manage.

    • ip6: for IPv6 packets, as what the legacy ip6tables utility would manage.

    • inet: for both IPv4 and IPv6 packets, simplifying management of consistent rules across both network protocols.

    • arp: for IPv4 ARP packets, as what the legacy arptables utility would manage.

    • bridge: for Ethernet packets traversing bridges, as what the legacy ebtables utility would manage.

    • netdev: for very early (on ingress) or very late (on egress) packet processing. This is useful for efficient filtering or load balancing, but imposes limitations, such as only supporting the ingress and egress hooks and requiring strict association of chains with a single network interface. Note that starting with Linux 5.10, the inet family also supports the ingress hook without the single network interface limitation, largely reducing the usefulness of the netdev address family; using a single table in the inet family would also facilitate the sharing of sets and maps with chains registered at other hooks.

  • Chains: containers for firewall rules; similarly to xtables, there is a distinction between base chains and regular chains. Unlike in xtables, the base chains are not predefined and as many as necessary can be created, including multiple chains at the same hooks (with or without the same priority).

    • base chains have a type, a policy and are registered with a Netfilter hook point at a specific priority. They can also have additional attributes, as described in the manual page. Their rules are evaluated whenever packet processing traverses the specified Netfilter hook.

    • regular chains are simply called upon by rules in other chains and can be thought of as subprocedures. They are useful to simplify maintenance of rules or to optimize rule processing (e.g. by using verdict maps). Rules within a regular chain are not evaluated during the processing of a packet, unless called upon, directly or indirectly, from a base chain.
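The hierarchy described above can be sketched with one table that holds a base chain and a regular chain. All names and addresses are hypothetical (the networks come from the documentation ranges):

```nft
table inet structure-example {
    # Regular chain: no type, hook or priority. It is only evaluated when a
    # rule in another chain jumps to it.
    chain ssh-sources {
        ip saddr 192.0.2.0/24 accept
        ip6 saddr 2001:db8::/32 accept
    }

    # Base chain: registered with the input hook at the filter priority.
    chain input {
        type filter hook input priority filter; policy accept;

        # Evaluate the regular chain for SSH traffic. If no rule there issues
        # a verdict, processing returns here and continues with the next rule.
        tcp dport 22 jump ssh-sources
        tcp dport 22 drop
    }
}
```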

For base chains, the most important attributes are:

  • type: dictates the conditions on which a packet gets processed by the chain and the available hooks. Some statements are only available in certain chain types. The possible values are:

    • filter: generic type, applicable to all address families and all hooks. Used for typical firewall actions, as well as arbitrary packet modifications.

    • nat: this is equivalent to the chains defined in the legacy iptables / ip6tables nat table. Only the first packet of a connection is processed by chains of this type. NAT actions (snat, dnat, masquerade, redirect) can only be taken in these chains.

    • route: this has no equivalent in xtables, but allows the integration of the nftables rules with policy routing. This can only be used with locally-generated packets (either from processes or the kernel), with the only hook available being output. As per the Netfilter flow diagram, the routing decision for locally-generated packets is performed before any hooks. However, if the rules in a chain of type route modify parts of a packet or its metadata (e.g. the Netfilter mark) that are used in policy routing decisions, another route lookup will be performed. Packets received from a network interface do not require this special chain type, as there are several hooks available that can prepare a packet before it goes through routing decisions.

  • hook: the processing point at which rules are evaluated, as described in the Packet flow section. It should be noted that not all hooks are available for all address families and all chain types. The restrictions are listed in the Chains section of the manual page.

  • priority: dictates the order in which chains and other standard Netfilter operations are performed at a particular hook point, as described in the Packet flow section. Can be given as either a symbolic name (e.g. filter, raw, mangle), a signed integer (e.g. 0, -300) or a value relative to a symbolic name (e.g. raw - 10). You should note that symbolic names may map to different integer values, depending on the address family (filter is 0 for inet, ip, ip6, arp and netdev, but -200 for bridge).

  • policy: dictates the verdict that is associated with a packet, if, during processing, none of the matched rules have a verdict. It must be one of accept (the default) or drop.
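
As a minimal sketch of how these attributes appear in a configuration file, the following defines one base chain of type nat and one of type route. The file name, the table and chain names, the eth0 interface and the 0x100 mark value are hypothetical and only serve as an illustration; the ip address family is used as a conservative choice.

/etc/nftables/tables.d/example-chain-types.conf
#!/usr/sbin/nft -f

destroy table ip example-chain-types
table ip example-chain-types {
    chain nat-postrouting {
        # NAT chain: only the first packet of a connection is evaluated here.
        type nat hook postrouting priority srcnat; policy accept;

        # Hypothetical outbound interface name.
        oifname "eth0" masquerade
    }

    chain route-output {
        # Route chain: only available at the output hook.
        type route hook output priority mangle; policy accept;

        # Changing the mark of a locally-generated packet here causes
        # another route lookup to be performed.
        tcp dport 443 meta mark set 0x100
    }
}

For the mark to influence routing, a matching policy routing rule (one that selects a routing table based on the packet mark) would also need to be configured outside of nftables.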

It should be noted that, as described in the Packet flow section, a packet stops being handled by the networking subsystem when it is either dropped or it traverses the entire processing flow and is either sent out to an interface or handled by an application or the kernel. As such, a verdict of drop is final for a packet, but one of accept is not: it is sufficient for one chain in one table to drop a packet for it to be discarded, but the packet must be accept-ed by all chains in all tables for it to continue its journey (i.e. an accept verdict only terminates the processing in a particular base chain, but does not influence the processing in any other base chains the packet will subsequently traverse).
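
The asymmetry between accept and drop can be illustrated with two base chains registered at the same hook. This is only a sketch: the file name, the table and chain names and the port number are hypothetical.

/etc/nftables/tables.d/example-verdicts.conf
#!/usr/sbin/nft -f

destroy table inet example-verdicts
table inet example-verdicts {
    chain input-early {
        # Evaluated first: the numerical priority is lower than filter (0).
        type filter hook input priority filter - 10; policy accept;

        # Terminates processing in this base chain only.
        tcp dport 8080 accept
    }

    chain input-late {
        type filter hook input priority filter; policy accept;

        # Still evaluated for packets accepted above; this verdict is final.
        tcp dport 8080 drop
    }
}

With this configuration, packets destined for TCP port 8080 would be expected to be dropped, despite the accept verdict issued in input-early.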

Rule composition

Rules are composed of expressions and statements, both of which are optional. Expressions are used to match packets, while statements dictate what actions should be taken. A rule without statements is valid and can be used for debugging purposes, as it will be reported by the rule tracing for any matched packets. For example, the following highlighted rule will match locally-generated IPv4 UDP packets without taking any actions (note the use of the ip protocol udp expression, as opposed to meta l4proto udp: this will match only IPv4 packets):

/etc/nftables/tables.d/test-firewall.conf
#!/usr/sbin/nft -f

destroy table inet test-firewall
table inet test-firewall {
    chain test-outbound {
        type filter hook output priority filter; policy accept;

        ip protocol udp
    }
}

Expressions within a rule are combined with a logical AND when evaluated: all of them must succeed for the rule’s statements to be executed. Combining expressions with a logical OR requires the use of multiple rules, sets, maps or intervals. In the following example, the first rule will match both IPv4 and IPv6 packets if both the transport protocol is UDP (meta l4proto udp) and the destination port is 53 (udp dport 53). The second rule will match packets if the network protocol is IPv4 (implied), the transport protocol is UDP (implied) and either:

  • the IPv4 destination address is 10.1.1.1 and the destination port is 53

  • the IPv4 destination address is 10.2.2.2 and the destination port is 80 or 443

The ip daddr . udp dport syntax is explained in the Concatenations section.

/etc/nftables/tables.d/test-firewall.conf
#!/usr/sbin/nft -f

destroy table inet test-firewall
table inet test-firewall {
    chain test-outbound {
        type filter hook output priority filter; policy accept;

        # Transport protocol is UDP and destination port is 53.
        meta l4proto udp udp dport 53

        # Network protocol is IPv4, transport protocol is UDP and the
        # combination of IPv4 destination address and UDP destination port
        # is one of the following:
        ip daddr . udp dport {
            10.1.1.1 . 53,
            10.2.2.2 . 80,
            10.2.2.2 . 443
        }
    }
}

A rule can contain zero or more statements. There are two types of statements: terminal and non-terminal. Terminal statements unconditionally terminate the rule’s evaluation and may also terminate the chain’s evaluation or entirely stop the packet’s processing. Non-terminal statements result in actions which either do not terminate the rule’s evaluation or only do so conditionally. The only limitation is that a rule may have at most one terminal statement, which must also be placed last. Most of the verdict statements are terminal statements, but there are also some non-verdict terminal statements (e.g. reject, which drops a packet and generates an ICMP or TCP reset response).
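
As an illustration, the following sketch of a rule combines two non-terminal statements (counter and log) with one terminal statement (drop), which is placed last. The file name, the table name, the choice of port and the log prefix are hypothetical.

/etc/nftables/tables.d/example-statement-order.conf
#!/usr/sbin/nft -f

destroy table inet example-statement-order
table inet example-statement-order {
    chain input {
        type filter hook input priority filter; policy accept;

        # counter and log are non-terminal, so evaluation continues with the
        # next statement; drop is terminal and must therefore come last.
        tcp dport 23 counter log prefix "telnet-attempt: " drop
    }
}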

Verdict statements

Verdict statements affect the control flow of rule evaluation, with most of them (apart from continue) being terminal statements. The continue statement is implied, if no other terminal statement is associated with a rule. The following is the list of verdict statements:

  • accept: terminates the processing of the packet in the current base chain, allowing the packet to continue its journey within Netfilter and the Linux networking subsystem. Other base chains registered at the current hook with a higher numerical priority value, as well as base chains registered at subsequent hooks, will still evaluate the packet and may still drop it. Using this statement in a regular chain called, directly or indirectly, from a base chain stops the processing of all subsequent rules, both in the current chain and in chains higher up the call stack.

  • drop: terminates the processing of the packet within the Linux networking subsystem with no further action. This statement is the basis of a firewall implementation. No further base chains are invoked.

  • queue: terminates the processing of the packet in the current base chain and passes the packet to userspace for further processing. The userspace must provide a verdict of accept or drop. This is explained in the nftables userspace queueing documentation.

  • continue: implied action if no other terminal statement is issued: the rules’ evaluation continues with the next rule in the current chain.

  • jump: continue processing in a new regular chain; upon completion, processing returns to the current chain, unless a processing-terminating statement (such as accept, drop, queue or reject) is issued in one of the invoked chains. From a procedural programming perspective, this is similar to invoking a subprocedure (pseudocode: call subprocedure()).

  • goto: continue processing in a new regular chain; upon completion, processing does not return to the current chain, but to the chain higher up in the call stack (if the current chain is a base chain, the policy action is taken instead). From a procedural programming perspective, this is similar to invoking and returning the result of a subprocedure (pseudocode: return subprocedure()).

  • return: stop processing in the current chain and resume with the next rule in the calling chain. In a base chain, this is equivalent to applying the chain’s policy.

The following example extends the previous firewall definition with the skeleton structure for two new functions in the highlighted lines, demonstrating some control flow functionality:

  • Setting the Netfilter packet mark for inbound packets to represent where the packet originated from, in order to allow subsequent rules to make decisions based on this criterion. We’re calling this the realm, but it should not be confused with iproute2 realms. For example, the rules below set the mark to the value 1 (via the symbolic variable MARK_REALM_LOCAL) if the packet was received on one of the loopback interfaces. Two new chains are introduced: early-inbound (a base chain) and mark-inbound-determine (a regular chain).

    • When a packet is received through a VRF interface (meta iifkind "vrf"), its processing in this chain is terminated via return. As the return statement is issued in a base chain, it is equivalent to applying the chain’s policy (accept in this instance).

    • If a packet comes in to this chain with a non-zero packet mark (meta mark != 0), a condition which can occur when functionality such as the GBP extension of VXLAN is in use, the packet is dropped completely.

    • The mark-inbound-determine regular chain is invoked via a jump mark-inbound-determine; this allows subsequent rules in the early-inbound chain to be evaluated.

    • In the mark-inbound-determine chain, if a packet is received on one of the interfaces defined in the IF_LOOPBACK symbolic variable (meta iif $IF_LOOPBACK), two statements are executed:

      • the packet mark is set to the MARK_REALM_LOCAL value, defined as 1 (meta mark set $MARK_REALM_LOCAL), a non-terminal statement;

      • the processing in the mark-inbound-determine chain is terminated via a return statement, with the packet continuing its processing in the caller chain (early-inbound).

  • In the firewall-input base chain, processing of multicast packets is delegated to the firewall-input-multicast regular chain. This allows multicast logic to be encapsulated in a separate chain, aiding maintainability. The base configuration accepts IPv4 IGMP packets, as they are needed for a standard multicast-ready network topology with multicast queriers (and possibly bridge multicast snooping, required for efficient layer-2 multicast forwarding). The IPv6-equivalent MLD rule is introduced in the Sets section. Separately, Multicast DNS (mDNS) packets, which are also used by the DNS Service Discovery (DNS-SD) protocol, are allowed through a separate rule (udp dport 5353 accept). mDNS is typically used in LANs for ad-hoc service discovery, such as for network printers and network shares. In Ubuntu, applications such as Avahi (generic service discovery) and cups-browsed (network printer discovery) make use of these protocols.

    • The goto statement ensures that the subsequent rules in firewall-input are not evaluated, even if the called chain executes a return statement or some packets are not matched by any rules; instead, the policy (drop) will apply in these instances.

    • The accept statement is necessary in the firewall-input-multicast chain to allow packets through. Once one of the conditions is reached (e.g. ip protocol igmp), the processing is finalised and no further rules in firewall-input-multicast or firewall-input are evaluated.

/etc/nftables.conf
#!/usr/sbin/nft -f

define IF_LOOPBACK = lo

define MARK_REALM_LOCAL = 1

# This empty definition is needed to allow the flush command to work if the
# table is not already defined.
table inet host-firewall; flush table inet host-firewall

# Note that the flush command does not destroy the table or the objects
# contained within, only clearing the rules within all of the chains. Use the
# following instead, if the object definitions need to be changed, chains
# completely destroyed or sets/maps cleared.
#destroy table inet host-firewall

table inet host-firewall {
    chain early-inbound {
        type filter hook prerouting priority raw; policy accept;

        # When VRF interfaces are in use, packets go through the prerouting hook
        # twice, once with the VRF interface set as input and another time with
        # actual interface set as input.
        meta iifkind "vrf" return

        # Do not allow inbound packets that have an externally-determined packet
        # mark (this is possible, for example, by using VXLAN with the GBP
        # extension).
        meta mark != 0 drop
        jump mark-inbound-determine
    }

    chain mark-inbound-determine {
        # Set the realm to LOCAL for packets received on the loopback interface.
        meta iif $IF_LOOPBACK meta mark set $MARK_REALM_LOCAL return
    }

    chain firewall-input {
        # Process packets destined for this host.
        type filter hook input priority filter;
        # Use a default-deny policy for packets.
        policy drop;

        # Allow traffic on the loopback interface(s).
        meta iif $IF_LOOPBACK accept

        # Process multicast packets. Upon returning, do not evaluate any more
        # rules and apply the policy verdict (drop).
        meta pkttype multicast goto firewall-input-multicast

        # Drop-in files can add rules here.
        include "/etc/nftables/input-rules.d/*.conf"
    }

    chain firewall-input-multicast {
        # Allow any IPv4 IGMP.
        ip protocol igmp accept

        # Allow inbound Multicast DNS packets.
        udp dport 5353 accept

        # If no prior action was taken, this will return to the calling chain
        # (firewall-input).
    }
}

include "/etc/nftables/tables.d/*.conf"

Other statements

nftables supports a large number of statements. These are documented in the Statements section of the manual page. While this document is not meant to exhaustively list all of them, some of the more commonly-used ones are:

  • reject statement: drops a packet, but also generates an appropriate ICMP or TCP reset response. For example, the rule udp dport 389 reject with icmpx admin-prohibited will match packets destined for the LDAP port (udp dport 389) and generate a network-protocol-appropriate admin-prohibited ICMP response (type 3 code 13 for IPv4 and type 1 code 1 for IPv6).

  • log statement: described in the Log statement section.

  • meta statements: allow changing meta information tracked by Netfilter for a particular packet, such as meta mark set 42 for setting the Netfilter packet mark to the constant value 42 or meta nftrace set 1 for enabling rule tracing.

  • nat statements: allow source and destination network address translation (NAT) to occur (including support for dynamic translation for transport protocol ports and the stateful processing of ICMP packets).

  • counter statements: support for counting packets and bytes matched by rules.

  • payload statements: allow changing arbitrary contents of the packets. For example, ip dscp set 46 sets the IPv4 DSCP field to 46 (EF - Expedited Forwarding).

  • set statement: allows dynamically adding elements to sets and maps. These are explained in the respective sections.

  • map statement: allows looking up elements in a map by an arbitrary key and returning the associated value for use as an argument to a different statement. This is an example of a non-terminal statement that can conditionally terminate the rule’s processing, if no element in the map matches the input key. It is explained in more detail in the Maps section.

  • vmap statement: allows dynamically determining the verdict for a rule based on an arbitrary key and is explained in more detail in the Verdict maps section.
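
A few of these statements are combined in the following sketch. The file name, the table and chain names and the choice of ports are hypothetical; the meta nfproto ipv4 match is added explicitly as an assumption, so that the IPv4 DSCP field is only modified for IPv4 packets in this dual-stack table.

/etc/nftables/tables.d/example-statements.conf
#!/usr/sbin/nft -f

destroy table inet example-statements
table inet example-statements {
    chain input {
        type filter hook input priority filter; policy accept;

        # meta statement: enable rule tracing for inbound SSH packets.
        tcp dport 22 meta nftrace set 1

        # counter and reject statements: count LDAP packets, then reject them
        # with an admin-prohibited ICMP / ICMPv6 response.
        udp dport 389 counter reject with icmpx admin-prohibited
    }

    chain output {
        type filter hook output priority filter; policy accept;

        # payload statement: set the DSCP field of outbound IPv4 SIP
        # signalling packets to 46 (EF).
        meta nfproto ipv4 udp dport 5060 ip dscp set 46
    }
}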

Expressions

The expressive power of nftables comes from its implementation of a bytecode-based virtual machine for the evaluation of expressions. An expression has an associated data type, which determines how operations are evaluated on the expression and how it can be combined with other expressions or used as arguments to a statement. For example, the meta mark expression has a 32-bit integer data type. The available data types are listed in the manual page’s Data Types section. Please note that the data types are a feature of the nft userspace utility, with the bytecode interpreted by the Linux kernel operating exclusively on raw bytes.

An expression’s data type can be displayed using the nft describe command, such as the following:

nft describe udp dport

The expressions generally follow the convention of a class followed by an attribute (e.g. udp dport, ip protocol or meta mark). These are documented in the manual page in the Primary Expressions and Payload Expressions sections.

Expressions can be combined with comparison operators to form relational expressions, which are used for matching packets. These are:

  • eq or ==: this is the implied comparison (udp dport 53 is equivalent to udp dport == 53). It can compare an arbitrary expression with a constant value or look the expression up in a set (udp dport == { 80, 443 } matches if the destination port is either 80 or 443).

  • ne or !=: this matches if an arbitrary expression is not equal to a constant value (e.g. udp dport != 53) or does not exist in a set (udp dport != { 80, 443 } matches if the destination port is neither 80 nor 443).

  • lt / <, gt / >, le / <= and ge / >=: these match if the comparison of an arbitrary expression is lower than, greater than, lower than or equal and greater than or equal, respectively, to a constant value (e.g. udp dport < 1024 matches privileged UDP ports).
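
The comparison operators above can be sketched as complete rules, as follows. The file name and the table and chain names are hypothetical, and the counter statement is only used to make each rule observable without affecting traffic.

/etc/nftables/tables.d/example-comparisons.conf
#!/usr/sbin/nft -f

destroy table inet example-comparisons
table inet example-comparisons {
    chain input {
        type filter hook input priority filter; policy accept;

        # Implied equality: same as "udp dport == 53".
        udp dport 53 counter

        # Set lookup: the destination port is either 80 or 443.
        tcp dport { 80, 443 } counter

        # Negated set lookup: the destination port is neither 80 nor 443.
        tcp dport != { 80, 443 } counter

        # Ordering comparison: privileged UDP ports.
        udp dport < 1024 counter
    }
}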

Expressions can also be combined with binary operators, such as:

  • and / &: bitwise AND

  • or / |: bitwise OR

  • xor / ^: bitwise exclusive-OR

  • lshift / <<: bitwise left shift

  • rshift / >>: bitwise right shift

The right-hand side of the binary operators must be a constant expression. For example, the following expression would match IPv4 packets for which the second most-significant byte of the destination IP address is smaller than 16:

(ip daddr >> 16) & 0xFF < 0x10

The same condition can be written as:

ip daddr & 0x00F00000 == 0

Or, in a rather less readable manner, as:

ip daddr & 0x00F00000 0

The equality and non-equality operators can also be used with intervals, matching if the expression’s value is (or, respectively, isn’t) within the closed interval. The following expression matches IPv4 packets for which the destination address has the form A.B.C.D, with B having a value between 10 and 20 (inclusive):

(ip daddr >> 16) & 0xFF == 10-20

IPv4 and IPv6 addresses also support prefix notation, with the following matching if the destination IPv4 address is not one of the RFC1918 private addresses:

ip daddr != { 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16 }

Not all of the operations are supported by all data types. For example, IPv6 addresses do not support bit shifting, and the and, or and xor operators require full IPv6 addresses on the right-hand side, as do the comparison operators.

Combining expression operators with statements that support expressions is also possible. For example, the following rule sets the Netfilter packet mark to the least-significant 16 bits of the IPv4 source address, combined with bit 16 set, but only if the IPv4 source address is within the 10.0.0.0/16 prefix.

ip saddr 10.0.0.0/16 meta mark set (ip saddr & 0xFFFF) | 0x10000

Bitmasks support specific operations that simplify management, especially through the use of symbolic names associated to individual bits:

  • Without an operator, a relational expression matches if any of the specified bits are set. The expression tcp flags syn,ack matches if packets have at least one of the SYN or ACK bits set. This is equivalent to tcp flags & (syn|ack) != 0.

  • The / operator can be used to specify a mask, in addition to the set of bits that must be set. The expression tcp flags syn / syn,ack matches if, out of the SYN and ACK bits, only the SYN bit is set (no other bits matter). This is equivalent to tcp flags & (syn|ack) == syn.

  • The equality (eq / ==) and non-equality (ne / !=) operators compare an exact bitmask value. The expression tcp flags == syn,ack matches if and only if both the SYN and ACK bits are set and all other bits are cleared. This is equivalent to tcp flags == (syn|ack).
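
The bitmask forms above can be sketched as rules, as follows. The file name and the table and chain names are hypothetical, and the choice of flags is an illustration rather than a recommended policy.

/etc/nftables/tables.d/example-tcp-flags.conf
#!/usr/sbin/nft -f

destroy table inet example-tcp-flags
table inet example-tcp-flags {
    chain input {
        type filter hook input priority filter; policy accept;

        # Value and mask: out of FIN, SYN, RST and ACK, only SYN is set
        # (typical for the first packet of a TCP connection).
        tcp flags syn / fin,syn,rst,ack counter

        # The same match, written in the explicit form.
        tcp flags & (fin|syn|rst|ack) == syn counter

        # Both SYN and FIN set, which is not expected in normal traffic.
        tcp flags & (fin|syn) == (fin|syn) drop
    }
}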

Putting these concepts together allows the creation of a framework for using the Netfilter mark as a bitfield that facilitates generic firewall rules. As the packet mark can be determined from external sources (e.g. VXLAN with the GBP extension) and is copied when packets are decapsulated (e.g. IPsec), special processing is required: one bit (a flag) is used to determine if the packet mark can be trusted as having been validated locally.

The following nftables configuration contains two changes from the previous example in the highlighted lines:

  • An extension to the early-inbound chain, with the two regular chains that it invokes (mark-inbound-determine and mark-inbound-external-validate). The convention used for the format of the Netfilter mark is explained in a comment at the top of the file, with symbolic variables defined to simplify the bitfield operations.

  • Two new rules in the firewall-input chain that use the ct state bitmask expression:

    • allow packets marked as either established or related by the conntrack module through (ct state established,related accept);

    • drop packets marked as invalid by the conntrack module (ct state invalid drop).

/etc/nftables.conf
#!/usr/sbin/nft -f

define IF_LOOPBACK = lo

# The packet mark is interpreted as follows (big endian):
#    3                   2                   1                   0
#  1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
# +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
# |V| Unused                                          | Realm (6) |
# +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
#
# V - validated flag (1 - packet mark was validated locally; 0 - it wasn't)
# Realm - class of hosts the packet originated from (64 possible values)

define MARK_MASK_REALM  = 0x0000003f

define MARK_REALM_UNKNOWN   = 0  # Other provenance of packet
define MARK_REALM_LOCAL     = 1  # Packet from local host
define MARK_REALM_VIRT      = 2  # Packet from local VMs / containers
define MARK_REALM_LAN       = 3  # Packet from internal network

define MARK_FLAG_VALIDATED  = 0x80000000

# This empty definition is needed to allow the flush command to work if the
# table is not already defined.
table inet host-firewall; flush table inet host-firewall

# Note that the flush command does not destroy the table or the objects
# contained within, only clearing the rules within all of the chains. Use the
# following instead, if the object definitions need to be changed, chains
# completely destroyed or sets/maps cleared.
#destroy table inet host-firewall

table inet host-firewall {
    chain early-inbound {
        type filter hook prerouting priority raw; policy accept;

        # When VRF interfaces are in use, packets go through the prerouting hook
        # twice, once with the VRF interface set as input and another time with
        # actual interface set as input.
        meta iifkind "vrf" return

        # If the mark was previously set with the validated flag set (e.g.
        # decapsulated packet), reset it. This also resets the mark for remote
        # packets that automatically set the mark and attempt to forge the
        # validated flag (e.g. VXLAN with the GBP extension).
        (meta mark & $MARK_FLAG_VALIDATED) != 0 meta mark set 0
        meta mark != 0 jump mark-inbound-external-validate
        meta mark == 0 jump mark-inbound-determine
        meta mark set (meta mark | $MARK_FLAG_VALIDATED)
    }

    chain mark-inbound-external-validate {
        # Do not allow externally-determined marks to have the realm set to
        # LOCAL or VIRT.
        meta mark & $MARK_MASK_REALM == {
            $MARK_REALM_LOCAL,
            $MARK_REALM_VIRT,
        } drop
    }

    chain mark-inbound-determine {
        # Set the realm to LOCAL for packets received on the loopback interface.
        meta iif $IF_LOOPBACK meta mark set $MARK_REALM_LOCAL return

        # Set the realm to VIRT for packets received on bridge interfaces.
        meta iifkind "bridge" meta mark set $MARK_REALM_VIRT return

        # Set the realm to LAN for link-local and private addresses.
        ip saddr {
            169.254.0.0/16,
            10.0.0.0/8,
            172.16.0.0/12,
            192.168.0.0/16,
        } meta mark set $MARK_REALM_LAN return
        ip6 saddr {
            fe80::/64,
            fc00::/7,
        } meta mark set $MARK_REALM_LAN return
    }

    chain firewall-input {
        # Process packets destined for this host.
        type filter hook input priority filter;
        # Use a default-deny policy for packets.
        policy drop;

        # Use conntrack state to allow packets belonging to already established
        # flows, while dropping packets which conntrack considers invalid.
        ct state established,related accept
        ct state invalid drop

        # Allow traffic on the loopback interface(s).
        meta iif $IF_LOOPBACK accept

        # Process multicast packets. Upon returning, do not evaluate any more
        # rules and apply the policy verdict (drop).
        meta pkttype multicast goto firewall-input-multicast

        # Drop-in files can add rules here.
        include "/etc/nftables/input-rules.d/*.conf"
    }

    chain firewall-input-multicast {
        # Allow any IPv4 IGMP.
        ip protocol igmp accept

        # Allow inbound Multicast DNS packets.
        udp dport 5353 accept

        # If no prior action was taken, this will return to the calling chain
        # (firewall-input).
    }
}

include "/etc/nftables/tables.d/*.conf"

Concatenations

Concatenations allow combining expressions into compound expressions that have a complex type, by using the . operator. These are particularly powerful when used in combination with sets and maps to define keys based on multiple attributes of a packet. For example, the following expression combines three different fields - the Netfilter mark, the transport protocol (meta l4proto matches irrespective of the encapsulating network protocol, IPv4 or IPv6), and the transport protocol destination port (th dport matches irrespective of the transport protocol, such as TCP, UDP or SCTP):

meta mark . meta l4proto . th dport

The use of binary operators for extracting information based on the Netfilter mark convention established earlier, along with anonymous sets for specifying alternative values and intervals, results in powerful matching expressions:

(meta mark & $MARK_MASK_REALM) . meta l4proto . th dport {
    # Web service allowed from anywhere
    0-63                . tcp   . 80,
    # SSH allowed from local machine and local VMs
    $MARK_REALM_LOCAL   . tcp   . 22,
    $MARK_REALM_VIRT    . tcp   . 22,
    # SIP signalling allowed from LAN over any transport
    $MARK_REALM_LAN     . sctp  . 5060-5061,
    $MARK_REALM_LAN     . tcp   . 5060-5061,
    $MARK_REALM_LAN     . udp   . 5060-5061,
} accept

Sets

Sets are a generic data structure in nftables that act as a container for values with support for efficient lookup, addition and removal operations. They are similar to the ipset functionality available in xtables, but support arbitrary types via the use of Concatenations. The implementation uses hashtables and red-black trees. Sets come in two types:

  • Named sets: defined within tables and with an associated name, these allow both external applications, as well as nftables rules to manage the elements.

  • Anonymous sets: defined inline within rules, these allow the expression of the logical OR operator. The expression tcp dport { 80, 443 } matches if the TCP destination port is either 80 or 443.

Named sets, like other objects such as tables or chains, can be defined multiple times with an additive effect. This allows the sets’ elements to be added in multiple places, such as by using include directives with wildcards for drop-in files. Unlike anonymous sets, various configuration options can be added as part of the definition to control the behavior of the sets. These are all documented in the Sets section of the manual page, but some of the more useful ones are:

  • type or typeof: these are necessary for a named set and define the format of the elements. type requires the use of data type names, possibly with concatenations, while typeof receives an expression that is used to derive the elements’ type. The typeof configuration is particularly useful for expressions that only have a variable-length integer data type associated and cannot be expressed with type (e.g. typeof meta cgroup).

  • flags interval: allows the use of intervals in elements. An anonymous set that uses intervals effectively activates this flag, as well.

  • flags dynamic: allows the addition of elements from rules, using the set statement.

  • flags timeout: allows elements to be automatically removed after an interval has elapsed since the element was (last) added to the set.

  • timeout: expression that defines the default interval after which an element will be removed from the set. For example: timeout 5m for a 5-minute interval.

  • size: defines the maximum number of elements that the set can hold.
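
As a sketch of how these options fit together, the following defines a named set that is populated from a rule and whose elements expire automatically. The file name, the table, chain and set names, the size and the choice of port are hypothetical.

/etc/nftables/tables.d/example-sets.conf
#!/usr/sbin/nft -f

destroy table inet example-sets
table inet example-sets {
    set recent-ssh-clients {
        # Derive the element type from an expression.
        typeof ip saddr
        # Elements are added from rules and removed after a timeout.
        flags dynamic, timeout
        # Default lifetime of an element.
        timeout 5m
        # Upper bound on the number of elements.
        size 1024
    }

    chain input {
        type filter hook input priority filter; policy accept;

        # Record the IPv4 source address of new inbound SSH connections.
        tcp dport 22 ct state new add @recent-ssh-clients { ip saddr }
    }
}

Once loaded, the set's current elements can be inspected with nft list set inet example-sets recent-ssh-clients.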

The following highlighted lines extend the example firewall configuration with:

  • A named set (input-services) for services allowed to the local host. These are defined based on the Netfilter mark (only the realm bits), the transport protocol and the transport destination port. An include directive facilitates the definition of additional services in drop-in files.

  • A rule to reference the new named set in the firewall-input base chain.

  • A new rule in the firewall-input-multicast regular chain that allows IPv6 Multicast Listener Discovery (MLD) and Neighbour Discovery (ND) ICMPv6 packets through, by using an anonymous set. These are generally required for the correct functioning of IPv6 in local networks.

/etc/nftables.conf
  1#!/usr/sbin/nft -f
  2
  3define IF_LOOPBACK = lo
  4
  5# The packet mark is interpreted as follows (big endian):
  6#    3                   2                   1                   0
  7#  1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
  8# +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  9# |V| Unused                                          | Realm (6) |
 10# +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 11#
 12# V - validated flag (1 - packet mark was validated locally; 0 - it wasn't)
 13# Realm - class of hosts the packet originated from (64 possible values)
 14
 15define MARK_MASK_REALM  = 0x0000003f
 16
 17define MARK_REALM_UNKNOWN   = 0  # Other provenance of packet
 18define MARK_REALM_LOCAL     = 1  # Packet from local host
 19define MARK_REALM_VIRT      = 2  # Packet from local VMs / containers
 20define MARK_REALM_LAN       = 3  # Packet from internal network
 21
 22define MARK_FLAG_VALIDATED  = 0x80000000
 23
 24# Note that the flush command does not destroy the table or the objects
 25# contained within, only clearing the rules within all of the chains. The
 26# destroy command is used in order to clear the sets' elements.
 27destroy table inet host-firewall
 28
 29table inet host-firewall {
 30    set input-services {
 31        type mark . inet_proto . inet_service
 32        flags interval
 33        elements = {
 34            # Web service allowed from anywhere
 35            0-63                . tcp   . 80,
 36            # SSH allowed from local machine and local VMs
 37            $MARK_REALM_LOCAL   . tcp   . 22,
 38            $MARK_REALM_VIRT    . tcp   . 22,
 39            # SIP signalling allowed from LAN over any transport
 40            $MARK_REALM_LAN     . sctp  . 5060-5061,
 41            $MARK_REALM_LAN     . tcp   . 5060-5061,
 42            $MARK_REALM_LAN     . udp   . 5060-5061,
 43        }
 44    }
 45    include "/etc/nftables/input-services.d/*.conf"
 46
 47    chain early-inbound {
 48        type filter hook prerouting priority raw; policy accept;
 49
 50        # When VRF interfaces are in use, packets go through the prerouting hook
 51        # twice, once with the VRF interface set as input and another time with
 52        # actual interface set as input.
 53        meta iifkind "vrf" return
 54
 55        # If the mark was previously set with the validated flag set (e.g.
 56        # decapsulated packet), reset it. This also resets the mark for remote
 57        # packets that automatically set the mark and attempt to forge the
 58        # validated flag (e.g. VXLAN with the GBP extension).
 59        (meta mark & $MARK_FLAG_VALIDATED) != 0 meta mark set 0
 60        meta mark != 0 jump mark-inbound-external-validate
 61        meta mark == 0 jump mark-inbound-determine
 62        meta mark set (meta mark | $MARK_FLAG_VALIDATED)
 63    }
 64
 65    chain mark-inbound-external-validate {
 66        # Do not allow externally-determined marks to have the realm set to
 67        # LOCAL or VIRT.
 68        meta mark & $MARK_MASK_REALM == {
 69            $MARK_REALM_LOCAL,
 70            $MARK_REALM_VIRT,
 71        } drop
 72    }
 73
 74    chain mark-inbound-determine {
 75        # Set the realm to LOCAL for packets received on the loopback interface.
 76        meta iif $IF_LOOPBACK meta mark set $MARK_REALM_LOCAL return
 77
 78        # Set the realm to VIRT for packets received on bridge interfaces.
 79        meta iifkind "bridge" meta mark set $MARK_REALM_VIRT return
 80
 81        # Set the realm to LAN for link-local and private addresses.
 82        ip saddr {
 83            169.254.0.0/16,
 84            10.0.0.0/8,
 85            172.16.0.0/12,
 86            192.168.0.0/16,
 87        } meta mark set $MARK_REALM_LAN return
 88        ip6 saddr {
 89            fe80::/64,
 90            fc00::/7,
 91        } meta mark set $MARK_REALM_LAN return
 92    }
 93
 94    chain firewall-input {
 95        # Process packets destined for this host.
 96        type filter hook input priority filter;
 97        # Use a default-deny policy for packets.
 98        policy drop;
 99
100        # Use conntrack state to allow packets belonging to already established
101        # flows, while dropping packets which conntrack considers invalid.
102        ct state established,related accept
103        ct state invalid drop
104
105        # Allow traffic on the loopback interface(s).
106        meta iif $IF_LOOPBACK accept
107
108        # Process multicast packets. Upon returning, do not evaluate any more
109        # rules and apply the policy verdict (drop).
110        meta pkttype multicast goto firewall-input-multicast
111
112        # Allow services based on the origin realm, the transport protocol and
113        # the destination port.
114        (meta mark & $MARK_MASK_REALM) . meta l4proto . th dport @input-services accept
115
116        # Drop-in files can add rules here.
117        include "/etc/nftables/input-rules.d/*.conf"
118    }
119
120    chain firewall-input-multicast {
121        # Allow any IPv4 IGMP.
122        ip protocol igmp accept
123
124        # Allow IPv6 MLD (for multicast group management) and neighbour
125        # discovery (note that unicast packets would not be handled here).
126        icmpv6 type {
127            mld-listener-query,
128            mld-listener-report,
129            mld-listener-reduction,
130            mld2-listener-report,
131            nd-router-advert,
132            nd-neighbor-solicit,
133            nd-neighbor-advert,
134        } accept
135
136        # Allow inbound Multicast DNS packets.
137        udp dport 5353 accept
138
139        # If no prior action was taken, this will return to the calling chain
140        # (firewall-input).
141    }
142}
143
144include "/etc/nftables/tables.d/*.conf"

A drawback of the drop-in file configuration is that each file will have to redefine the set with the exact same settings:

/etc/nftables/input-services.d/ldap.conf
#!/usr/sbin/nft -f

set input-services {
    type mark . inet_proto . inet_service
    flags interval
    elements = {
        # LDAP server access allowed from local machine and LAN
        $MARK_REALM_LOCAL   . tcp   . 389,
        $MARK_REALM_LAN     . tcp   . 389,
    }
}

Element management in rules

A named set’s elements can be added dynamically from rules by using the add or update statements. Their use requires the dynamic flag on the set definition. The major difference between the two statements is that the add statement conditionally terminates the rule early if the element already exists in the set, while the update statement updates any metadata associated with the element (such as resetting the timeout or modifying any associated stateful objects).

To demonstrate this functionality, the following example implements a trivial rate limit for new connections: a new connection is only accepted if the concatenation of source address, transport protocol and destination port can be added to a set (i.e. if the element did not previously exist). Elements expire after 2 minutes, so this effectively applies a limit of one connection every 2 minutes for each such concatenation. Note the size limit of the set, which is necessary to bound the amount of memory used. This is only meant to illustrate the add statement; limits, explained in the Stateful objects section, are the preferred means of implementing rate limiting.

/etc/nftables/tables.d/limits.conf
destroy table ip limits
table ip limits {
    set connections {
        type ipv4_addr . inet_proto . inet_service
        flags dynamic, timeout
        timeout 2m
        size 65536
    }
    chain limits-inbound {
        # This must execute after conntrack lookup (priority -200).
        type filter hook prerouting priority filter; policy drop;

        # Only apply limits to packets that establish new flows.
        ct state != new accept

        # Accept packets that can be added to the set.
        add @connections { ip saddr . meta l4proto . th dport } accept

        # Anything that reaches here is dropped by policy.
    }
}

Similarly, the update statement can be used to add elements to a set, but will not fail if the element already exists. In the following example, the timeout is reset, which allows the set to track any IPv4 /24 prefix that initiated a new flow in the last 10 minutes.

/etc/nftables/tables.d/flow-track.conf
destroy table ip flow-track
table ip flow-track {
    set connections {
        type ipv4_addr
        flags dynamic, timeout
        timeout 10m
        size 65536
    }
    chain track-inbound {
        # This must execute after conntrack lookup (priority -200).
        type filter hook prerouting priority filter; policy accept;

        # Only bother with packets that establish new flows.
        ct state != new accept

        # Add /24 prefix to the connections set.
        ct state new update @connections { ip saddr & 255.255.255.0 }
    }
}
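
The masking in the last rule decides which set element a source address maps to. As a rough illustration only (plain shell arithmetic rather than an nft feature; the mask_ipv4 helper name is hypothetical), the same bitwise AND can be reproduced outside the firewall to check which key a given address would produce:

```shell
# Hypothetical helper: apply a dotted-quad netmask to an IPv4 address,
# mirroring the "ip saddr & 255.255.255.0" expression used in the rule.
mask_ipv4() {
    IFS=. read -r a1 a2 a3 a4 <<EOF
$1
EOF
    IFS=. read -r m1 m2 m3 m4 <<EOF
$2
EOF
    echo "$((a1 & m1)).$((a2 & m2)).$((a3 & m3)).$((a4 & m4))"
}

# Two hosts in the same /24 produce the same set element.
mask_ipv4 192.0.2.77 255.255.255.0
mask_ipv4 192.0.2.200 255.255.255.0
```

Both calls print 192.0.2.0, which is why a single element in the connections set covers every host in that /24 prefix.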

Element management in userspace

Named set elements can be inspected and managed from userspace, too. This is implemented over the standard AF_NETLINK communication for nftables, with the nft utility offering convenient commands, as described in the manual page’s Sets and Elements sections. The following are example commands that apply to the previously-defined connections set from the ip address family flow-track table.

Listing all elements in the set:

sudo nft list set ip flow-track connections

Removing all elements from the same set:

sudo nft flush set ip flow-track connections

Resetting any stateful objects associated with the elements in the set (such as counters or limits, explained in the Stateful objects section):

sudo nft reset set ip flow-track connections

Retrieving elements from a set:

sudo nft get element ip flow-track connections '{ 127.0.0.0 }'

Adding an element to a set (alternatively, the similar create command succeeds only if the element does not exist):

sudo nft add element ip flow-track connections '{ 10.0.0.0 }'

Deleting an element from a set (alternatively, the similar destroy command succeeds even if the element does not exist):

sudo nft delete element ip flow-track connections '{ 10.0.0.0 }'
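
The create and destroy variants mentioned above, together with a per-element timeout, might be used as sketched below. The addresses are arbitrary examples, the per-element timeout relies on the timeout flag of the connections set, and the destroy command is assumed to be available in the installed nft and kernel versions.

```shell
# Fails if 10.0.1.0 is already in the set (add would succeed regardless).
sudo nft create element ip flow-track connections '{ 10.0.1.0 }'

# Add an element with a timeout that overrides the set's 10m default.
sudo nft add element ip flow-track connections '{ 10.0.2.0 timeout 30m }'

# Succeeds whether or not the element exists (delete fails if it is absent).
sudo nft destroy element ip flow-track connections '{ 10.0.3.0 }'
```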

Maps

Maps are nftables data structures that associate keys to values, a form of associative arrays or dictionaries. Maps are similar to sets: in fact, sets are implemented as maps, with elements being keys without associated values. As such, map usage is very similar to that of sets, including:

  • anonymous maps and named maps;

  • similar configuration settings for named maps;

  • ability to manage elements from other applications or rules (the same statements and commands are used - see the Sets section for more details).

Maps support a lookup operation, the map statement, that returns the value associated with a lookup key. The returned value can then be used as an expression with the same type as the value type of the map. If a corresponding key is not found, the statement terminates rule evaluation early and no further statements are evaluated.
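
As a minimal sketch of the map statement with an anonymous map (the choice of prefixes is illustrative, and the rule is assumed to live in a chain of the host-firewall table where the MARK_REALM_LAN definition is in scope), the following rule sets the packet mark from the source address and then returns. If the source address matches no key, the rule ends early and evaluation continues with the next rule.

```nft
# Anonymous map: the key/value pairs are written inline in the rule.
meta mark set ip saddr map {
    10.0.0.0/8      : $MARK_REALM_LAN,
    192.168.0.0/16  : $MARK_REALM_LAN,
} return
```

An anonymous map cannot be changed without replacing the rule, so a named map is the better fit when elements are expected to change at runtime.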

The highlighted lines in the following extension of the example make use of maps from IPv4 and IPv6 prefixes to Netfilter marks representing the origin realm, and set the Netfilter mark using an extensible rule: new elements can be added to map other addresses to different Netfilter mark values. In addition to the two new maps (ip4-known-addresses and ip6-known-addresses), changes are made to the mark-inbound-determine regular chain. The rule meta mark set ip saddr map @ip4-known-addresses return can be broken down as:

  1. Form the key for the map lookup: ip saddr. This implies that the network protocol must be IPv4. For any other packets, the rule is terminated early and evaluation continues with the next rule in the chain.

  2. Look up the key in the @ip4-known-addresses map: ip saddr map @ip4-known-addresses. The value type of the map is a Netfilter mark. If no key is found, the rule is terminated early and evaluation continues with the next rule in the chain.

  3. Set the packet’s Netfilter mark to the value returned by the map lookup: meta mark set ip saddr map @ip4-known-addresses.

  4. Return from the current chain: return. This is only executed if neither the ip saddr expression nor the map statement terminated evaluation of the rule early, which happens when the packet is not IPv4 or the key cannot be found.

/etc/nftables.conf
#!/usr/sbin/nft -f

define IF_LOOPBACK = lo

# The packet mark is interpreted as follows (big endian):
#    3                   2                   1                   0
#  1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
# +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
# |V| Unused                                          | Realm (6) |
# +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
#
# V - validated flag (1 - packet mark was validated locally; 0 - it wasn't)
# Realm - class of hosts the packet originated from (64 possible values)

define MARK_MASK_REALM  = 0x0000003f

define MARK_REALM_UNKNOWN   = 0  # Other provenance of packet
define MARK_REALM_LOCAL     = 1  # Packet from local host
define MARK_REALM_VIRT      = 2  # Packet from local VMs / containers
define MARK_REALM_LAN       = 3  # Packet from internal network

define MARK_FLAG_VALIDATED  = 0x80000000

# Note that the flush command does not destroy the table or the objects
# contained within, only clearing the rules within all of the chains. The
# destroy command is used in order to clear the sets' elements.
destroy table inet host-firewall

table inet host-firewall {
    set input-services {
        type mark . inet_proto . inet_service
        flags interval
        elements = {
            # Web service allowed from anywhere
            0-63                . tcp   . 80,
            # SSH allowed from local machine and local VMs
            $MARK_REALM_LOCAL   . tcp   . 22,
            $MARK_REALM_VIRT    . tcp   . 22,
            # SIP signalling allowed from LAN over any transport
            $MARK_REALM_LAN     . sctp  . 5060-5061,
            $MARK_REALM_LAN     . tcp   . 5060-5061,
            $MARK_REALM_LAN     . udp   . 5060-5061,
        }
    }
    include "/etc/nftables/input-services.d/*.conf"

    map ip4-known-addresses {
        type ipv4_addr : mark
        flags interval
        elements = {
            # link-local addresses
            169.254.0.0/16  : $MARK_REALM_LAN,
            # RFC1918 private addresses
            10.0.0.0/8      : $MARK_REALM_LAN,
            172.16.0.0/12   : $MARK_REALM_LAN,
            192.168.0.0/16  : $MARK_REALM_LAN,
        }
    }

    map ip6-known-addresses {
        type ipv6_addr : mark
        flags interval
        elements = {
            # link-local addresses
            fe80::/64   : $MARK_REALM_LAN,
            # RFC4193 local addresses
            fc00::/7    : $MARK_REALM_LAN,
        }
    }

    chain early-inbound {
        type filter hook prerouting priority raw; policy accept;

        # When VRF interfaces are in use, packets go through the prerouting hook
        # twice, once with the VRF interface set as input and another time with
        # actual interface set as input.
        meta iifkind "vrf" return

        # If the mark was previously set with the validated flag set (e.g.
        # decapsulated packet), reset it. This also resets the mark for remote
        # packets that automatically set the mark and attempt to forge the
        # validated flag (e.g. VXLAN with the GBP extension).
        (meta mark & $MARK_FLAG_VALIDATED) != 0 meta mark set 0
        meta mark != 0 jump mark-inbound-external-validate
        meta mark == 0 jump mark-inbound-determine
        meta mark set (meta mark | $MARK_FLAG_VALIDATED)
    }

    chain mark-inbound-external-validate {
        # Do not allow externally-determined marks to have the realm set to
        # LOCAL or VIRT.
        meta mark & $MARK_MASK_REALM == {
            $MARK_REALM_LOCAL,
            $MARK_REALM_VIRT,
        } drop
    }

    chain mark-inbound-determine {
        # Set the realm to LOCAL for packets received on the loopback interface.
        meta iif $IF_LOOPBACK meta mark set $MARK_REALM_LOCAL return

        # Set the realm to VIRT for packets received on bridge interfaces.
        meta iifkind "bridge" meta mark set $MARK_REALM_VIRT return

        # Set the realm for known addresses.
        meta mark set ip saddr map @ip4-known-addresses return
        meta mark set ip6 saddr map @ip6-known-addresses return
    }

    chain firewall-input {
        # Process packets destined for this host.
        type filter hook input priority filter;
        # Use a default-deny policy for packets.
        policy drop;

        # Use conntrack state to allow packets belonging to already established
        # flows, while dropping packets which conntrack considers invalid.
        ct state established,related accept
        ct state invalid drop

        # Allow traffic on the loopback interface(s).
        meta iif $IF_LOOPBACK accept

        # Process multicast packets. Upon returning, do not evaluate any more
        # rules and apply the policy verdict (drop).
        meta pkttype multicast goto firewall-input-multicast

        # Allow services based on the origin realm, the transport protocol and
        # the destination port.
        (meta mark & $MARK_MASK_REALM) . meta l4proto . th dport @input-services accept

        # Drop-in files can add rules here.
        include "/etc/nftables/input-rules.d/*.conf"
    }

    chain firewall-input-multicast {
        # Allow any IPv4 IGMP.
        ip protocol igmp accept

        # Allow IPv6 MLD (for multicast group management) and neighbour
        # discovery (note that unicast packets would not be handled here).
        icmpv6 type {
            mld-listener-query,
            mld-listener-report,
            mld-listener-reduction,
            mld2-listener-report,
            nd-router-advert,
            nd-neighbor-solicit,
            nd-neighbor-advert,
        } accept

        # Allow inbound Multicast DNS packets.
        udp dport 5353 accept

        # If no prior action was taken, this will return to the calling chain
        # (firewall-input).
    }
}

include "/etc/nftables/tables.d/*.conf"
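
Because the maps are named, they can also be extended from userspace without touching the rules. The commands below are a sketch under assumptions: the 100.64.0.0/10 prefix is an arbitrary example, and the literal value 3 corresponds to MARK_REALM_LAN in the definitions above (the define names are not available on the command line).

```shell
# Map the 100.64.0.0/10 prefix to the LAN realm (mark value 3).
sudo nft add element inet host-firewall ip4-known-addresses '{ 100.64.0.0/10 : 3 }'

# Inspect the resulting map contents.
sudo nft list map inet host-firewall ip4-known-addresses

# Remove the mapping again.
sudo nft delete element inet host-firewall ip4-known-addresses '{ 100.64.0.0/10 : 3 }'
```

Elements added this way are lost when the configuration file is reloaded, since the file destroys and recreates the table.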

Verdict maps

Verdict maps are similar to maps, with the primary difference being that the values associated with keys are verdict statements. This allows them to be used with the vmap statement, which looks up a key in the map and, if the key is found, executes the associated verdict statement; if the key is not found, the next rule in the chain is evaluated (an implied continue statement). The vmap statement is terminal (i.e. it must be the last statement in a rule).

This functionality allows efficient branching decisions to be made. For example, the above rules in the early-inbound chain that either validate an external Netfilter mark or determine one locally if it is not already set, can be rewritten with a vmap statement. It should be noted that the two rules might still be more efficient than the use of a red-black tree for this simple branch.

Rule extract from ‘table inet host-firewall’ ‘chain early-inbound’.
# Instead of the following two rules...
#meta mark != 0 jump mark-inbound-external-validate
#meta mark == 0 jump mark-inbound-determine

# ... use a vmap statement:
meta mark vmap {
    0:              jump mark-inbound-determine,
    1-0xFFFFFFFF:   jump mark-inbound-external-validate,
}
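
The same branch could also be expressed with a named verdict map, which would allow the dispatch table to be changed without replacing the rule. This is only a sketch: the map name mark-dispatch is hypothetical, and both target chains are assumed to be declared in the same table.

```nft
# Hypothetical named verdict map, declared inside 'table inet host-firewall'.
map mark-dispatch {
    type mark : verdict
    flags interval
    elements = {
        0               : jump mark-inbound-determine,
        1-0xffffffff    : jump mark-inbound-external-validate,
    }
}

# Rule in 'chain early-inbound', used in place of the anonymous vmap.
meta mark vmap @mark-dispatch
```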

Stateful objects

Stateful objects allow the tracking of information across unrelated packets, in order to implement functionality such as:

  • byte and packet counters: count the number of packets or the total size of packets that pass through a rule;

  • byte quotas: count the total size of packets that pass through a rule and execute statements when the size is either below or above a threshold (e.g. allow only a particular amount of data to be transferred across different flows);

  • limits: a packet count or packet size token bucket rate limiter that allows statements to be executed whenever the rate of packets or data transferred is below or above a certain threshold;

  • connection limits: integrated with Netfilter’s conntrack, allows statements to be executed when the number of matching flows is above or below a certain threshold.

Stateful objects come in two types, dictating whether the state is shared across different contexts:

  • Named objects are associated with a particular table and can be referenced by more than one rule. As such, the same data can be interrogated or modified from multiple places (e.g. the same counter incremented by more than one rule). These can also be queried or managed from userspace.

  • Anonymous objects do not have an associated name and are bound to a single context: a particular rule or a set’s key. In particular, connection limit objects, as explained in the nftables documentation, can only be anonymous objects.

The creation of named stateful objects follows the same convention as for all other objects. For example, the following file creates a counter which is referenced once the end of the limits-inbound chain is reached:

/etc/nftables/tables.d/limits.conf
destroy table inet limits
table inet limits {
    counter dropped-flows {
    }
    chain limits-inbound {
        # This must execute after conntrack lookup (priority -200).
        type filter hook prerouting priority filter; policy accept;

        # Only apply limits to packets that establish new flows.
        ct state != new accept

        # Anything that reaches here is dropped.
        counter name dropped-flows drop
    }
}

Retrieving the counter data:

sudo nft list counter inet limits dropped-flows

Resetting the counter data:

sudo nft reset counter inet limits dropped-flows
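
Other named object types follow the same declare-then-reference pattern as the counter. As a sketch only, a named limit could be declared next to the counter and referenced by name from a rule; the object name new-flow-rate and the rate values are assumptions chosen for illustration.

```nft
# Inside 'table inet limits': a hypothetical named limit, declared next
# to the dropped-flows counter.
limit new-flow-rate {
    rate over 10/minute burst 100 packets
}

# Inside 'chain limits-inbound': reference both named objects by name.
# Packets exceeding the rate are counted and dropped.
limit name new-flow-rate counter name dropped-flows drop
```

Since the limit is a named object, several rules referencing it would share one token bucket.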

Anonymous objects can be associated with rules. The highlighted lines in the following extension to the previous example add some functionality:

  • For HTTP traffic (selected using tcp dport { 80, 443 }), use an anonymous connection limit (ct count over 1000) which conditionally terminates the rule evaluation if the threshold isn’t reached, stopping the subsequent statements in the rule from being executed (counter name dropped-flows and drop). Any conntrack flows which match the selection criteria are kept track of by the connection limit; as soon as a flow stops being tracked by conntrack, it is removed from this connection limit.

  • For non-HTTP traffic, similarly use an anonymous connection limit to drop new flow-initiating packets once the threshold (500 flows) is reached.

  • For HTTP traffic, use an anonymous limit (limit rate over 20/minute burst 500 packets) which conditionally terminates rule evaluation if the threshold isn’t reached, stopping the subsequent statements in the rule from being executed (counter name dropped-flows and drop). A token bucket rate limiter is updated for each packet that matches the rule.

  • For non-HTTP traffic, similarly use an anonymous limit to drop new flow-initiating packets once the token bucket’s threshold is reached.

Note that new flow-initiating packets are dropped if either of the criteria is met: active number of conntrack flows or rate of creation of new conntrack flows.

/etc/nftables/tables.d/limits.conf
destroy table inet limits
table inet limits {
    counter dropped-flows {
    }

    chain limits-inbound {
        # This must execute after conntrack lookup (priority -200).
        type filter hook prerouting priority filter; policy accept;

        # Only apply limits to packets that establish new flows.
        ct state != new accept
        # Do not apply limits to local and VMs communication.
        meta mark & $MARK_MASK_REALM {
            $MARK_REALM_LOCAL,
            $MARK_REALM_VIRT
        } accept

        jump drop-on-flow-count
        jump drop-on-new-flow-rate
    }

    chain drop-on-flow-count {
        # Allow at most 1000 simultaneous flows for HTTP.
        tcp dport { 80, 443 } ct count over 1000 \
            counter name dropped-flows \
            drop
        # If this rule is reached, the above threshold did not get reached, so
        # return in order to avoid counting this traffic towards subsequent
        # limits.
        tcp dport { 80, 443 } return

        # Allow at most 500 simultaneous flows for everything else.
        ct count over 500 \
            counter name dropped-flows \
            drop
    }

    chain drop-on-new-flow-rate {
        # Allow at most 20 new flows per minute (with a burst of 500) for HTTP.
        tcp dport { 80, 443 } limit rate over 20/minute burst 500 packets \
            counter name dropped-flows \
            drop
        # If this rule is reached, the above threshold did not get reached, so
        # return in order to avoid counting this traffic towards subsequent
        # limits.
        tcp dport { 80, 443 } return

        # Allow at most 10 new flows per minute (with a burst of 100) for
        # everything else.
        limit rate over 10/minute burst 100 packets \
            counter name dropped-flows \
            drop
    }
}

Anonymous stateful objects associated with set elements are created via an extension to the add and update statements that follows the same syntax as the rule expressions. This allows individual stateful objects to be associated with an arbitrary set of criteria. In effect, this extended syntax turns the add and update statements into a lookup operation for a stateful object associated with a key, which can terminate a rule early (e.g. if a token bucket limiter’s threshold is reached). The following example adds support for tracking flow count and new flow rate per subnet in the highlighted lines:

  • For tracking flow counts, two sets are created: flow-count-ip4 and flow-count-ip6. Elements with IPv4 /24 subnets and IPv6 /48 subnets are added to these sets with associated connection limits. Rule evaluation is terminated early if the threshold isn’t reached; if it is, the new flow-initiating packet is dropped. Elements are automatically removed from the sets when there are no more conntrack entries associated.

  • For tracking flow rates, two sets are created: flow-rate-ip4 and flow-rate-ip6. Elements with IPv4 /24 subnets and IPv6 /48 subnets are added to these sets with associated limit stateful objects. The update operation effectively refreshes the timeout after which elements from the sets expire (1 minute). Rule evaluation is terminated early if the threshold isn’t reached; if it is, the prefix is added to a blocklist and the new flow-initiating packet is dropped. Elements are automatically removed from the sets when they are not refreshed (i.e. if no new flow-initiating packet for the given subnet is received within the set’s 1 minute timeout).

  • The blocklist sets, blocklist-ip4 and blocklist-ip6, contain IPv4 /24 subnets and IPv6 /48 subnets which have exceeded the connection rate thresholds. Elements expire from these sets after 10 minutes, during which no new flows are allowed. It would be possible to omit these extra sets, in which case a new flow could be established as soon as the token bucket rate limiter allowed it.

  • The previously demonstrated global restrictions, applicable to any IPv4 or IPv6 source address, are maintained.

/etc/nftables/tables.d/limits.conf
destroy table inet limits
table inet limits {
    counter dropped-flows {
    }

    set blocklist-ip4 {
        type ipv4_addr
        flags dynamic
        timeout 10m
        size 65536
    }
    set flow-rate-ip4 {
        type ipv4_addr
        flags dynamic, timeout
        timeout 1m
        size 65536
    }
    set flow-count-ip4 {
        type ipv4_addr
        flags dynamic
        size 65536
    }

    set blocklist-ip6 {
        type ipv6_addr
        flags dynamic
        timeout 10m
        size 65536
    }
    set flow-rate-ip6 {
        type ipv6_addr
        flags dynamic, timeout
        timeout 1m
        size 65536
    }
    set flow-count-ip6 {
        type ipv6_addr
        flags dynamic
        size 65536
    }

    chain limits-inbound {
        # This must execute after conntrack lookup (priority -200).
        type filter hook prerouting priority filter; policy accept;

        # Only apply limits to packets that establish new flows.
        ct state != new accept
        # Do not apply limits to local and VMs communication.
        meta mark & $MARK_MASK_REALM {
            $MARK_REALM_LOCAL,
            $MARK_REALM_VIRT
        } accept

        jump drop-on-flow-count
        jump drop-on-new-flow-rate
    }

    chain drop-on-flow-count {
        # Allow at most 50 simultaneous flows per IPv4 /24 subnet or IPv6 /48
        # subnet.
        add @flow-count-ip4 { ip saddr & 255.255.255.0 ct count over 50 } \
            counter name dropped-flows \
            drop
        add @flow-count-ip6 { ip6 saddr & ffff:ffff:ffff:: ct count over 50 } \
            counter name dropped-flows \
            drop

        # Allow at most 1000 simultaneous flows for HTTP.
        tcp dport { 80, 443 } ct count over 1000 \
            counter name dropped-flows \
            drop
        # If this rule is reached, the above threshold did not get reached, so
        # return in order to avoid counting this traffic towards subsequent
        # limits.
        tcp dport { 80, 443 } return

        # Allow at most 500 simultaneous flows for everything else.
        ct count over 500 \
            counter name dropped-flows \
            drop
    }

    chain drop-on-new-flow-rate {
        # Drop packets from IPv4 /24 subnets that have been added to the
        # blocklist.
        ip saddr & 255.255.255.0 @blocklist-ip4 \
            counter name dropped-flows \
            drop
        # Update token bucket rate limiter per IPv4 /24 subnets; if over the
        # threshold, add the subnet to the blocklist and drop the packet.
        update @flow-rate-ip4 { \
                ip saddr & 255.255.255.0 \
                limit rate over 5/second burst 50 packets \
            } \
            add @blocklist-ip4 { ip saddr & 255.255.255.0 } \
            counter name dropped-flows \
            drop

        # Drop packets from IPv6 /48 subnets that have been added to the
        # blocklist.
        ip6 saddr & ffff:ffff:ffff:: @blocklist-ip6 \
            counter name dropped-flows \
            drop
        # Update token bucket rate limiter per IPv6 /48 subnets; if over the
        # threshold, add the subnet to the blocklist and drop the packet.
        update @flow-rate-ip6 { \
                ip6 saddr & ffff:ffff:ffff:: \
                limit rate over 5/second burst 50 packets \
            } \
            add @blocklist-ip6 { ip6 saddr & ffff:ffff:ffff:: } \
            counter name dropped-flows \
            drop

        # Allow at most 20 new flows per minute (with a burst of 500) for HTTP.
        tcp dport { 80, 443 } limit rate over 20/minute burst 500 packets \
            counter name dropped-flows \
            drop
        # If this rule is reached, the above threshold did not get reached, so
        # return in order to avoid counting this traffic towards subsequent
        # limits.
        tcp dport { 80, 443 } return

        # Allow at most 10 new flows per minute (with a burst of 100) for
        # everything else.
        limit rate over 10/minute burst 100 packets \
            counter name dropped-flows \
            drop
    }
}
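
Once this table is loaded, the dynamic sets and the counter can be inspected and adjusted with the userspace commands shown earlier. The following is a sketch in which 192.0.2.0 stands in for an arbitrary blocked /24 prefix:

```shell
# Show the /24 prefixes currently blocked, with their remaining timeouts.
sudo nft list set inet limits blocklist-ip4

# Unblock one prefix before its 10 minute timeout expires.
sudo nft delete element inet limits blocklist-ip4 '{ 192.0.2.0 }'

# Check how many flow-initiating packets have been dropped so far.
sudo nft list counter inet limits dropped-flows
```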

Flowtables

Flowtables are a feature that accelerates packet forwarding for hosts that act as a layer 3 router or layer 2 bridge. The functionality acts as a cache for determining if a packet belongs to a known flow and can be almost directly sent out on a network interface after it is received by a (potentially different) network interface. As such, the feature cannot be used for flows for which one end is a local process via a socket. The flow is identified by an input interface, together with layer 2, layer 3 and layer 4 information, such as PPPoE session, source and destination network addresses, or source and destination transport protocol ports.

For matching flows, the forwarding stack is completely bypassed, including most Netfilter hooks and bridging or routing decisions, up until the point where a layer 2 address needs to be determined (via IPv4 ARP or IPv6 NDP). Queuing disciplines are still applied, allowing for traffic shaping. The forwarding logic bypass may be problematic for dynamic setups where the cached information can become stale (e.g. layer 3 forwarding integrated with layer 2 bridging, if the next hop could move to a different bridge port).

Flowtables are integrated with conntrack, with the flowtable fast path only being activated once a flow has seen packets in both directions. The flows are then refreshed at short, fixed intervals, as long as packets are received and the conntrack timeouts do not occur. The refresh intervals can be configured for TCP and UDP via the net.netfilter.nf_flowtable_tcp_timeout and net.netfilter.nf_flowtable_udp_timeout sysctls, but are fixed in all other cases (30 seconds as of Linux 6.15). The conntrack state is synchronized according to the received packets.
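
The two sysctls named above can be read and changed with the standard sysctl utility. This is a sketch only: the value of 60 is an arbitrary example, assumed to be interpreted in seconds like the 30 second figure above, and the keys may not be present until the relevant Netfilter modules are loaded.

```shell
# Show the current flowtable refresh timeouts for TCP and UDP flows.
sysctl net.netfilter.nf_flowtable_tcp_timeout net.netfilter.nf_flowtable_udp_timeout

# Example only: change the UDP flowtable timeout until the next reboot.
sudo sysctl -w net.netfilter.nf_flowtable_udp_timeout=60
```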

Certain network interface cards (NICs) also support hardware offload of the flow information, further optimizing packet forwarding. When this is not available or not enabled (via the offload flag on the flowtable), software offload in the Linux kernel is used instead.

Flowtables are implemented by using an ingress hook. The priority associated with a flowtable is in effect the ingress hook priority at which the lookup is performed: this means that any chains registered at a lower numerical priority value would still process a packet matching a flowtable, while chains at a higher numerical priority value would not, nor would chains registered at any other Netfilter hook point, apart from egress. More information on flowtables can be found in the Netfilter documentation and Linux kernel documentation.
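The priority ordering described above can be illustrated with a minimal sketch. It assumes a flowtable registered at priority 0 for eth0; the file name, table name, chain names and the priority values -10 and 10 are hypothetical and chosen only for illustration. Comparing the two counters should show how many packets were taken by the flowtable fast path.

/etc/nftables/tables.d/ingress-observe.conf
```nft
destroy table netdev ingress-observe
table netdev ingress-observe {
    chain before-flowtable {
        # Priority -10 is numerically lower than the assumed flowtable
        # priority (0), so this chain sees every packet received on eth0.
        type filter hook ingress device eth0 priority -10; policy accept;

        counter
    }
    chain after-flowtable {
        # Priority 10 is numerically higher than the assumed flowtable
        # priority (0), so packets matching the flowtable never get here.
        type filter hook ingress device eth0 priority 10; policy accept;

        counter
    }
}
```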

The following example enables accelerated forwarding for packets between a set of interfaces. The actual network interface that receives a packet needs to be registered, even if bridging is enabled.

/etc/nftables/tables.d/flow-offload.conf
define LAN_DEVICES = { eth0, eth1, eth2 }

destroy table inet flow-offload
table inet flow-offload {
    flowtable lan-forwarding {
        hook ingress priority 0;
        devices = $LAN_DEVICES;
    }
    chain offload-forward {
        type filter hook forward priority filter; policy accept;

        # Only offload UDP packets.
        meta l4proto udp flow add @lan-forwarding
    }
}
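
As a variation on the example above, the following sketch requests hardware offload through the offload flag and adds both TCP and UDP flows to the flowtable. The file name, the table and flowtable names and the eth0/eth1 interfaces are placeholders, not part of the original configuration. Loading the file may be rejected if the NICs or their drivers do not support flow offload, in which case the flags line can be removed to fall back to software offload.

/etc/nftables/tables.d/flow-offload-hw.conf
```nft
destroy table inet flow-offload-hw
table inet flow-offload-hw {
    flowtable lan-forwarding-hw {
        hook ingress priority 0;
        devices = { eth0, eth1 };
        # Request hardware offload (requires NIC and driver support).
        flags offload;
    }
    chain offload-forward {
        type filter hook forward priority filter; policy accept;

        # Offload both TCP and UDP flows.
        meta l4proto { tcp, udp } flow add @lan-forwarding-hw
    }
}
```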

Other features

FIB lookup and reverse-path filtering

Forwarding Information Base (FIB) is a generic term for a lookup table used in network layer 2 or layer 3 packet forwarding (switching or routing). In this context, it refers to the Linux Routing Policy Database (RPDB), which manages the layer 3 routing decision lookup structures. The fib expression can be used in rules to perform route lookups and make decisions based on the result. Its syntax is flexible and supports several invocation types, as described in the manual page.

One possible use is to determine if a network address is local to the host. Such a test can be performed early, before a routing decision for the packet is made, and need not take the Netfilter mark into account, as the local routing table is usually the first one queried. The following example counts packets for which the network layer destination address (fib daddr) is either a local or broadcast one.

/etc/nftables/tables.d/test-firewall.conf
#!/usr/sbin/nft -f

destroy table inet test-firewall
table inet test-firewall {
    chain test-prerouting {
        type filter hook prerouting priority filter; policy accept;

        fib daddr type { local, broadcast } counter
    }
}
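
The same kind of lookup can also be used to filter rather than count. The sketch below, adapted from the example in the nft manual page, drops packets whose destination address is not configured on the interface they arrived on. The file, table and chain names are hypothetical. Since the rule also drops traffic that would otherwise be forwarded, it is only suitable for a host that does not route packets.

/etc/nftables/tables.d/fib-local-demo.conf
```nft
destroy table inet fib-local-demo
table inet fib-local-demo {
    chain fib-local-prerouting {
        type filter hook prerouting priority filter; policy accept;

        # Drop packets whose destination address is not a local, broadcast
        # or multicast address on the receiving interface.
        fib daddr . iif type != { local, broadcast, multicast } drop
    }
}
```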

A different use case for the route lookups is to perform reverse-path filtering: dropping packets if they arrive on an interface that is not associated with a route that covers the network source address of the packet. If policy routing is employed, different pieces of information may be used to perform routing table selection, in which case the reverse-path filter check may not be accurate; the Netfilter mark can be taken into account by fib expressions, which should cover the majority of the policy routing configurations. The following example demonstrates this use case. It should be noted that:

  • The Netfilter mark must be set prior to the fib expression being employed, as if it were set for a packet before a routing decision in the reverse direction (i.e. a packet with a destination address set to the source address of the reverse-path verified packet).

  • IPv6 configurations usually employ the same link-local prefix (fe80::/64) on all interfaces, requiring special handling.

/etc/nftables/tables.d/test-firewall.conf
#!/usr/sbin/nft -f

destroy table inet test-firewall
table inet test-firewall {
    chain test-prerouting {
        type filter hook prerouting priority filter; policy accept;

        # Must determine Netfilter mark as if for reverse direction here.

        # Chain will drop packets which do not pass the reverse-path filter check.
        jump rp-filter
    }

    chain rp-filter {
        # Ignore IPv6 packets with a link-local source address.
        ip6 saddr fe80::/64 return
        # FIB expression with oif output will return 0 if interface cannot
        # be determined.
        fib saddr . mark . iif oif 0 drop
    }
}
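
Where policy routing is not in use, the Netfilter mark can be left out of the lookup. The following is a minimal sketch of that simpler case, using the oif missing form shown in the nft manual page; the file, table and chain names are hypothetical. Additional exceptions may be needed in practice, depending on the traffic the host is expected to receive.

/etc/nftables/tables.d/rp-filter-demo.conf
```nft
destroy table inet rp-filter-demo
table inet rp-filter-demo {
    chain rp-filter-prerouting {
        type filter hook prerouting priority filter; policy accept;

        # Accept IPv6 packets with a link-local source address without
        # performing the reverse-path check.
        ip6 saddr fe80::/64 accept

        # Drop packets for which no route to the source address exists via
        # the interface on which they were received.
        fib saddr . iif oif missing drop
    }
}
```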

Payload expressions

Payload expressions allow selecting specific fields within a packet. The nft utility supports a large number of such expressions for simple rule management, as documented in the Payload Expressions section of the manual page.

Where these do not suffice or deep packet inspection (DPI) is necessary, raw payload expressions can be used: these allow selecting arbitrary parts (up to 128 bits long) of a packet from a fixed offset off one of the layer 2, layer 3 or layer 4 headers. In fact, the symbolic payload expressions (such as icmpv6 taddr, which selects the target address of NDP or MLD IPv6 packets) are translated by the userspace utility to raw payload expressions, not requiring any special support in the nftables Linux kernel component. The general format used is @base,offset,length, with offset and length defined in bits. The following are the bases off which the offsets can be calculated:

  • @ll: link layer (e.g. Ethernet header). Example: @ll,0,48 is 48 bits from the start of the link layer header; for Ethernet, this is the destination MAC address.

  • @nh: network header (e.g. IPv4, IPv6). Example: @nh,48,8 is 8 bits from 48 bits off the start of the network header; for IPv6, this is the next header field.

  • @th: transport header (e.g. TCP, UDP). Example: @th,110,2 is 2 bits from 110 bits off the start of the transport header; for TCP, these are the SYN and FIN flags.

  • @ih: inner header (payload after transport header). Example: @ih,8,16 is 16 bits from 8 bits off the start of the payload encapsulated in the transport protocol; for TLS, this is the version.
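
The raw payload expressions listed above could be used in rules along the following lines. This is only a sketch: the file, table and chain names are hypothetical, the compared values (58 for ICMPv6, 0x3 for both flags set, 0x0303 for a TLS record version) are chosen for illustration, and the @ih rule assumes nft and kernel versions that support the inner header base. Explicit meta matches are added because raw expressions carry no protocol information of their own.

/etc/nftables/tables.d/raw-payload-demo.conf
```nft
destroy table inet raw-payload-demo
table inet raw-payload-demo {
    chain raw-payload-input {
        type filter hook input priority filter; policy accept;

        # IPv6 packets whose next header field is 58 (ICMPv6).
        meta nfproto ipv6 @nh,48,8 58 counter

        # TCP segments with both the SYN and FIN flags set.
        meta l4proto tcp @th,110,2 0x3 counter

        # TCP payload whose bytes 1-2 hold 0x0303, as in a TLS record
        # header carrying that version value.
        meta l4proto tcp @ih,8,16 0x0303 counter
    }
}
```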

It should be noted that, for the userspace nft utility, there are distinctions between using the raw payload expressions and the symbolic ones, such as:

  • An expression such as ip6 nexthdr will have an implied meta nfproto ipv6 if the expression is added to an inet or bridge table, as packets which are not IPv6 would traverse these, too. @nh,48,8 will take the 7th byte from the start of the network header, whatever protocol that may be (e.g. IPv4, ARP, etc.).

  • The nft utility performs type checking. icmpv6 taddr has an ipv6_addr type (in userspace), allowing comparisons with IPv6 addresses (including prefixes, such as fe80::/64). @th,64,128 is treated as a 128-bit integer, even if preceded by an expression such as icmpv6 type nd-neighbor-advert, which implies that the 128 bits should be interpreted as the IPv6 NDP target address. As such, the use of raw payload expressions for set/map lookups or comparisons may be limited.