Hosting companies differ in their offers, size and so on, but also in their infrastructure choices. Although these choices largely determine how efficient a hosting offer is, they are seldom well known. NBS System is therefore publishing a series of articles presenting the infrastructure and tools it uses, for greater transparency and to shed light on this hidden part of the IT hosting sector. After our article about reverse proxies, we focus today on firewalls and load balancers.
Role and functioning of a firewall
A firewall is a toll booth: a compulsory transit point that controls and directs the flows between networks. To do that, a firewall opens and reads packets at the IP level (Internet Protocol, on the network layer).
To check whether a packet may pass, and to know where to send it if need be, the firewall opens the packet and looks at the following information: source IP address, destination IP address and protocol used (ICMP, TCP, UDP, etc.). If the packet uses TCP or UDP, the firewall also checks the source and destination ports, which provides even more information.
This information is compared to a series of filtering rules that decide whether the packet is accepted or rejected. Each packet thus goes through a series of tests (one per rule): if it does not match a rule, it is tested against the next one. If it matches, it is either accepted or dropped, and thus lost. If, at the end of all the tests, it has not matched any rule, it is dropped by default. One “best practice” for firewalls is to drop any connection that is not specifically authorized.
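The rule walk described above can be sketched in a few lines of Python. This is only a minimal illustration, not how a real firewall is implemented: the `Packet` and `Rule` classes, the `filter_packet` function and the addresses in `RULES` are all made up for the example.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass
class Packet:
    src: str
    dst: str
    proto: str                    # "tcp", "udp", "icmp", ...
    dport: Optional[int] = None   # only meaningful for TCP/UDP


@dataclass
class Rule:
    src_net: str
    dst_net: str
    proto: Optional[str]   # None matches any protocol
    dport: Optional[int]   # None matches any destination port
    action: str            # "ACCEPT" or "DROP"

    def matches(self, pkt: Packet) -> bool:
        return (
            ip_address(pkt.src) in ip_network(self.src_net)
            and ip_address(pkt.dst) in ip_network(self.dst_net)
            and (self.proto is None or self.proto == pkt.proto)
            and (self.dport is None or self.dport == pkt.dport)
        )


def filter_packet(rules, pkt):
    """Test the packet against each rule in order; the first match decides."""
    for rule in rules:
        if rule.matches(pkt):
            return rule.action
    return "DROP"  # default policy: anything not explicitly authorized is dropped


# Hypothetical rule set using documentation-only address ranges.
RULES = [
    Rule("0.0.0.0/0", "192.0.2.10/32", "tcp", 443, "ACCEPT"),
    Rule("198.51.100.0/24", "0.0.0.0/0", None, None, "DROP"),
]
```

With these rules, an HTTPS packet sent to 192.0.2.10 is accepted by the first rule, while a packet that matches nothing falls through to the default drop.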
If the packet is accepted and transmitted, its answer must also be dealt with! That is why firewalls can track connections: in this case they are called “stateful”, and they do “conntrack” (CONNection TRACKing). This means that when they receive and interpret a packet, they first check whether a connection already exists between the source IP and the destination IP within the same protocol, i.e. whether they have already handled packets travelling between the two entities with the same characteristics. Thus, a packet containing the answer to a request sent in a previous packet is accepted by default: the connection already existed and was thus pre-validated by the firewall. It is important to note that every time the firewall receives a new packet belonging to an existing connection, it updates its record of this connection: it thus keeps track of all exchanges on that connection.
These tracked connections have a limited Time To Live (TTL). When this TTL expires, the firewall closes the connection and discards its record, to avoid keeping records of connections that are too old and overburdening the RAM. If a packet arrives for a connection that was closed or does not exist, the firewall creates a new record.
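To make the idea of tracked connections with a TTL more concrete, here is a minimal sketch. It is an assumption-laden toy, not the kernel's conntrack: the `ConnTrack` class and its methods are hypothetical, a connection is identified only by its two IP addresses and protocol, and the current time is passed in explicitly to keep the example deterministic.

```python
class ConnTrack:
    """Toy connection table: remembers when each connection was last seen."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.table = {}  # connection key -> timestamp of the last packet

    @staticmethod
    def _key(src, dst, proto):
        # A request (A -> B) and its answer (B -> A) share the same key.
        return (frozenset((src, dst)), proto)

    def _expire(self, now):
        # Forget connections whose TTL has been exceeded, to save memory.
        stale = [k for k, seen in self.table.items() if now - seen > self.ttl]
        for k in stale:
            del self.table[k]

    def is_known(self, src, dst, proto, now):
        """True if the packet belongs to a live connection (and refresh it)."""
        self._expire(now)
        key = self._key(src, dst, proto)
        if key in self.table:
            self.table[key] = now
            return True
        return False

    def record(self, src, dst, proto, now):
        """Create the record once the filtering rules have accepted the packet."""
        self.table[self._key(src, dst, proto)] = now
```

A stateful firewall built on this sketch would first call `is_known`; only packets that do not belong to a known connection would go through the rule tests, and `record` would be called for those that are accepted.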
NAT: Network Address Translation
This tracking principle notably makes NAT (Network Address Translation) possible. NAT consists in rewriting packets on the fly, replacing one IP address with another: the source IP address, the destination IP address or both. Let us take an example*, illustrated by the diagram opposite.
The firewall receives a packet from machine A, whose IP address we will call IPa. The recipient is the public IP address IPb, which corresponds to the service called B. Knowing this IP address and what it maps to, the firewall replaces this destination IP address, within the packet, with VIPb, the private Virtual IP that corresponds internally to the B service. Only afterwards does it test the addresses.
When it receives the response from the B service, the destination IP is IPa and the source IP is VIPb. Since it comes from one of our services, it is accepted by default. Before sending the packet on, the firewall changes VIPb back into IPb, so that the virtual IP is not exposed to the Internet (it is private and must remain private). However, it can only do so if it recorded the connection beforehand, along with the details of the changes it made, so that it can restore the addresses A expects when the response packet comes back.
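The round trip of this example can be sketched as follows. It is a simplified illustration under stated assumptions: the `NatFirewall` class is hypothetical, addresses are plain strings using the article's own names (IPa, IPb, VIPb), and ports are ignored.

```python
class NatFirewall:
    """Toy destination NAT: public service IP <-> private virtual IP."""

    def __init__(self, public_to_private):
        self.dnat = dict(public_to_private)  # e.g. {"IPb": "VIPb"}
        self.sessions = {}                   # (client IP, private IP) -> public IP

    def inbound(self, src, dst):
        """Rewrite the destination of a request and remember the change."""
        private = self.dnat.get(dst)
        if private is None:
            return src, dst                  # no translation for this address
        self.sessions[(src, private)] = dst
        return src, private

    def outbound(self, src, dst):
        """Restore the public source address on the response."""
        public = self.sessions.get((dst, src))
        if public is None:
            # Without a recorded connection the firewall cannot undo the change.
            raise LookupError("no recorded connection for this response")
        return public, dst


fw = NatFirewall({"IPb": "VIPb"})
request = fw.inbound("IPa", "IPb")   # what the B service receives
response = fw.outbound("VIPb", "IPa")  # what A receives
```

Here `request` is `("IPa", "VIPb")` and `response` is `("IPb", "IPa")`: the private virtual IP never leaves the infrastructure, and the reverse translation only works because the session was recorded on the way in.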
Place within an infrastructure
On most infrastructures, there is one firewall per client. The benefit of this configuration is that the machine’s resources have no trouble handling the client’s traffic. However, there is a drawback: if a change has to be made on the firewalls, they must be modified one by one, which can take quite some time.
In NBS System’s infrastructure
Rather than one firewall per client, we chose to pool our resources while keeping the security provided by firewalls. We thus have 6 firewalls, 2 per zone: Equinix, Iliad (our datacenters) and CerberHost (for our clients benefiting from this very high security Cloud, distributed across the two aforementioned datacenters).
Why two firewalls per zone?
It is vital for us to guarantee the redundancy of our equipment. Thus, our 3 zones each have two firewalls set up in a master/slave configuration. The master firewall handles all the traffic of its zone, while the slave is there as a backup: it takes over the traffic if the master fails. The only drawback of this configuration is that it requires many rather heavy procedures, such as numerous and recurring redundancy tests, to ensure the slave firewall is able to handle the traffic if need be.
Choosing pooled firewalls frees us from the constraint of having to modify a large number of firewalls, one by one, whenever a configuration change is needed. However, it raises a resource challenge: the firewalls must have the capacity to handle all of the traffic of the websites they protect. To make that possible, we optimized them in several ways, detailed below.
Firewalls use a lot of CPU time. If a firewall’s CPU is overloaded, it drops all the packets it was handling, which is a considerable loss: it is thus vital that the CPU is not overused. To that end, we optimized our CPUs, but also our network cards.
What weighs the most is in fact not reading and interpreting the packets, but the interrupts sent by the network card. Every time a packet is received, the network card literally interrupts the CPU: it signals that it has received a packet and needs the CPU to read and interpret it. While it handles the interrupt, the CPU cannot do anything else. Since each packet implies an interrupt, the processing time can be quite long…
Thus, the quality and characteristics of both the network card and the CPU matter. Optimizing only the network card without working on the CPU is useless, and vice versa: we will see why.
We chose 3 GHz (gigahertz) CPUs: this corresponds to 3 billion clock cycles per second, which allows calculations (for instance, handling interrupts) to be done quickly. But that is not all: this must be considered together with the number of CPUs in the machine, as well as the number and characteristics of their cores.
We set up two CPUs on each of our firewalls, each containing 6 hyper-threaded cores. Hyper-threading exposes two logical processors on each core; thus, each core can handle two tasks in parallel. This gives us 24 threads per firewall (2 CPUs × 6 cores × 2 threads). The CPUs thus offer 24 queues in total (one per thread).
But this is not enough. The “simple” calculations needed by the firewall are automatically spread across the 24 queues, but the interrupts are not. With a classic network card, they are handled by default by a single thread: the network card only offers one queue. If nothing is changed, there is no point in having several queues on the CPU, since the network card will always “pick” the same one, and the next interrupt will have to wait for that thread to be free before it is handled. However, all recent network cards make it possible to use multiple queues. We thus configured our network cards (one per CPU) so that they each have 12 queues, which adds up to 24 in total and allows an optimized connection between CPUs and network cards. Interrupts are spread across all of the resources and handled in parallel, which enables us to make the most of our network cards and to save CPU time.
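The arithmetic above, and the general idea of spreading packets over several queues, can be illustrated with a short sketch. The queue selection shown here is an assumption made for the example: real network cards apply their own hashing scheme in hardware, and `pick_queue` with its CRC32 hash is purely hypothetical.

```python
import zlib

CPUS = 2
CORES_PER_CPU = 6
THREADS_PER_CORE = 2   # hyper-threading
NIC_COUNT = 2          # one network card per CPU
QUEUES_PER_NIC = 12

hardware_threads = CPUS * CORES_PER_CPU * THREADS_PER_CORE  # 24
total_queues = NIC_COUNT * QUEUES_PER_NIC                   # 24, one per thread


def pick_queue(src_ip, dst_ip, src_port, dst_port, n_queues):
    """Map a flow to a queue so that packets of one flow stay on one queue."""
    flow = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode()
    return zlib.crc32(flow) % n_queues


# With a single queue every packet lands on queue 0, hence on the same thread.
single = {pick_queue("203.0.113.5", "192.0.2.10", p, 443, 1) for p in range(1000, 1100)}

# With 12 queues the same flows are spread over several queues.
multi = {pick_queue("203.0.113.5", "192.0.2.10", p, 443, QUEUES_PER_NIC) for p in range(1000, 1100)}
```

In this toy model `single` only ever contains queue 0, while `multi` contains several distinct queue numbers, which is what lets interrupts be handled in parallel.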
It is also important to note that 4 years ago, when we set these firewalls up, our goal was to be able to handle 10G of traffic per firewall. Thanks to our investment in 4x10G network cards, it is this latter capacity that we are able to handle on each firewall.
As explained in the first part of the article, packets reaching a firewall go through a series of consecutive tests that determine whether they are accepted or dropped. However, classic firewalls have an average of 80,000 rules, which means 80,000 tests to go through, which is enormous. It means, for instance, that a packet matching the last rule went through 79,999 “useless” tests before being accepted! It is a huge waste of CPU time.
As a consequence, we wanted to change this structure so as not to exceed 200 tests per packet, in order to speed up packet processing and to limit the work of the CPUs.
Our firewalls run on Linux. We use Netfilter, a module of the Linux kernel that lets us set up filtering rules, as well as iptables, the tool layered on top of it that lets us configure it. Netfilter can create sub-chains, and we use them to organize our rules in the form of a search tree.
Our firewalls’ rules are thus structured in the following way:
- If the packet comes from a specific part of our infrastructure services (e.g. monitoring) and goes to a monitoring port on any machine in our fleet, it is accepted by default.
- If it goes to one of our infrastructure services, from a delimited list (DNS, mail, NTP…), then it will also be accepted, no matter where it comes from.
- FROM table: output filtering tests. Only the source IP is controlled. In this table are many testing branches, leading either to the dropping of the packet or to the next table. No packet can be directly accepted in this table.
- TO table: input filtering tests. Only the destination IP is controlled. Here also are many test branches, but the packet is either dropped or accepted. It is the last testing step.
This structure enables us to correctly link the different rules. But that is not all: within the tables themselves, we tried many possibilities in order to find the configuration giving the best results. Indeed, we had to find an equilibrium between the number of tests per branch and the number of branches, since the jump from one branch to another also has a cost in terms of CPU time.
IPv4 addresses are made of 32 bits. We varied the number of bits tested in each branch to find the best possible configuration. For instance, to test 512,000 IP addresses, the best choice is to split the IP address analysis by testing 3 bits at a time. This gives 8 rules per branch, with 8 jumps, and a total of at most 51 tests (evaluations per packet) per IP address. This configuration allows us to test up to 4.2 million packets per second. By comparison, testing 2 bits at a time (4 rules per branch and at most 39 tests, but 11 jumps) causes the performance to drop to 3.9 million packets per second.
We thus configured our rules according to the results obtained during our benchmarks, to optimize our firewalls. All tables on all firewalls have the same configuration so that the level of performance is the same everywhere.
This structure still has one drawback: even though the number of tests per packet drops from 80,000 to fewer than 100, the number of rules easily exceeds 80,000 and reaches 750,000! This requires a lot of RAM, since the rules are stored directly in a memory space dedicated to the Netfilter module. Thankfully, our firewalls have a lot of RAM: they are big machines, since they each handle a large number of clients.
Let us take a fictional example, in which the search tree is made of 3 nested branches of 6 rules each. We assume here that neither of the first two tests explained earlier (concerning the particular cases of our infrastructure) matched.
At each matched rule, we move to the underlying branch until we reach the last one. In our example, the source IP address matches rule 6.2.4: test 6.2.4 thus succeeds and the packet goes on to the TO table, where the destination address is examined in the same way. The only difference is that in this last table, a packet whose destination IP address matches is directly accepted, since there are no more tables to send it to.
Thus, three outcomes are possible for each test in the FROM and TO tables: a jump to the underlying branch, ACCEPT (which, for the FROM table, is equivalent to a jump to the TO table), and DROP (which can happen either when a rule is matched or when, at the end of the tests, the IP address has matched no rule).
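The walk through the FROM and TO tables can be sketched as a small recursive lookup. This is an illustration only, not our actual rule set: the `evaluate` and `filter_packet` functions, the nested-list representation of branches and all the addresses are hypothetical, and branches here test whole prefixes rather than a fixed number of bits.

```python
from ipaddress import ip_address, ip_network


def evaluate(chain, ip, counter):
    """Walk one table. Each rule is (network, target), where target is
    "ACCEPT", "DROP", or a nested list standing for the underlying branch."""
    addr = ip_address(ip)
    for net, target in chain:
        counter["tests"] += 1
        if addr in ip_network(net):
            if isinstance(target, list):
                return evaluate(target, ip, counter)  # jump to the sub-branch
            return target
    return "DROP"  # the address matched no rule of this branch


def filter_packet(from_table, to_table, src, dst):
    """Return (verdict, number of tests performed)."""
    counter = {"tests": 0}
    # In the FROM table, ACCEPT only means "go on to the TO table".
    if evaluate(from_table, src, counter) != "ACCEPT":
        return "DROP", counter["tests"]
    return evaluate(to_table, dst, counter), counter["tests"]


# Made-up tables using private and documentation address ranges.
FROM_TABLE = [
    ("10.0.0.0/8", [
        ("10.1.0.0/16", "DROP"),
        ("10.2.0.0/16", [
            ("10.2.3.0/24", "DROP"),
            ("10.2.4.0/24", "ACCEPT"),
        ]),
    ]),
    ("203.0.113.0/24", "ACCEPT"),
]
TO_TABLE = [
    ("192.0.2.0/24", [
        ("192.0.2.10/32", "ACCEPT"),
    ]),
]

verdict, tests = filter_packet(FROM_TABLE, TO_TABLE, "10.2.4.7", "192.0.2.10")
```

In this example the packet is accepted after 7 tests (5 in the FROM table, 2 in the TO table). The number of tests depends on the depth of the tree and the size of each branch rather than on the total number of rules.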
One last optimization was made. As mentioned in the general presentation of firewalls, they are generally stateful, which means they record connections (conntrack). This state is very useful for the firewall, but it has a high cost in CPU time (looking up whether the connection exists) and in RAM (data storage).
We thus decided to set some of our firewalls’ rules to a “stateless” mode, which means going from conntrack to “no track” for certain connections. These connections are not recorded by the firewalls, which therefore operate in a hybrid state.
Our firewalls do not record connections that do not use NAT, which covers all requests destined to our services’ IP addresses. Indeed, these addresses (as we will see) are configured on our load balancers, so the firewall has no address translation to do: its sole mission is to analyze the packet and forward it. But how, in that case, can our firewalls spot the responses to these requests if they have not been recorded? They are simply configured so that packets coming from our services are accepted by default: even if the firewall does not know that a response matches a particular request, it lets it through anyway.
Thanks to this decision, we reduced our records to 20,000 connections, down from 1 million. It is a huge saving in RAM, and it spares our CPUs unnecessary operations.
Generally: working with NAT
The role of a load balancer is, as its name suggests, to balance the load across the equipment behind it. For a website hosted on several servers, for instance, it balances the traffic between these servers so as not to end up with one overloaded server, no longer able to handle all requests, and another one sitting completely idle.
Usually, load balancers use NAT to work.
Let us imagine* that machine A (IPa) wishes to call the B service (VIPb, for Virtual IP b), present on the X and Y servers (whose private IP addresses, IPx and IPy, are not known to the Internet either). The sent packet’s source IP is IPa and its destination IP is VIPb. For now, we consider that the packet has reached its destination, since it has reached the load balancer, one of whose addresses is VIPb. But the load balancer knows that B is hosted on both the X and Y servers, and that it must send the packet to one of them for the request to be processed.
That is where NAT comes in: the load balancer changes the destination IP (VIPb) to IPx or IPy, depending on the load on each of the servers, so that one of them processes the request.
It also replaces the source IP (IPa) with its own IP (IPo, for own IP), so that the responses come back to it directly. Without that, the firewall upstream would not recognize the packet sent by X or Y as an answer: it forwarded a packet from IPa to VIPb, but receives a response sent by IPx/IPy! The packet would be dropped, and the response to A’s request would be lost. By modifying the source IP and receiving the response, the load balancer has the opportunity to change the addresses back so that they match what A (and the firewall) expect to get (source IP: VIPb, destination IP: IPa).
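The double rewrite performed by a NAT-based load balancer can be sketched as follows. Several things are assumptions made for the illustration: the `NatLoadBalancer` class is hypothetical, the "load" of a server is simply its number of open requests, and a `request_id` stands in for whatever a real load balancer uses to pair a response with its request (typically ports).

```python
class NatLoadBalancer:
    """Toy NAT load balancer using the article's address names."""

    def __init__(self, vip, own_ip, backends):
        self.vip = vip                            # VIPb
        self.own_ip = own_ip                      # IPo
        self.load = {b: 0 for b in backends}      # open requests per server
        self.pending = {}                         # request id -> (client, server)

    def forward(self, request_id, src, dst):
        """Rewrite (IPa, VIPb) into (IPo, IPx or IPy) and remember the client."""
        if dst != self.vip:
            raise ValueError("packet is not addressed to this service")
        backend = min(self.load, key=self.load.get)  # least loaded server
        self.load[backend] += 1
        self.pending[request_id] = (src, backend)
        return self.own_ip, backend

    def reply(self, request_id, src, dst):
        """Rewrite (IPx or IPy, IPo) back into (VIPb, IPa)."""
        client, backend = self.pending[request_id]
        if (src, dst) != (backend, self.own_ip):
            raise ValueError("unexpected response")
        del self.pending[request_id]
        self.load[backend] -= 1
        return self.vip, client


lb = NatLoadBalancer("VIPb", "IPo", ["IPx", "IPy"])
to_server = lb.forward(1, "IPa", "VIPb")
to_client = lb.reply(1, "IPx", "IPo")
```

Here `to_server` is `("IPo", "IPx")` and `to_client` is `("VIPb", "IPa")`: the server only ever sees the load balancer's own address, and the client only ever sees the virtual IP.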
The drawback of this method is that it consumes a huge number of IP addresses: there is one IP address per machine (dedicated to routing; we are not even counting here the physical IP address used for the administration of the machine). This number is multiplied again if we define one IP address per service, as we do at NBS System!
NBS System: working with Direct Routing
We decided to work with Direct Routing. Here, we do not play with IP addresses, but with the physical addresses of the machines (MAC addresses). Each network card has a unique MAC address, so each machine has one or several unique MAC addresses that identify it. With Direct Routing, the packets are also modified, but it is the MAC address rather than the IP addresses that is changed.
Let us go back to the earlier example*: A (IPa) and our VIPb, corresponding to the B service, are still there. What changes in this configuration is that VIPb is configured not only on the load balancer, but also on the X and Y servers, which do not have any specific IP address dedicated to the web. X and Y thus have the same IP address: there is no way for the load balancer to differentiate them this way. That is why it uses their MAC addresses! The MAC address is not originally written in the packet, since A does not know it.
Indeed, the MAC address of a machine is known only to the other machines on its network. Our load balancers, which we placed in all of our internal networks and which thus know the MAC addresses of all of NBS System’s equipment, fill it in. The packet reaches the server with the right IP address and the right MAC address, and there you go!
Another difference between load balancing using NAT and load balancing using Direct Routing is the symmetry of the routing: in the latter case, it is asymmetrical, as the reader can see on the diagram. Since the load balancer did not change the IPs, the return packet’s source IP is VIPb and its destination IP is IPa. The answer thus does not go through the load balancer, but is sent directly back to A, through, of course, the firewall filtering all exchanges (we choose not to mention other possible intermediary equipment such as reverse proxies).
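Direct Routing can be sketched in the same spirit. Again, this is a toy model under stated assumptions: the `Frame` and `DirectRoutingBalancer` classes and the `server_reply` function are hypothetical, MAC addresses are placeholder strings, and servers are chosen in simple rotation rather than according to their actual load.

```python
from dataclasses import dataclass, replace
from itertools import cycle


@dataclass(frozen=True)
class Frame:
    dst_mac: str   # layer 2: which machine on the local network gets the frame
    src_ip: str    # layer 3: left untouched by Direct Routing
    dst_ip: str


class DirectRoutingBalancer:
    """Toy Direct Routing balancer: only the destination MAC is rewritten."""

    def __init__(self, vip, backend_macs):
        self.vip = vip
        self._next_mac = cycle(backend_macs)  # simple rotation between servers

    def forward(self, frame):
        if frame.dst_ip != self.vip:
            raise ValueError("packet is not addressed to this service")
        return replace(frame, dst_mac=next(self._next_mac))


def server_reply(request):
    """The server owns VIPb too, so it answers A itself: (source IP, destination IP)."""
    return request.dst_ip, request.src_ip


lb = DirectRoutingBalancer("VIPb", ["mac-x", "mac-y"])
incoming = Frame(dst_mac="mac-lb", src_ip="IPa", dst_ip="VIPb")
delivered = lb.forward(incoming)
answer = server_reply(delivered)
```

The frame reaches server X with its IP addresses unchanged (`IPa` to `VIPb`), and `answer` is `("VIPb", "IPa")`: nothing has to be translated back, so the response does not need to pass through the load balancer.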
In this example alone, the number of IP addresses used is already divided by 3. The choice of Direct Routing thus allows us to rationalize the use of our IP addresses.
Firewalls and load balancers are integrated into our infrastructure as a whole. The entire routing between these pieces of equipment is handled and automated internally by BGP (Border Gateway Protocol). Thus, if we add a service to the infrastructure (with a new VIP), BGP announces it and the equipment is immediately informed!
However, this way of working requires our firewalls and load balancers to have the same configuration, so that they can work together. This configuration is created by an automaton. The administrator of the equipment writes what they want in this automaton, in a language that humans can understand. The automaton then creates and validates the corresponding configuration, in a language that the machine can understand.
Indeed, we believe that a system in an infrastructure can only be efficient if it is understandable by humans. This is a comfort for the administrator, but also for the people operating the equipment: they can apply controlled instructions, and easily ensure that this equipment is operated in a stable and secure way.
After our article on reverse proxies, you now know how firewalls and load balancers work at NBS System! See you soon for our next article…
* The names of the IP addresses used in the examples are not a standard notation. We chose to name them that way in this article to make it easier for the reader to follow.
Source: Denis Pompilio