The performance of a secure TLS connection is influenced by the selection of the cipher suite, which defines the specific algorithm and the key size. We will now calculate the performance of a specific example using the TLS_DHE_RSA_WITH_AES_128_CBC_SHA256 cipher suite, since this is currently a minimum suggested strength for a secure TLS connection. Of course in any network protocol, normal network effects like latency and throughput have real-world effects. The calculations below only concentrate on the major computational tasks of the end devices. The first calculation profile is the handshake phase.
For the purposes of the calculations, we assume a 3 deep certificate chain (SoC-> Device ODM -> Device OEM) for both the server and client. When the device certificates are exchanged and verified with RSA, the signature verification creates the bulk of the CPU load. The RSA verify process uses a modular exponentiation with a fixed public key exponent, so the operation is not as resource intensive as an RSA signature generation. The modular exponentiation is also the dominant math operation in generating an RSA signature but the exponent is much larger, thus taking more processing time. The modular exponentiation function can be offloaded into a dedicated hardware acceleration engine, called a public key accelerator.
The next major CPU intensive process of this cipher suite handshake is the DHE component, where there are multiple modular exponentiations for both the server and the client. DHE works as follows:
- Server sends Client a SERVER KEY EXCHANGE message during the TLS Handshake. The message contains:
- Prime number p
- Generator g
- Server's Diffie-Hellman public value A = g^X mod p, where X is a large random private integer chosen by the server and never shared with the client.
- Signature (S) of the above (plus two random values) computed using the server's private RSA key
- Client verifies the signature S
- Client sends server a CLIENT KEY EXCHANGE message. The message contains:
- Client's Diffie-Hellman public value B = g^Y mod p, where Y is a private integer chosen at random and never shared.
- Server and Client calculates the pre-master key (PMK) using each other's public values:
- Server calculates PMK = B^X mod p
- Client calculates PMK = A^Y mod p
- Client sends a CHANGE CIPHER SPEC message to the server, and both parties continue the handshake using ENCRYPTED HANDSHAKE MESSAGES
Notes: The prime p, as well as X and Y should not be smaller than the size of the RSA private key. While X and Y can be smaller, the cryptographic strength and computation effort of the calculation is directly proportional to their length. g can be small and has no bearing on the calculation effort.
These operations conclude the CPU intensive operations of the handshake phase since the session master key generation and other setup is trivial in comparison. The end result is that there can be up to 6 modular exponentiations on the server and 5 on the client to perform this handshake, where 3 of each of these are the shorter verify exponent operations.
Typical system on chip (SoC) designs clock cryptographic accelerators at a divider of the main bus frequency which allows the peripherals to run at a more energy efficient clock rate. In our example, we select 200 MHz for the CPU and 100 MHz for the accelerator. Since these operations are all atomic, they scale directly with frequency. Table 1 shows that you can easily achieve a 10 times improvement for number of connections per second by using a hardware accelerated system to accelerate the public key operations.