DesignWare Technical Bulletin

Accelerating Cryptographic Operations in the TLS Protocol

By Derek Bouius, Product Marketing Manager, Security IP

Discover the advantages of offloading TLS cryptographic operations to dedicated hardware to free the main CPU to perform other tasks.

Introduction

The Transport Layer Security (TLS) protocol secures the communication link between applications over the Internet. TLS is now deployed as the default for many web-based connections between clients and servers, enabling payment transactions, protecting personal data, and ensuring safe transmission between devices.

The TLS protocol is implemented directly on top of the transport layer (Figure 1), enabling application protocols above it (e.g., HTTP, SMTP email, etc.) to operate unchanged.

Figure 1: TLS in the networking framework

The TLS protocol provides security for communication across a network by preventing eavesdropping, link tampering, or message forgery using the cryptographic methods of encryption, authentication, and data integrity.

  • Encryption methods obfuscate the data sent across the link, making it computationally infeasible for eavesdroppers to view the transfer. The AES ciphers are the most common algorithms used in TLS.
  • Authentication methods verify the validity of provided identification material from both sides of the link to ensure the devices are who they say they are. This process uses public key cryptography (also known as asymmetric key cryptography) algorithms like RSA and ECC.
  • Integrity methods prevent the message from being modified or forged. An integrity check is performed by passing the data through a hash operation, such as SHA, to generate a fixed length digest, where a totally different digest is generated when one bit is changed in the original data.
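The digest behavior described in the last bullet can be illustrated with a minimal sketch using Python's standard hashlib module. The message text and the helper names (sha256_hex, flip_bit, differing_bits) are illustrative assumptions, not part of TLS itself.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the fixed-length SHA-256 digest of data as 64 hex characters."""
    return hashlib.sha256(data).hexdigest()


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count how many bits differ between two hex-encoded digests."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Hypothetical message, used only for illustration.
message = b"transfer 100 to account 42"
tampered = flip_bit(message, 0)

digest_original = sha256_hex(message)
digest_tampered = sha256_hex(tampered)
changed = differing_bits(digest_original, digest_tampered)

print(digest_original)
print(digest_tampered)
print(f"{changed} of 256 digest bits differ after a one-bit change")
```

The digest length stays the same regardless of the input length, while a single flipped input bit changes a large share of the digest bits.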

Combined, all three methods create a system to support secure communication. For example, modern web browsers are able to authenticate both the client and server, perform message integrity checks for every record, and provide support for a variety of cipher suites. Due to the increase in attacks on the TLS protocol, the standard is evolving to increase the strength of the cryptographic operations required, as well as defining ways to improve the protocol.

Because of these increasing cryptographic requirements, increased processor load is the most significant limitation to implementing TLS. Cryptography is very CPU intensive, specifically the big number calculations used in public key operations (e.g., modular exponentiation for RSA). As a result, performance varies for both the client and the server in designs using the various TLS cipher suites. The system performance also depends on how often connections are established, how long they last, and the expected data throughput requirements. To increase performance while reducing the load on the main CPU, designers can use dedicated hardware to accelerate the cryptographic algorithms in both client and server applications.

Cryptographic Operations in the TLS Protocol

There are two main phases of the TLS protocol: handshake and application record processing (Figure 2). The first phase is the handshake, which establishes a cryptographically secure data channel. The connection peers agree on the cipher suite to be used and the keys used to encrypt the data. The first exchange of the handshake is for cipher suite negotiation as the client and server determine the cryptographic parameters to be used for the session. These parameters, collectively called a cipher suite, consist of:

  • TLS protocol version to be used: e.g., TLSv1.2, TLSv1.1
  • Key exchange method: e.g., RSA (2048-bit), DHE
  • Secret key cipher method: e.g., AES128, AES256-GCM
  • Digest method: e.g., SHA256, SHA384

The next part of the protocol handshake allows both devices (client and server) to authenticate their identity. This authentication mechanism allows the client to verify that the server is who it claims to be (e.g., your bank when connecting with a browser) and not someone simply pretending to be the desired destination. Verification is based on the established chain of trust through certificate verification using public key signatures. In addition, the server can also optionally verify the client certificate.

At the conclusion of the handshake process, an asymmetric cipher is used to establish and generate a set of shared keys for the session. There are various algorithms available to perform this task. The most common now include the concept of forward secrecy, where the client and server negotiate a key that never crosses the communication path and is destroyed at the end of the session. One implementation of this is called the Ephemeral Diffie-Hellman (DHE) handshake, where the RSA private key from the server is used to sign a Diffie-Hellman key exchange between the client and the server. The pre-master key (PMK) obtained from the Diffie-Hellman handshake is then used for encryption. Since the pre-master key is specific to a connection between a client and a server, and used only for a limited amount of time, it is called "ephemeral".

With this solution, an attacker who obtains the server's private key is still unable to decrypt previous communication sessions. Diffie-Hellman ensures that the pre-master key is never transmitted between the client and the server, so it cannot be intercepted. The pre-master key is then used to derive the session keys.

Figure 2: The TLS handshake messaging protocol performs server and client authentication along with session key exchange. Components in orange boxes are CPU-intensive cryptographic operations.

The second phase of the protocol is the record processing of application data (see Figure 3). The TLS protocol provides its own message framing mechanism to split application data into segments of at most 16 KB. Each segment is then protected with a message authentication code (MAC) computed using the selected hash method, and the resulting digest is appended to the message. Next, the message is encrypted with the selected cipher method and the record header is added. Finally, the receiver parses the header, decrypts the message, and verifies the sent MAC value, ensuring message integrity and authenticity.
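The framing and MAC steps of record processing can be sketched as follows. This is a simplified illustration under stated assumptions: it uses HMAC-SHA256 from the Python standard library, the key is a fixed demonstration value, the function names are hypothetical, and the encryption step and record header are omitted because the standard library has no AES implementation. A real TLS MAC also covers a sequence number and header fields, which this sketch leaves out.

```python
import hashlib
import hmac
from typing import List

MAX_FRAGMENT = 2 ** 14  # 16 KB (16384 bytes) maximum segment size
MAC_LEN = 32            # SHA-256 digest length in bytes


def fragment(data: bytes, max_len: int = MAX_FRAGMENT) -> List[bytes]:
    """Split application data into segments of at most max_len bytes."""
    return [data[i:i + max_len] for i in range(0, len(data), max_len)]


def mac_and_append(segment: bytes, mac_key: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over the segment and append it."""
    tag = hmac.new(mac_key, segment, hashlib.sha256).digest()
    return segment + tag


def verify_and_strip(protected: bytes, mac_key: bytes) -> bytes:
    """Receiver side: recompute the tag and compare in constant time."""
    body, tag = protected[:-MAC_LEN], protected[-MAC_LEN:]
    expected = hmac.new(mac_key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("MAC check failed")
    return body


# Demonstration values only: a fixed key and 40000 bytes of dummy data.
mac_key = bytes(range(32))
application_data = bytes(40000)

segments = fragment(application_data)
protected = [mac_and_append(s, mac_key) for s in segments]
recovered = b"".join(verify_and_strip(p, mac_key) for p in protected)

print([len(s) for s in segments])
```

In a full implementation each protected segment would then be encrypted and given a record header before transmission, which is the part a protocol accelerator can perform in the same pass over the data.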

Figure 3: TLS record generation for application data

Performance Calculations and Comparison of Cryptographic Acceleration Options for Servers and Clients

The performance of a secure TLS connection is influenced by the selection of the cipher suite, which defines the specific algorithm and the key size. We will now calculate the performance of a specific example using the TLS_DHE_RSA_WITH_AES_128_CBC_SHA256 cipher suite, since this is currently a minimum suggested strength for a secure TLS connection. Of course, in any network protocol, network effects such as latency and throughput also affect real-world performance. The calculations below concentrate only on the major computational tasks of the end devices. The first calculation profile is the handshake phase.
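To make the structure of the cipher suite name concrete, the sketch below splits TLS_DHE_RSA_WITH_AES_128_CBC_SHA256 into its parts. The parser is a hypothetical helper written for this example. It assumes names of the form TLS_<key exchange>_<authentication>_WITH_<cipher>_<digest> and is not a general parser for every registered suite.

```python
def parse_cipher_suite(name: str) -> dict:
    """Split a TLS 1.2 style cipher suite name into its components.

    Assumes the form TLS_<kx>_<auth>_WITH_<cipher>_<digest>. When only one
    token precedes _WITH_ (for example TLS_RSA_WITH_...), the same algorithm
    is reported for both key exchange and authentication.
    """
    prefix = "TLS_"
    if not name.startswith(prefix) or "_WITH_" not in name:
        raise ValueError(f"unsupported cipher suite name: {name}")
    handshake_part, record_part = name[len(prefix):].split("_WITH_", 1)
    key_exchange, _, authentication = handshake_part.partition("_")
    if not authentication:
        authentication = key_exchange
    cipher, _, digest = record_part.rpartition("_")
    return {
        "key_exchange": key_exchange,
        "authentication": authentication,
        "cipher": cipher,
        "digest": digest,
    }


suite = parse_cipher_suite("TLS_DHE_RSA_WITH_AES_128_CBC_SHA256")
for field, value in suite.items():
    print(f"{field}: {value}")
```

The first two fields drive the cost of the handshake phase, and the last two drive the cost of record processing.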

For the purposes of the calculations, we assume a three-deep certificate chain (SoC -> Device ODM -> Device OEM) for both the server and client. When the device certificates are exchanged and verified with RSA, the signature verification creates the bulk of the CPU load. The RSA verify process uses a modular exponentiation with a fixed public key exponent, so the operation is not as resource intensive as an RSA signature generation. Modular exponentiation is also the dominant math operation in generating an RSA signature, but the exponent is much larger and therefore takes more processing time. The modular exponentiation function can be offloaded into a dedicated hardware acceleration engine, called a public key accelerator.
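The gap between verify and sign cost can be illustrated by counting the modular multiplications in a textbook square-and-multiply exponentiation. This is a sketch under stated assumptions: the public exponent is taken to be the commonly used value 65537, while the 2048-bit exponent and the modulus are arbitrary stand-ins with the right size, not a real RSA key. Production implementations and hardware accelerators typically use more refined algorithms.

```python
from typing import Tuple


def square_and_multiply(base: int, exponent: int, modulus: int) -> Tuple[int, int, int]:
    """Left-to-right binary modular exponentiation.

    Returns (result, squarings, multiplies) so the work can be compared
    for exponents of different sizes.
    """
    if exponent < 1:
        raise ValueError("exponent must be positive")
    result = base % modulus
    squarings = 0
    multiplies = 0
    # Skip the '0b' prefix and the leading 1 bit, which is covered by result = base.
    for bit in bin(exponent)[3:]:
        result = (result * result) % modulus
        squarings += 1
        if bit == "1":
            result = (result * base) % modulus
            multiplies += 1
    return result, squarings, multiplies


# Stand-in values for illustration only (not a real RSA key).
MODULUS = (1 << 2048) - 1
PUBLIC_EXPONENT = 65537                  # common fixed public exponent (assumed)
PRIVATE_LIKE = int("10" * 1024, 2)       # 2048-bit stand-in, half the bits set
MESSAGE = 0x1234567

_, verify_sq, verify_mul = square_and_multiply(MESSAGE, PUBLIC_EXPONENT, MODULUS)
_, sign_sq, sign_mul = square_and_multiply(MESSAGE, PRIVATE_LIKE, MODULUS)

print(f"verify: {verify_sq} squarings + {verify_mul} multiplies")
print(f"sign:   {sign_sq} squarings + {sign_mul} multiplies")
```

With these stand-in values the signing-sized exponent needs well over a hundred times as many modular multiplications as the verify exponent, which is why signature generation dominates the server-side load.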

The next major CPU intensive process of this cipher suite handshake is the DHE component, where there are multiple modular exponentiations for both the server and the client. DHE works as follows:

  1. Server sends Client a SERVER KEY EXCHANGE message during the TLS Handshake. The message contains:
    1. Prime number p
    2. Generator g
    3. Server's Diffie-Hellman public value A = g^X mod p, where X is a large random private integer chosen by the server and never shared with the client.
    4. Signature (S) of the above (plus two random values) computed using the server's private RSA key
  2. Client verifies the signature S
  3. Client sends server a CLIENT KEY EXCHANGE message. The message contains:
    1. Client's Diffie-Hellman public value B = g^Y mod p, where Y is a private integer chosen at random and never shared.
  4. Server and Client calculate the pre-master key (PMK) using each other's public values:
    1. Server calculates PMK = B^X mod p
    2. Client calculates PMK = A^Y mod p
  5. Client sends a CHANGE CIPHER SPEC message to the server, and both parties continue the handshake using ENCRYPTED HANDSHAKE MESSAGES

Notes: The prime p, and ideally X and Y, should not be smaller than the size of the RSA private key. X and Y can be smaller than this, but the cryptographic strength and computation effort of the calculation are directly proportional to their length. g can be small and has no bearing on the calculation effort.
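The exchange in steps 1 through 4 can be sketched with Python's built-in three-argument pow. This is a toy illustration only: the prime (2^127 - 1) is far too small for real use, g = 3 is an arbitrary small base rather than a vetted generator, the function names are hypothetical, and the RSA signature over the server's parameters (step 1.4 and step 2) is omitted.

```python
import secrets

# Toy parameters for illustration only; real deployments use much larger primes.
P = 2 ** 127 - 1   # a known Mersenne prime, far too small for production
G = 3              # arbitrary small base (assumption, not a vetted generator)


def dh_keypair(p: int, g: int):
    """Pick a random private value and compute the public value g^private mod p."""
    private = secrets.randbelow(p - 3) + 2   # uniform in [2, p - 2]
    public = pow(g, private, p)              # one modular exponentiation
    return private, public


def dh_shared(peer_public: int, private: int, p: int) -> int:
    """Compute the pre-master key from the peer's public value."""
    return pow(peer_public, private, p)      # one modular exponentiation


# Server side: X stays private, A is sent in SERVER KEY EXCHANGE.
X, A = dh_keypair(P, G)
# Client side: Y stays private, B is sent in CLIENT KEY EXCHANGE.
Y, B = dh_keypair(P, G)

server_pmk = dh_shared(B, X, P)   # PMK = B^X mod p
client_pmk = dh_shared(A, Y, P)   # PMK = A^Y mod p

print("pre-master keys match:", server_pmk == client_pmk)
```

Each side performs two modular exponentiations here, one for its public value and one for the pre-master key, which is the work a public key accelerator would take over.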

These operations conclude the CPU intensive operations of the handshake phase since the session master key generation and other setup is trivial in comparison. The end result is that there can be up to 6 modular exponentiations on the server and 5 on the client to perform this handshake, where 3 of each of these are the shorter verify exponent operations.

Typical system on chip (SoC) designs clock cryptographic accelerators at a divider of the main bus frequency, which allows the peripherals to run at a more energy efficient clock rate. In our example, we select 200 MHz for the CPU and 100 MHz for the accelerator. Since these operations are all atomic, they scale directly with frequency. Table 1 shows that you can easily achieve a 10 times improvement in the number of connections per second by using a hardware accelerated system to accelerate the public key operations.

Maximum Number of TLS Handshake Connections per Second

          Software only      Hardware accelerated    Performance
          (CPU @ 200 MHz)    (@ 100 MHz)             improvement
Server    1.5                15.2                    10.1X
Client    2.2                22.7                    10.3X

Table 1: Hardware acceleration can provide 10x the number of connections per second compared to a software implementation, while running at half the frequency, thus requiring less energy.
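The ratios in Table 1 can be reproduced with a few lines of arithmetic. The sketch below also normalizes by clock frequency, relying on the statement above that these operations scale directly with frequency. The per-clock figure is derived here for illustration and is not a number quoted in the table.

```python
SOFTWARE_MHZ = 200      # CPU clock for the software-only case
ACCELERATOR_MHZ = 100   # clock for the hardware accelerated case

# (software only, hardware accelerated) handshakes per second from Table 1
HANDSHAKES_PER_SECOND = {
    "Server": (1.5, 15.2),
    "Client": (2.2, 22.7),
}


def speedup(software: float, hardware: float) -> float:
    """Raw improvement in connections per second."""
    return hardware / software


def per_clock_gain(software: float, hardware: float) -> float:
    """Improvement after normalizing each rate by its clock frequency."""
    return (hardware / ACCELERATOR_MHZ) / (software / SOFTWARE_MHZ)


results = {}
for role, (sw, hw) in HANDSHAKES_PER_SECOND.items():
    results[role] = (speedup(sw, hw), per_clock_gain(sw, hw))
    print(f"{role}: {results[role][0]:.1f}X raw, {results[role][1]:.1f}X per clock")
```

Because the accelerator runs at half the CPU frequency, the per-clock gain is twice the raw improvement shown in the table.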

The record processing phase is less complex and consists of a single SHA-256 digest generation and then AES-128 encryption of the data and digest as shown in Figure 3. These cryptographic operations can also be accelerated with dedicated hardware such as an AES and/or SHA engine or a full protocol accelerator that performs both operations in a single pass of the data. Figure 4 shows the dramatic increase in throughput capability of a protocol accelerator compared to a software implementation. 

Figure 4: TLS record processing throughput can be accelerated greatly with a dedicated hardware implementation to encrypt/decrypt the data and generate/verify the MAC.

The actual operation to optimize in a final system depends on the use case of the secure communication path. For a server where many clients need to connect and send a small amount of data, the number of handshakes per second the server can perform would be the operation to optimize. The record processing phase should be optimized if there are only a few clients connecting to a server, but each client session transfers a large amount of data. In some cases, both operations are critical.

On a client, such as a secure camera, it is most beneficial to optimize the record processing capability with dedicated hardware. If initial connection latency to a server is critical, it may also be necessary to optimize the handshake phase with a public key accelerator.
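A rough way to decide which phase to optimize is to estimate what share of a session's compute time goes to the handshake. The sketch below is a simple model under loud assumptions: the handshake rate of 1.5 per second is the software-only server figure from Table 1, but the record processing throughput of 10 MB/s and both session sizes are hypothetical placeholders, not measured values.

```python
def handshake_share(handshakes_per_s: float,
                    record_bytes_per_s: float,
                    session_bytes: float) -> float:
    """Fraction of a session's crypto time spent in the handshake."""
    handshake_time = 1.0 / handshakes_per_s
    record_time = session_bytes / record_bytes_per_s
    return handshake_time / (handshake_time + record_time)


SERVER_SW_HANDSHAKES_PER_S = 1.5      # from Table 1 (software only, server)
ASSUMED_RECORD_BYTES_PER_S = 10e6     # hypothetical software record throughput

small_session = 1_000                 # hypothetical: 1 KB per connection
large_session = 1_000_000_000         # hypothetical: 1 GB per connection

share_small = handshake_share(SERVER_SW_HANDSHAKES_PER_S,
                              ASSUMED_RECORD_BYTES_PER_S, small_session)
share_large = handshake_share(SERVER_SW_HANDSHAKES_PER_S,
                              ASSUMED_RECORD_BYTES_PER_S, large_session)

print(f"small sessions: {share_small:.1%} of time in handshake")
print(f"large sessions: {share_large:.1%} of time in handshake")
```

Under these assumed numbers, short sessions are dominated by the handshake and long sessions by record processing, which matches the guidance above. Substituting measured rates for a specific design would give a more reliable answer.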

Conclusion

TLS is a mandatory requirement for securing communication between devices, and due to the attacks on low level cryptography, increased cryptographic computations are required. Dedicated hardware acceleration, like the DesignWare SSL/TLS/DTLS Security Protocol Accelerator, can greatly improve system performance (latency and throughput), while freeing CPU cycles for other tasks.
