Websocket Protocol Overview

Let's continue developing the mental model with a few additions
Published on 2024/03/14

Today we're taking a look at the Protocol Overview section of the Websocket RFC. It starts by outlining the two main parts of the protocol: the handshake and the data transfer. From what we've read so far, we know the handshake is how the two parties acknowledge the beginning of a communication. Unlike regular HTTP polling, where every request pays for its own round trip and its own set of headers, here we pay that overhead only once. We just want to establish the start of the exchange and, after that, we dive into data transfer.

The handshake from the client looks like this:

GET /chat HTTP/1.1
Host: server.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Origin: http://example.com
Sec-WebSocket-Protocol: chat, superchat
Sec-WebSocket-Version: 13

There's a pretty extensive RFC describing the first line, which follows the Request-Line format. But to get things rolling, all we need to know is that we specify the request method (e.g. GET, POST), the endpoint (e.g. /chat) and the protocol version (e.g. HTTP/1.1). The remaining lines are header fields. Their order doesn't matter, and there's a specific section in the RFC that goes over them (we'll get there in the future). The handshake from the server looks as follows:

HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
Sec-WebSocket-Protocol: chat

The first line follows the Status-Line format, and here it's easy to guess what it represents. It starts by confirming the protocol version, then sends back a numeric status code (e.g. 200, 404) followed by a brief description (e.g. OK, Not Found). In this specific case, 101 Switching Protocols says the server agrees to upgrade the connection to websocket.
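To make the two first lines less abstract, here is a minimal Python sketch that splits both handshakes into their first line and header fields. The helper name parse_head is my own invention, not something from the RFC or a library, and it skips all the validation a real implementation would need.

```python
from http import HTTPStatus


def parse_head(raw: str):
    """Hypothetical helper: split a handshake into its first line and header fields."""
    head, _, _ = raw.partition("\r\n\r\n")
    first_line, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        # Split on the first colon only, so values like URLs stay intact.
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so normalize them to lowercase.
        headers[name.strip().lower()] = value.strip()
    return first_line, headers


REQUEST = (
    "GET /chat HTTP/1.1\r\n"
    "Host: server.example.com\r\n"
    "Upgrade: websocket\r\n"
    "Connection: Upgrade\r\n"
    "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==\r\n"
    "Origin: http://example.com\r\n"
    "Sec-WebSocket-Protocol: chat, superchat\r\n"
    "Sec-WebSocket-Version: 13\r\n"
    "\r\n"
)

RESPONSE = (
    "HTTP/1.1 101 Switching Protocols\r\n"
    "Upgrade: websocket\r\n"
    "Connection: Upgrade\r\n"
    "Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=\r\n"
    "Sec-WebSocket-Protocol: chat\r\n"
    "\r\n"
)

# Request-Line: method, endpoint, protocol version.
request_line, request_headers = parse_head(REQUEST)
method, target, version = request_line.split(" ", 2)

# Status-Line: protocol version, status code, brief description.
status_line, response_headers = parse_head(RESPONSE)
http_version, code, reason = status_line.split(" ", 2)

print(method, target, version)        # GET /chat HTTP/1.1
print(int(code), reason)              # 101 Switching Protocols
print(HTTPStatus(int(code)).phrase)   # Switching Protocols
```

Storing the header names in lowercase is just a convenience for lookups; it works because HTTP header names are case-insensitive.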

Once the client and server have both sent their handshakes, and if the handshake was successful, then the data transfer part starts. This is a two-way communication channel where each side can, independently from the other, send data at will.

The only part that can be unclear here is when a handshake is considered successful. While I'm sure the RFC expands on this later, the short version is that the server sends its response back and the client verifies that the Sec-WebSocket-Accept field has the correct value. Both hosts can calculate it from the Sec-WebSocket-Key the client sent.
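For the curious, the calculation is small enough to sketch. RFC 6455 has both sides append a fixed GUID to the key, hash the result with SHA-1, and base64-encode the digest. The function name compute_accept below is my own; the key and the expected value are the ones from the handshakes above.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def compute_accept(key: str) -> str:
    """Derive the Sec-WebSocket-Accept value from a Sec-WebSocket-Key."""
    digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


print(compute_accept("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

A client would run the same function on the key it sent and compare the result with the header it received.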

After a successful handshake, clients and servers transfer data back and forth in conceptual units referred to in this specification as "messages".

This seems intuitive enough based on what we read so far. Client and Server don't have to wait on one another for every message so they can just start sending them over.

On the wire, a message is composed of one or more frames.

This sounds reasonable. Messages don't have a predefined size: a client can send 20 bytes of data and the server might respond with 100 bytes. There's no limit that we know of up to this point. But since this is a well-defined protocol, messages can be split into frames, which are what's actually sent over the wire. We can assume that the receiving end will know how to put the frames back together to compose the full message. It would be a mess if information were exchanged in small predefined bites with no one helping put the pieces together. We don't want the end user to experience that, so the websocket implementation on each end will have to manage it.
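Here is a toy model of that idea, not the real wire format. The Frame class, the fin flag (loosely modeled on the "final fragment" bit the RFC introduces in its framing section) and the 32-byte limit are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    payload: bytes
    fin: bool  # True only on the last frame of a message (simplified)


def split_message(message: bytes, max_payload: int) -> list:
    """Cut a message into frames carrying at most max_payload bytes each."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    # An empty message still needs one (empty) frame to carry the fin flag.
    chunks = [message[i:i + max_payload]
              for i in range(0, len(message), max_payload)] or [b""]
    last = len(chunks) - 1
    return [Frame(chunk, fin=(i == last)) for i, chunk in enumerate(chunks)]


def reassemble(frames) -> bytes:
    """Concatenate payloads until the frame marked as final is seen."""
    buffer = bytearray()
    for frame in frames:
        buffer += frame.payload
        if frame.fin:
            return bytes(buffer)
    raise ValueError("message is incomplete: no final frame seen")


message = b"x" * 100
frames = split_message(message, 32)
print([len(f.payload) for f in frames])  # [32, 32, 32, 4]
print(reassemble(frames) == message)     # True
```

The receiver only hands the message to the application once the final frame arrives, which is why the end user never sees the pieces.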

The WebSocket message does not necessarily correspond to a particular network layer framing, as a fragmented message may be coalesced or split by an intermediary.

Aha! Our intuition was correct! Since the message might be split into smaller parts (a.k.a. frames), there won't be a 1:1 correspondence between the message and a frame at any particular network layer. Simply put, while at the application level (layer) the message is perceived as one unit, that doesn't mean it was sent over the wire that way. The second half of the sentence confirms the lack of a perfect match between a message and a frame: the fragments of a message might be combined (coalesced) into fewer frames, or divided (split) further into more of them. The "intermediary" mentioned can be anything that sits between the two endpoints and relays the traffic (e.g. a proxy), and it may rearrange the frames to optimize the transfer. While this is good to know, as a user you won't see any of it, since the reassembling is carried out "automagically".
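A tiny sketch of why this is harmless, assuming a made-up intermediary that re-cuts whatever it receives into 5-byte pieces (the refragment function and the sizes are hypothetical):

```python
def refragment(fragments: list, size: int) -> list:
    """Hypothetical intermediary: coalesce the fragments, then split them at a new size."""
    data = b"".join(fragments)
    return [data[i:i + size] for i in range(0, len(data), size)]


sent = [b"Hello, ", b"Web", b"Socket!"]
relayed = refragment(sent, 5)

print(relayed)                              # [b'Hello', b', Web', b'Socke', b't!']
print(b"".join(relayed) == b"".join(sent))  # True
```

The fragment boundaries changed completely, yet the message the application receives is byte-for-byte the same.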

A frame has an associated type.

Now that we know a message can be split, we can imagine that different kinds of data might be represented by different kinds of frames. For now, the RFC only mentions that types exist.

Each frame belonging to the same message contains the same type of data. Broadly speaking, there are types for textual data (which is interpreted as UTF-8 [RFC3629] text), binary data (whose interpretation is left up to the application), and control frames (which are not intended to carry data for the application but instead for protocol-level signaling, such as to signal that the connection should be closed).

Textual data can be a classic chat message. Binary data is pretty much anything else, and it's up to the application to decide how to present it to the end user (e.g. video, image). The last type is expected and handy for the protocol: this way it can notify the other side of anything, including a "Hey, we're closing shop here!". In HTTP/2 (if I remember correctly) the frame used as part of closing a connection is GOAWAY, which I thought was kinda funny.
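The text versus binary distinction is easy to see in a few lines of Python. This is only a sketch of the idea that a text payload must be valid UTF-8 while a binary payload is just bytes; the decode_text helper is hypothetical.

```python
def decode_text(payload: bytes) -> str:
    """Interpret a text payload as UTF-8; raises UnicodeDecodeError if it is not."""
    return payload.decode("utf-8")


text_payload = "héllo".encode("utf-8")
print(len("héllo"), len(text_payload))  # 5 6 (the accented letter takes two bytes)
print(decode_text(text_payload))        # héllo

binary_payload = bytes([0xFF, 0xD8, 0xFF])  # arbitrary bytes, meaning is up to the app
try:
    decode_text(binary_payload)
except UnicodeDecodeError:
    print("not valid UTF-8, so this could only travel as binary data")
```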

This version of the protocol defines six frame types and leaves ten reserved for future use.

I'm curious to know if this is just an upper bound of 16 types or if the other 10 are actually defined but unused (we'll find out later).
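A small spoiler for the impatient: peeking ahead at the framing section of RFC 6455, the frame type lives in a 4-bit opcode field, which gives 16 possible values. The sketch below lists the six opcodes that version of the protocol defines and counts what's left over. The Opcode enum is my own naming, not an API from any library.

```python
from enum import IntEnum


class Opcode(IntEnum):
    """The six frame types defined by RFC 6455 (names are my own)."""
    CONTINUATION = 0x0
    TEXT = 0x1
    BINARY = 0x2
    CLOSE = 0x8
    PING = 0x9
    PONG = 0xA


defined = {int(op) for op in Opcode}
# A 4-bit field can hold 16 values; whatever is not defined is reserved.
reserved = [value for value in range(16) if value not in defined]

print(len(defined), len(reserved))  # 6 10
print([hex(v) for v in reserved])
```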

Thoughts

The RFC is pretty digestible up to this point. Things get hairy later on, but our goal now is to get a broad understanding. When things get more complicated, I tend to break down each sentence and make sure I understand what everything means. There's no rush really and it's a good way to learn technical writing. Let's keep the ball rolling and move to the next section soon!
