
You are viewing a previous version of KrakenD Enterprise Edition (v2.4).

Document updated on Aug 22, 2022

WebSockets Integration

KrakenD Enterprise supports communications using the WebSocket Protocol (RFC-6455) to enable two-way communication between a client and a backend host through the API gateway. This technology provides a mechanism for browser-based applications that need two-way communication with servers without relying on opening multiple HTTP connections.

KrakenD multiplexes connections. Each individual end client (e.g., desktop, mobile device) establishes a connection directly with the gateway, and KrakenD opens a single channel with the backend host to handle all its connected clients.

The backend decides whether to send the responses to a specific set of clients or to broadcast them to everyone. This decision depends on the message format the backend uses.

WebSockets configuration

The configuration to enable WebSockets is straightforward; the only requirements are to include the websocket namespace at the endpoint level and to declare at least one backend host using the ws:// or wss:// schemes.

For each endpoint, KrakenD opens a single connection against one of the hosts. The hosts are load balanced in a Round-Robin fashion, but once the session is established, it is kept permanently.

The backend also requires the flag "disable_host_sanitize": true.

Here is an example:

{
    "endpoint": "/ws/{room}",
    "input_query_strings": ["*"],
    "input_headers": ["*"],
    "backend": [
        {
            "url_pattern": "/ws",
            "disable_host_sanitize": true,
            "host": [
                "ws://localhost:8081",
                "ws://localhost:8082"
            ]
        }
    ],
    "extra_config": {
        "websocket": {
            "input_headers": [
                "Cookie",
                "Authorization"
            ],
            "connect_event": true,
            "disconnect_event": true,
            "read_buffer_size": 4096,
            "write_buffer_size": 4096,
            "message_buffer_size": 4096,
            "max_message_size": 3200000,
            "write_wait": "10s",
            "pong_wait": "60s",
            "ping_period": "54s",
            "max_retries": 0,
            "backoff_strategy": "exponential"
        }
    }
}
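The requirements listed above (the websocket namespace, at least one ws:// or wss:// host, disable_host_sanitize on the backend, and websocket input_headers also declared at the endpoint level) lend themselves to a pre-flight check before deploying. The following is a minimal sketch, assuming you load the endpoint object as JSON; check_websocket_endpoint is a hypothetical helper, not part of KrakenD, and it compares header names by exact match.

```python
import json

# The endpoint example above, trimmed and embedded as a string so the check is self-contained.
CONFIG = """
{
  "endpoint": "/ws/{room}",
  "input_query_strings": ["*"],
  "input_headers": ["*"],
  "backend": [
    {
      "url_pattern": "/ws",
      "disable_host_sanitize": true,
      "host": ["ws://localhost:8081", "ws://localhost:8082"]
    }
  ],
  "extra_config": {
    "websocket": {"input_headers": ["Cookie", "Authorization"]}
  }
}
"""


def check_websocket_endpoint(endpoint: dict) -> list:
    """Return the problems found; an empty list means every check passed.

    Hypothetical pre-flight helper, not part of KrakenD.
    """
    problems = []
    ws_config = endpoint.get("extra_config", {}).get("websocket")
    if ws_config is None:
        problems.append("missing extra_config.websocket namespace")
        ws_config = {}
    backends = endpoint.get("backend", [])
    hosts = [host for backend in backends for host in backend.get("host", [])]
    if not any(host.startswith(("ws://", "wss://")) for host in hosts):
        problems.append("no backend host uses ws:// or wss://")
    for backend in backends:
        if backend.get("disable_host_sanitize") is not True:
            problems.append(f"backend {backend.get('url_pattern')} lacks disable_host_sanitize: true")
    # Headers allowed by the websocket namespace must also pass the endpoint-level filter.
    endpoint_headers = endpoint.get("input_headers", [])
    if "*" not in endpoint_headers:
        for header in ws_config.get("input_headers", []):
            if header not in endpoint_headers:
                problems.append(f"header {header} not declared at endpoint level")
    return problems


print(check_websocket_endpoint(json.loads(CONFIG)))  # prints []
```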

All the fields inside websocket are optional, so you can declare an empty object "websocket": {}. The available options are:

Fields of the WebSockets schema definition

backoff_strategy string
When the connection to your event source gets interrupted for whatever reason, KrakenD keeps trying to reconnect until it succeeds or until it reaches the max_retries. The backoff strategy defines the delay in seconds between consecutive failed retries.
Possible values are: "linear", "linear-jitter", "exponential", "exponential-jitter", "fallback"
Defaults to "fallback"
connect_event boolean
Whether to send notification events to the backend when a user establishes a new WebSockets connection.
Defaults to false
disconnect_event boolean
Whether to send notification events to the backend when users disconnect from their WebSockets connection.
Defaults to false
input_headers array
Defines which input headers are allowed to pass to the backend. Notice that you need to declare the input_headers at the endpoint level too.
Defaults to []
max_message_size integer
Sets the maximum size in bytes of messages sent by or returned to the client. KrakenD discards messages larger than this value and disconnects the client.
Defaults to 512
max_retries integer
The maximum number of times you will allow KrakenD to retry reconnecting to a broken messaging system. Use a value <= 0 for unlimited retries.
Defaults to 0
message_buffer_size integer
Sets the maximum number of messages each client can have in the buffer waiting to be processed. As this is a per-client setting, you must forecast how many consumers of KrakenD WebSockets you will have. The default value may cause high memory consumption if you expect thousands of clients consuming simultaneously.
Defaults to 256
ping_period string
Sets the time between pings checking the health of the system.
Specify units using ns (nanoseconds), us or µs (microseconds), ms (milliseconds), s (seconds), m (minutes), or h (hours).
Defaults to "54s"
pong_wait string
Sets the maximum time KrakenD will wait until the pong times out.
Specify units using ns (nanoseconds), us or µs (microseconds), ms (milliseconds), s (seconds), m (minutes), or h (hours).
Defaults to "60s"
read_buffer_size integer
Connections buffer network input and output to reduce the number of system calls when reading messages. You can set the maximum buffer size for reading in bytes.
Defaults to 1024
return_error_details boolean
Returns an error {"error":"reason here"} to the client when KrakenD is unable to send the message to the backend.
Defaults to false
write_buffer_size integer
Connections buffer network input and output to reduce the number of system calls when writing messages. You can set the maximum buffer size for writing in bytes.
Defaults to 1024
write_wait string
Sets the maximum time KrakenD will wait until the write times out.
Specify units using ns (nanoseconds), us or µs (microseconds), ms (milliseconds), s (seconds), m (minutes), or h (hours).
Defaults to "10s"
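The duration fields (ping_period, pong_wait, write_wait) take a number followed by one of the units listed above. The following is a simplified Python sketch that converts such strings to seconds and checks that ping_period stays below pong_wait, as the defaults do (54s and 60s). It assumes, as in common WebSocket keep-alive designs, that the pong deadline is measured from the previous pong, so pinging less often than pong_wait would let healthy connections time out. parse_duration and check_keepalive are hypothetical helpers, not KrakenD functions, and the parser does not handle signs.

```python
import re

# Seconds per unit. Both the micro sign (U+00B5) and the Greek mu (U+03BC) are accepted.
_UNITS = {
    "ns": 1e-9, "us": 1e-6, "µs": 1e-6, "μs": 1e-6,
    "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0,
}
# "ms" is listed before "m" and "s" so that "500ms" is not read as minutes.
_PART = re.compile(r"(\d+(?:\.\d*)?|\.\d+)(ns|us|µs|μs|ms|s|m|h)")


def parse_duration(text: str) -> float:
    """Convert a duration such as "54s" or "1m30s" to seconds (hypothetical helper)."""
    pos, total = 0, 0.0
    while pos < len(text):
        match = _PART.match(text, pos)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        total += float(match.group(1)) * _UNITS[match.group(2)]
        pos = match.end()
    if pos == 0:
        raise ValueError("empty duration")
    return total


def check_keepalive(ping_period: str = "54s", pong_wait: str = "60s") -> bool:
    """True when pings are sent more often than the pong deadline (hypothetical check)."""
    return parse_duration(ping_period) < parse_duration(pong_wait)


print(parse_duration("1m30s"), check_keepalive())  # prints 90.0 True
```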

Broadcast vs. directed communication

No matter how many clients connect to KrakenD through WebSockets, KrakenD keeps one connection with the backend server per endpoint. So, for instance, you might have 1000 concurrent users in a chat room (endpoint /chat) with 1000 sockets opened against KrakenD, but KrakenD still communicates with your backend using one channel.

Depending on your business type, you might have two different needs to communicate with the users connected to the gateway:

  • Send a message to all your clients (broadcast)
  • Send a message only to some user(s) (directed communication)

By default, unrecognized messages received by KrakenD are broadcast to all clients. However, the message format allows you to specify how you would like KrakenD to process your backend responses (broadcast to all connected users or just to a specific session).

Message format

The message format is the mechanism that KrakenD uses to determine whether a message must be broadcast or addressed to a specific set of clients.

The recognized message format for bidirectional communication is a JSON object with the following fields:

  • body: Mandatory. A base64 representation of the content.
  • session: The session is a filter to determine who is the sender or the receiver of a message.
  • url: The affected KrakenD endpoint.

KrakenD message for the backend

When the client interacts with KrakenD, the gateway sends messages to the backend with a structure like the following:

{
    "url": "/chat/krakend",
    "session": {
        "uuid": "0b251b07-5611-49e5-b69f-cf2cb8d339d6",
        "Room": "krakend"
    },
    "body": "SGVsbG8gV29ybGQh"
}

The body contains what the client sent ("Hello World!", base64 encoded).

The rest of the fields are contextual metadata for your convenience, so the backend can determine who is doing the call and from which originating endpoint.

Essential observations on the example above are:

  • The body is base64 encoded.
  • The session contains information about the client making the request. You always receive at least a uuid, randomly assigned by KrakenD to the client when a new session starts.
  • If your endpoint contains placeholders (e.g., /chat/{room}), the placeholder parameters are available under session, with their first letter uppercased. In this example, Room holds the value of {room}.
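On the backend side, handling a message from KrakenD amounts to parsing the JSON envelope, base64-decoding the body, and reading the session metadata. The following is a minimal Python sketch using only the standard library; decode_krakend_message and placeholder_key are hypothetical helper names, and placeholder_key simply applies the first-letter-uppercased convention described above.

```python
import base64
import json


def placeholder_key(name: str) -> str:
    """Session key for an endpoint placeholder, e.g. "room" -> "Room"."""
    return name[:1].upper() + name[1:]


def decode_krakend_message(raw: str) -> tuple:
    """Split a KrakenD envelope into (uuid, url, session, decoded body)."""
    msg = json.loads(raw)
    session = msg.get("session", {})
    body = base64.b64decode(msg["body"])  # body is mandatory and base64 encoded
    return session.get("uuid", ""), msg.get("url", ""), session, body


example = """{
    "url": "/chat/krakend",
    "session": {"uuid": "0b251b07-5611-49e5-b69f-cf2cb8d339d6", "Room": "krakend"},
    "body": "SGVsbG8gV29ybGQh"
}"""
uuid, url, session, body = decode_krakend_message(example)
print(url, session[placeholder_key("room")], body)  # prints /chat/krakend krakend b'Hello World!'
```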

Backend message for KrakenD

The response from your WS backend must use the expected format unless you want it broadcast.

The body is mandatory, and KrakenD forwards it to all connected clients matching the filters you pass. The filters are a combination of the session and the url; if you don't pass any filters, the message is broadcast.

If, for instance, you only want to communicate with a specific user, you would have an answer like this one:

{
    "session": { "uuid": "0b251b07-5611-49e5-b69f-cf2cb8d339d6"},
    "body": "SGVsbG8gV29ybGQh"
}

If you want to communicate with all users connected to an endpoint, then you could use:

{
    "url": "/chat/krakend",
    "body": "SGVsbG8gV29ybGQh"
}

And to broadcast to everyone:

{
    "body": "SGVsbG8gV29ybGQh"
}

Notice that when the JSON fields url or session exist, the body is sent to the specific subgroup instead of being broadcast.
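The three reply shapes above differ only in which filter fields are present. The following Python sketch builds such replies and, to reason about who receives them, models the filtering with a hypothetical rule: a client matches when its url equals the reply's url (if given) and every field in the reply's session equals the client's value. make_reply and recipients are illustrative names; recipients is a model for reasoning, not KrakenD's implementation.

```python
import base64
import json
from typing import Optional


def make_reply(payload: bytes, session: Optional[dict] = None, url: Optional[str] = None) -> str:
    """Build a backend-to-KrakenD message; no session and no url means broadcast."""
    msg = {"body": base64.b64encode(payload).decode("ascii")}  # body is mandatory
    if session:
        msg["session"] = session  # e.g. {"uuid": ...} to target a single client
    if url:
        msg["url"] = url  # restrict delivery to clients of one endpoint
    return json.dumps(msg)


def recipients(reply: str, clients: list) -> list:
    """Hypothetical model of the filtering: return the uuids of matching clients."""
    msg = json.loads(reply)
    wanted_url = msg.get("url")
    wanted_session = msg.get("session", {})
    chosen = []
    for client in clients:
        if wanted_url is not None and client["url"] != wanted_url:
            continue
        if any(client["session"].get(key) != value for key, value in wanted_session.items()):
            continue
        chosen.append(client["session"]["uuid"])
    return chosen


print(make_reply(b"Hello World!", url="/chat/krakend"))
# prints {"body": "SGVsbG8gV29ybGQh", "url": "/chat/krakend"}
```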

Handshaking

The WS protocol consists of an opening handshake followed by basic message framing layered over TCP.

The handshake process is straightforward but necessary for KrakenD to determine whether the backend server is alive. KrakenD manages it: it opens a WebSocket against the backend server and waits for a response (similar to a ping/pong). We can simplify it as follows:

krakend => {"msg":"KrakenD WS proxy starting"}
backend => OK

After the successful handshake, KrakenD is ready to start serving and communicating with WS.

Replying with an ‘OK’ is mandatory
The WebSocket server must reply with an OK string. KrakenD requires this string to make sure you are aware that a multiplexed connection requires you to handle the message envelope from then on.
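The handshake rule above can be sketched as a small piece of per-connection state in the backend: answer the first text frame with "OK", then treat every later frame as an envelope. The following Python sketch assumes, based on the simplified exchange above, that the first frame KrakenD sends on the channel is the handshake; BackendSession and on_text are hypothetical names, and the transport (the actual WebSocket server) is left out so the logic stays self-contained.

```python
import json


class BackendSession:
    """Hypothetical backend-side handler for KrakenD's single multiplexed channel."""

    def __init__(self) -> None:
        self.handshake_done = False

    def on_text(self, frame: str) -> str:
        if not self.handshake_done:
            # Assumption: the first frame is the handshake; KrakenD expects "OK".
            self.handshake_done = True
            return "OK"
        # Every later frame is an envelope: reply only to the sending client.
        envelope = json.loads(frame)
        reply = {"session": {"uuid": envelope["session"]["uuid"]}, "body": envelope["body"]}
        return json.dumps(reply)


backend = BackendSession()
print(backend.on_text('{"msg":"KrakenD WS proxy starting"}'))  # prints OK
```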

Backoff strategies

The backoff_strategy setting defines how KrakenD keeps trying to reconnect to the backend until it succeeds or until it reaches the max_retries. The backoff strategy defines the delay in seconds between consecutive failed retries. Defaults to fallback:

  • linear: The delay time (d) grows linearly after each failed retry (r) using the formula d = r. E.g., 1st failure retries in 1s, 2nd failure in 2s, 3rd in 3s, and so on.
  • linear-jitter: Similar to linear but adds or subtracts a random number: d = r ± random. The randomness prevents all agents connected to a mutual service from retrying simultaneously, as all have a slightly different delay. The random number never exceeds ±r*0.33.
  • exponential: Multiplicatively increases the time between retries using d = 2^r. E.g., 2s, 4s, 8s, 16s.
  • exponential-jitter: Same as exponential, but adds or subtracts a random number up to 33% of the value using d = 2^r ± random. This is the preferred strategy when you want to protect the system you are consuming.
  • fallback: When the strategy is missing or is none of the above (e.g., fallback), KrakenD uses a constant backoff strategy d = 1, retrying after one second every time.
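The formulas above, together with the max_retries rule (a value <= 0 means unlimited retries), can be written out as a small Python sketch. backoff_delay and should_retry are hypothetical helpers that only reproduce the documented formulas with r as the 1-based retry number; KrakenD's own code may differ in details such as the jitter distribution, which is assumed uniform here.

```python
import random


def backoff_delay(strategy: str, retry: int, rng: random.Random) -> float:
    """Delay in seconds before retry number `retry` (1-based)."""
    if strategy == "linear":
        return float(retry)  # d = r
    if strategy == "linear-jitter":
        # d = r ± random, with the random part bounded by r * 0.33 (assumed uniform)
        return retry + rng.uniform(-0.33 * retry, 0.33 * retry)
    if strategy == "exponential":
        return float(2 ** retry)  # d = 2^r
    if strategy == "exponential-jitter":
        base = 2 ** retry  # d = 2^r ± up to 33% of the value
        return base + rng.uniform(-0.33 * base, 0.33 * base)
    return 1.0  # "fallback" or any unrecognized value: constant one-second delay


def should_retry(retries_done: int, max_retries: int) -> bool:
    """max_retries <= 0 retries forever; otherwise stop after max_retries retries."""
    return max_retries <= 0 or retries_done < max_retries


rng = random.Random(42)  # seeded only to make jittered runs repeatable
print([backoff_delay("exponential", r, rng) for r in range(1, 5)])  # prints [2.0, 4.0, 8.0, 16.0]
```

With max_retries=1, should_retry allows exactly one retry, which matches the single ERROR followed by a CRITICAL in the example log at the end of this page.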

Understanding WebSockets logs

WebSocket connections are stateful and long-lived. Therefore, there are implications to be aware of when connectivity issues or downtimes arise.

Generally speaking, you can read the different levels of errors as:

  • WARNING: There are connectivity issues with the backend
  • ERROR: There are problems renegotiating the connection
  • CRITICAL: The WebSocket connection is lost for good

WS Issues during startup

This is the most visible problem of all. If, for whatever reason, the WebSocket on the backend server is not available during KrakenD startup, KrakenD won't start. This event shows in the console with a CRITICAL message like this one:

▶ CRITICAL [SERVICE: Websocket] websocket.Dial ws://localhost:8081/ws: dial tcp [::1]:8081: connect: connection refused

If KrakenD relies on WebSocket communication, ensure that the backends are alive when KrakenD starts; otherwise, the gateway won't start.

WS Issues during operation

If, on the other hand, the handshake succeeds but the backend server or the network connection with it dies later, the affected endpoint becomes non-operational.

All clients connected to KrakenD during the downtime of your backend’s WebSocket keep their connection with KrakenD, even though KrakenD cannot pass any data from/to the backend server.

KrakenD will keep retrying a broken connection as defined through max_retries and backoff_strategy. While KrakenD is disconnected from the backend, the log will show WARNING messages when clients demand information from it, for instance:

▶ KRAKEND WARNING: [SERVICE: Websocket][Client] Reading from the connection: EOF

KrakenD keeps trying to fix this problem according to the backoff_strategy, and for each failed retry it shows an ERROR in the log:

▶ KRAKEND ERROR: [SERVICE: Websocket][Client] Unable to renew the connection: websocket.Dial ws://localhost:8888/ws: dial tcp 127.0.0.1:8888: connect: connection refused

While the connection with the backend is retrying, all writes remain in the queue.

If you have set a limited number of max_retries (greater than 0), once KrakenD exhausts all the retries, it stops trying and forgets the WebSocket connection. You can see this state as a CRITICAL in the logs.

▶ KRAKEND CRITICAL: [SERVICE: Websocket][Client] Unable to reconnect to the backend: websocket.Dial ws://localhost:8888/ws: dial tcp 127.0.0.1:8888: connect: connection refused

In addition, after the CRITICAL message, all remaining queued messages, as well as new ones, show an error:

▶ KRAKEND ERROR: [SERVICE: Websocket] Writing request: empty connection

The client will receive an error too:

{"error":"empty connection"}

At this point, KrakenD stops trying, and you must restart the service. Alternatively, set max_retries to 0 to keep trying indefinitely.

Another log you can see is:

KRAKEND WARNING: [SERVICE: Websocket][Client] Reading from the connection: websocket: read limit exceeded

This log appears when the client or the backend sends a message larger than the configuration permits (max_message_size). The offender receives a close 1009 (message too big) followed by a disconnect.

Example of a failing WebSocket with max_retries=1

The following is an example log of a WebSocket that failed and couldn't reconnect on the single retry allowed in the configuration (max_retries=1):

▶ KRAKEND WARNING: [SERVICE: Websocket][Client] Reading from the connection: EOF
▶ KRAKEND ERROR: [SERVICE: Websocket][Client] Unable to renew the connection: websocket.Dial ws://localhost:8888/ws: dial tcp 127.0.0.1:8888: connect: connection refused
▶ KRAKEND CRITICAL: [SERVICE: Websocket][Client] Unable to reconnect to the backend: websocket.Dial ws://localhost:8888/ws: dial tcp 127.0.0.1:8888: connect: connection refused
▶ KRAKEND ERROR: [SERVICE: Websocket] Writing request: empty connection
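To monitor these states, you could scan the gateway log and keep the most severe level seen for the WebSocket service, using the meanings listed under Understanding WebSockets logs. The following is a minimal Python sketch, assuming log lines look like the examples above (a level word followed by the [SERVICE: Websocket] tag); websocket_state and LEVEL_MEANING are hypothetical names, and the log format may vary between versions.

```python
import re

# Meaning of each level, as described in "Understanding WebSockets logs".
LEVEL_MEANING = {
    "WARNING": "connectivity issues with the backend",
    "ERROR": "problems renegotiating the connection",
    "CRITICAL": "WebSocket connection lost for good",
}
_ORDER = ["WARNING", "ERROR", "CRITICAL"]  # least to most severe
_LEVEL = re.compile(r"\b(WARNING|ERROR|CRITICAL)\b")


def websocket_state(log_lines: list) -> str:
    """Return the meaning of the most severe WebSocket level found in the log."""
    worst = -1
    for line in log_lines:
        if "[SERVICE: Websocket]" not in line:
            continue  # ignore lines from other services
        match = _LEVEL.search(line)
        if match:
            worst = max(worst, _ORDER.index(match.group(1)))
    return LEVEL_MEANING[_ORDER[worst]] if worst >= 0 else "no WebSocket issues logged"
```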
