January 10, 2022

Concurrency With Threads

This article is part of the series Async Rust From The Ground Up.

In the previous post in this series, we wrote a simple web server. Here's the code as a quick refresher:

use std::net::{TcpListener, TcpStream};
use std::io::{BufRead, BufReader, Write};

fn main() {
    let listener = TcpListener::bind("localhost:3000").expect("failed to create TCP listener");

    for connection in listener.incoming() {
        let stream = connection.expect("client connection failed");
        handle_connection(stream)
    }
}

fn handle_connection(mut stream: TcpStream) {
    let mut reader = BufReader::new(&mut stream);

    let mut request = Vec::new();
    reader
        .read_until(b'\n', &mut request)
        .expect("failed to read from stream");

    let request = String::from_utf8(request).expect("malformed request line");
    print!("HTTP request line: {}", request);

    let response = concat!(
        "HTTP/1.1 200 OK\r\n",
        "Content-Length: 12\n",
        "Connection: close\r\n\r\n",
        "Hello world!"
    );

    stream
        .write(response.as_bytes())
        .expect("failed to write to stream");
    stream.flush().expect("failed to flush stream");
}

There's a problem with our current implementation, we can only hande one request at a time. Reading and writing from/to the connection isn't instantaneous; there's a lot of infrastructure between us and the user. What would happen if two users made a request to our server at the same time, or ten, or ten thousand?

This might not be a big issue right now because we aren't really doing any work for each connection. But imagine we were. A typical web server might make a couple of database queries, an HTTP request, or read some data from a file. With our server, every user has to wait their turn until we're finished with the user who came before them. Connections will just keep piling up. That doesn't make for a good user experience. And what if users had to keep connections open indefinitely, like in a chat app? A chat app where only one person can chat at a time doesn't sound very useful.

We need to be able to handle multiple requests concurrently (at the same time). To do that, we'll use multithreading.

Hardware Threads

When you think of a thread, you might think of the kind your CPU has 4/8/16 of. These are known as hardware threads, and you usually have two of these per CPU core. As you probably know, CPU cores are physical components that can execute instructions independently. If you have 8 cores, you can truly run 8 things at the same time. Threads however, are virtual units of execution mapped onto cores. Having 16 threads means that the two independent tasks can run on each core. They might not always run at the exact same time, but the CPU will handle scheduling them efficiently. For example, if both threads need different parts of the CPU, they can both run freely. Or if one thread is waiting for a response from some other part of the machine, the other can run in the meantime.

So CPU cores let us run multiple things at the same time, and hardware threads let us efficiently schedule a few things to run on the same core. But how do we actually access threads or cores from our code? And what if we need to run more than 16 tasks?

Operating System Threads

Hardward threads are low level constructs specific to the CPU, you don't generally interact with them directly. Instead, you use the operating system's abstraction over threading. OS threads are generally very lightweight. You can create hundreds or thousands of them, and the operating system will handle scheduling them fairly across CPUs. For example, if one thread calls connection.read and the data is not immediately ready, the OS will switch to running another thread. When the data comes back, the OS will know to switch back to the first thread.

While you can explicitly signal to the OS to run a different thread, like calling read or thread::sleep, the OS scheduler is preemptive. It will recognize threads that have been running for too long and pause them, giving other threads a chance to run.

"Too long" here is in the order of milliseconds. This constant switching is what allows you to run so many programs running on your computer, seemingly at the same time.

 cpu1 cpu2 cpu3 cpu4
 ---- ---- ---- ---- 
| t1 | t3 | t5 | t7 |
|    |____|    |____|
|    |    |____|    |
|____| t4 |    | t8 |
|    |    | t6 |____|
| t2 |____|    |    |
|    | t3 |    | t7 |
|    |    |    |    |

Modifying Our Server

Every program starts with a single thread, the main thread. We can spawn more with std::thread::spawn. Because we want to be able to handle multiple requests concurrently, we'll spawn each request handler onto a new thread:

fn main() {
    let listener = TcpListener::bind("localhost:3000").expect("failed to create TCP listener");

    for connection in listener.incoming() {
        let stream = connection.expect("client connection failed");
        std::thread::spawn(|| handle_connection(stream));
    }
}

The spawn function returns immediately after creating a new thread, running the provided closure there instead of on the main thread. With this simple change, we can now serve many requests at the same time. While one thread is waiting to read, write, or do anything else, others that are ready will be allowed to run. And because the OS scheduler is preemptive, our server will never be blocked on a request that is taking a long time.

This solution works well because our server is I/O bound. We aren't necessarily doing anything between the time we ask to read from the connection, and we get the data back. We're just waiting, limited by the speed of input/output of data across the network. A CPU bound application on the other hand is limited by the processor, because it's constantly doing work. The only way to make it go faster is to add more CPU cores or distribute the work across multiple machines.

In general, the time it takes to serve a web request is mostly spent waiting for other tasks to complete (database queries, HTTP requests, etc.). Multithreading works great because we can utilize that time to handle other requests.

Benchmarks

Let's see how much faster are server really is now! To make things more realistic, we'll sleep for 100 milliseconds inside the connection handler, simulating a datbase request:

fn handle_connection(mut stream: TcpStream) {
    // do some very important work
    std::thread::sleep(Duration::from_millis(100));

    // ...
}

We can benchmark our server using the wrk tool. wrk will send requests to our server and measure how long it takes before it receives a response. We'll tell it to use 12 threads to make requests, and keep 500 connections open at a time.

# single-threaded

$ wrk -t12 -c500 -d20s http://localhost:3000
Running 20s test @ http://localhost:3000
  12 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     8.84s     4.31s   13.16s    78.00%
    Req/Sec     3.18      2.71    10.00     83.50%
  200 requests in 20.10s, 13.48KB read
Requests/sec:      9.95
Transfer/sec:     686.70B
# multithreaded

$ wrk -t12 -c500 -d20s http://localhost:3000
Running 20s test @ http://localhost:3000
  12 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   100.25ms    6.61ms 307.25ms   99.67%
    Req/Sec   400.82     51.15   420.00     95.83%
  95940 requests in 20.08s, 6.31MB read
Requests/sec:   4778.29
Transfer/sec:    321.97KB

The significant numbers above are average latency and throughput. Latency is the total amount of time that it takes for the server to receive a request and return the response. Throughput measures how many requests are responded to in a given period of time.

The single-threaded version was able to handle 10 requests per second, taking on average 10 seconds to respond. On the other hand, the multithreaded server handled almost 5000 requests per second, and latency was just over 100 milliseconds. Given that we sleep for 100 milliseconds inside the handler, that's pretty good :)

While microbenchmarks generally aren't a very good way to measure performance, and you should always measure your specific use case, I think it's safe to say the multithreaded version is much much better.

Next Steps

We covered a lot in this post, CPU cores, hardware threads, operating threads, preemptive scheduling, and I/O bound vs. CPU bound applications. We were able to significantly improve the performance of our application by making it multithreaded. In the next post, we'll go over some improvements we can make to this approach by writing a threadpool.