A Lock-Free Vector

July 26, 2023

The ideas laid out in this crate have since been implemented in the boxcar crate.

A while back while researching for seize, I came across a cool little idea for a concurrent vector in this paper. I'm not sure where the idea originated from, but it's also the basis for the concurrent vector in Intel TBB.

Introduction

Supporting remove in a concurrent setting is a particularly difficult problem. For one, synchronized removal requires some form of concurrent memory reclamation, but on top of that a concurrent vector must guarantee stable addresses for any references returned by get, so relocating to a new allocation is not an option. However, if the interface is limited to just get and push, an append-only vector, things become more interesting.

The basic design for a lock-free append-only vector can be illustrated with a simplified version that only supports a bounded number of elements.

struct Vector<T> {
    index: AtomicUsize,
    elements: Box<[Element<T>; 64]>
}

struct Element<T> {
    value: UnsafeCell<MaybeUninit<T>>,
    stored: AtomicBool,
}

impl<T> Vector<T> {
    fn get(&self, i: usize) -> Option<&T> {
        if self.elements[i].stored.load(Acquire) {
            let value = unsafe { (*self.elements[i].value.get()).assume_init_ref() };
            return Some(value);
        }

        None
    }

    fn push(&self, value: T) -> usize {
        let i = self.index.fetch_add(1, Relaxed);
        unsafe { self.elements[i].value.get().write(MaybeUninit::new(value)) }
        self.elements[i].stored.store(true, Release);
        i
    }
}

Writer threads acquire an index into the vector through a shared counter. The counter does create a single point of contention, but allows writes to be lock-free by letting threads write ahead even if the previous elements aren't yet initialized. Because of this, we can't rely on the counter to determine which elements are initialized, so entries need an additional atomic flag.

This approach is completely lock-free and allows for incredibly cheap reads.

Memory Layout

An unbounded concurrent vector needs a stable, yet growable backing storage. The most obvious solution is a linked list of fixed-size chunks.

struct Vector<T> {
    index: AtomicUsize,
    head: *mut Node<T>,
}

struct Node<T> {
    elements: [Element<T>; 64],
    next: AtomicPtr<Node<T>>,
}

struct Element<T> {
    value: UnsafeCell<MaybeUninit<T>>,
    stored: AtomicBool,
}

However, random access suffers heavily under this layout. Writes can be optimized by caching the tail node, but accessing a value at a given index still requires iterating through the entire list until the corresponding chunk is found, likely touching a number of unrelated cache lines.

A more promising idea is to pre-allocate an array of chunks, lazily allocating each chunk.

struct Vector<T> {
    index: AtomicUsize,
    chunks: [AtomicPtr<Element<T>>; 64],
}

struct Element<T> {
    value: UnsafeCell<MaybeUninit<T>>,
    stored: AtomicBool,
}

The question now becomes how many elements each chunk should store. Every chunk being the same size is infeasible without arbitrarily limiting the size of the vector or allocating unnecessarily large chunks, so they must grow in size. We can mimic the 2x growth factor of a vector by having each chunk be double the size of the previous one.

The beauty of this approach is that because chunks grow in powers of two, we can use ⌊log2(index)⌋ to map an arbitrary index to its corresponding chunk. This can be efficiently calculated using the leading zeros instruction.

struct Vector<T> {
    index: AtomicUsize,
    // chunks of size 1, 2, 4, 8, 16, ...
    chunks: [AtomicPtr<Element<T>>; 64],
}

fn get(&self, i: usize) -> Option<&T> {
    let chunk = usize::BITS - (i + 1).leading_zeros() - 1;
    let i = (i + 1) ^ (1 << chunk);

    let chunk = self.chunks[chunk as usize].load(Ordering::Acquire);
    if chunk.is_null() {
        return None;
    }

    let element = unsafe { &*chunk.add(i) };
    if !element.stored.load(Ordering::Acquire) {
        return None;
    }

    let value = unsafe { (*element.value.get()).assume_init_ref() };
    Some(value)
}

Write Concurrency

With this new layout, writers must deal with the fact that chunks are lazily allocated, so their allocation must be synchronized.

The simplest solution to this is to just use a Mutex. The lock only needs to be acquired during the transition into a new chunk, which should be a rare occurrence as the size of the vector increases.

Alternatively, a lock-free approach would let threads race to allocate. Any threads that lose the race deallocate their chunk and continue with the chunk that was written.

fn get_or_alloc(chunk: &AtomicPtr<Element<T>>, len: usize) -> *mut Entry<T> {
    let entries = alloc_chunk(len);
    match chunk.compare_exchange(
        ptr::null_mut(),
        entries,
        Ordering::Release,
        Ordering::Acquire,
    ) {
        Ok(_) => entries,
        Err(found) => unsafe {
            dealloc_chunk(entries, len);
            found
        },
    }
}

It turns out this approach doesn't scale very well in practice. Remember, it's not only the first element in a chunk that will attempt to allocate, it's also every subsequent writer that acquires an index until a chunk is written and visible to other threads. This means that under high write pressure, it's likely that every active writer will allocate its own chunk, only for all of them to be deallocated except the winner. This results in a sort of thundering herd problem, causing unnecessary allocator pressure and contention that is avoided with a lock.

There is a way to mitigate the issue. By having a single writer eagerly allocate the next chunk as it gets close to overflowing, writers can transition smoothly into the next chunk and avoid any contention.

pub fn push(&self, value: T) -> usize {
    // acquire an index
    let index = self.index.fetch_add(1, Ordering::Relaxed);
    let chunk = usize::BITS - (i + 1).leading_zeros() - 1;
    let chunk_size = 1 << chunk;

    // eagerly allocate the next chunk if we are close to the end of this one
    if index == (location.chunk_size - (location.chunk_size >> 3)) {
        if let Some(next_chunk) = self.chunks.get(location.chunk + 1) {
            alloc(next_chunk, location.chunk_size << 1);
        }
    }

    // ...
}

Additionally, chunk sizes can start at a number like 32 or 64 instead of 1, avoiding the unnecessary allocations and contention at smaller powers of two.