Free-threaded Python¶

Free-threading is an experimental new Python feature that replaces the Global Interpreter Lock (GIL) with a fine-grained locking scheme to better leverage multi-core parallelism. The resulting benefits do not come for free: extensions must explicitly opt-in and generally require careful modifications to ensure correctness.

Nanobind can target free-threaded Python since version 2.2.0. This page explains how to do so and discusses a few caveats. Besides this page, make sure to review py-free-threading.github.io for a more comprehensive discussion of free-threaded Python. PEP 703 explains the nitty gritty details.

Opting in¶

To opt into free-threaded Python, pass the FREE_THREADED parameter to the nanobind_add_module() CMake target command. For other build systems, refer to their respective documentation pages.

nanobind_add_module(
  my_ext                   # Target name
  FREE_THREADED            # Opt into free-threading
  my_ext.h                 # Source code files below
  my_ext.cpp)

nanobind ignores the FREE_THREADED parameter when the registered Python version does not support free-threading.

Note

Stable ABI: Note that there currently is no stable ABI for free-threaded Python, hence the STABLE_ABI parameter will be ignored in free-threaded extensions builds. It is valid to combine the STABLE_ABI and FREE_THREADED arguments: the build system will choose between the two depending on the detected Python version.

Warning

Loading an Python extension that does not support free-threading disables free-threading globally. In larger binding projects with multiple extensions, all of them must be adapted.

If free-threading was requested and is available, the build system will set the NB_FREE_THREADED preprocessor flag. This can be helpful to specialize binding code with #ifdef blocks, e.g.:

#if !defined(NB_FREE_THREADED)
... // simple GIL-protected code
#else
... // more complex thread-aware code
#endif

Caveats¶

Free-threading can violate implicit assumptions made by extension developers when previously serial operations suddenly run concurrently, producing undefined behavior (race conditions, crashes, etc.).

Let’s consider a concrete example: the binding code below defines a Counter class with an increment operation.

struct Counter {
    int value = 0;
    void inc() { value++; }
};

nb::class_<Counter>(m, "Counter")
    .def("inc", &Counter::inc)
    .def_ro("value", &Counter::value);

If multiple threads call the inc() method of a single Counter, the final count will generally be incorrect, as the increment operation value++ does not execute atomically.

To fix this, we could modify the C++ type so that it protects its value member from concurrent modification, for example using an atomic number type (e.g., std::atomic<int>) or a critical section (e.g., based on std::mutex).

The race condition in the above example is relatively benign. However, in more complex projects, combinations of concurrency and unsafe memory accesses could introduce non-deterministic data corruption and crashes.

Another common source of problems are global variables undergoing concurrent modification when no longer protected by the GIL. They will likewise require supplemental locking. The next section explains a Python-specific locking primitive that can be used in binding code besides the solutions mentioned above.

Multi-threaded code that concurrently returns the same C++ instance via the nb::rv_policy::reference policy may observe situations, where multiple Python objects are created that all wrap the same C++ instance (however, this is harmless aside from the duplication).

Python locks¶

Nanobind provides convenience functionality encapsulating the mutex implementation that is part of Python (”PyMutex”). It is slightly more efficient than OS/language-provided synchronization primitives and generally preferable within Python extensions.

The class ft_mutex is analogous to std::mutex, and ft_lock_guard is analogous to std::lock_guard. Note that they only exist to add supplemental critical sections needed in free-threaded Python, while becoming inactive (no-ops) when targeting regular GIL-protected Python.

With these abstractions, the previous Counter implementation could be rewritten as:

struct Counter {
    int value = 0;
    nb::ft_mutex mutex;

    void inc() {
        nb::ft_lock_guard guard(mutex);
        value++;
    }
};

These locks are very compact (sizeof(nb::ft_mutex) == 1), though this is a Python implementation detail that could change in the future.

Argument locking¶

Modifying class and function definitions as shown above may not always be possible. As an alternative, nanobind also provides a way to retrofit supplemental locking onto existing code. The idea is to lock individual arguments of a function before being allowed to invoke it. A built-in mutex present in every Python object enables this.

To do so, call the .lock() member of nb::arg() annotations to indicate that an argument must be locked, e.g.:

nb::arg("my_parameter").lock()
"my_parameter"_a.lock() (short-hand form)

In methods bindings, pass nb::lock_self() to lock the implicit self argument. Note that at most 2 arguments can be locked per function, which is a limitation of the Python locking API.

The example below shows how this functionality can be used to protect inc() and a new merge() function that acquires two simultaneous locks.

struct Counter {
    int value = 0;

    void inc() { value++; }
    void merge(Counter &other) {
        value += other.value;
        other.value = 0;
    }
};

nb::class_<Counter>(m, "Counter")
    .def("inc", &Counter::inc, nb::lock_self())
    .def("merge", &Counter::merge, nb::lock_self(), "other"_a.lock())
    .def_ro("value", &Counter::value);

The above solution has an obvious drawback: it only protects bindings (i.e., transitions from Python to C++). For example, if some other part of a C++ codebase calls merge() directly, the binding layer won’t be involved, and no locking takes place. If such behavior can introduce race conditions, a larger-scale redesign of your project may be in order.

Note

Adding locking annotations indiscriminately is inadvisable because locked calls are more costly than unlocked ones. The .lock() and nb::lock_self() annotations are ignored in GIL-protected builds, hence this added cost only applies to free-threaded extensions.

Furthermore, when adding locking annotations to a function, consider keeping the arguments unnamed (i.e., nb::arg().lock() instead of nb::arg("name").lock()) if the function will never be called with keyword arguments. Processing named arguments causes small binding overheads that may be undesirable if a function that does very little is called at a very high rate.

Note

Python API and locking: When the lock-protected function performs Python API calls (e.g., using wrappers like nb::dict), Python may temporarily release locks to avoid deadlocks. Here, even basic reference counting such as a nb::object variable expiring at the end of a scope counts as an API call.

These locks will be reacquired following the Python API call. This behavior resembles ordinary (GIL-protected) Python code, where operations like Py_DECREF() can cause cause arbitrary Python code to execute. The semantics of this kind of relaxed critical section are described in the Python documentation.

Miscellaneous notes¶

API¶

The following API specific to free-threading has been added:

API stability¶

The interface explained in this is excluded from the project’s semantic versioning policy. Free-threading is still experimental, and API breaks may be necessary based on future experience and changes in Python itself.

Wrappers¶

Wrapper types like nb::list may be used in multi-threaded code. Operations like nb::list::append() internally acquire locks and behave just like their ordinary Python counterparts. This means that race conditions can still occur without larger-scale synchronization, but such races won’t jeopardize the memory safety of the program.

GIL scope guards¶

Prior to free-threaded Python, the nanobind scope guards gil_scoped_acquire and gil_scoped_release would normally be used to acquire/release the GIL and enable parallel regions.

These remain useful and should not be removed from existing code: while no longer blocking operations, they set and unset the current Python thread context and inform the garbage collector.

The gil_scoped_release RAII scope guard class plays a special role in free-threaded builds, since it releases all argument locks held by the current thread.

Immortalization¶

Python relies on a technique called reference counting to determine when an object is no longer needed. This approach can become a bottleneck in multi-threaded programs, since increasing and decreasing reference counts requires coordination among multiple processor cores. Python type and function objects are especially sensitive, since their reference counts change at a very high rate.

Similar to free-threaded Python itself, nanobind avoids this bottleneck by immortalizing functions (nanobind.nb_func, nanobind.nb_method) and type bindings. Immortal objects don’t require reference counting and therefore cannot cause the bottleneck mentioned above. The main downside of this approach is that these objects leak when the interpreter shuts down. Free-threaded nanobind extensions disable the internal leak checker, since it would produce many warning messages caused by immortal objects.

Internal data structures¶

Nanobind maintains various internal data structures that store information about instances and function/type bindings. These data structures also play an important role to exchange type/instance data in larger projects that are split across several independent extension modules.

The layout of these data structures differs between ordinary and free-threaded extensions, therefore nanobind isolates them from each other by assigning a different ABI version tag. This means that multi-module projects will need to consistently compile either free-threaded or non-free-threaded modules.

Free-threaded nanobind uses thread-local and sharded data structures to avoid lock and atomic contention on the internal data structures, which would otherwise become a bottleneck in multi-threaded Python programs.

Thread sanitizers¶

The thread sanitizer (TSAN) offers an effective way of tracking down undefined behavior in multithreaded application.

To use TSAN with nanonbind extensions, you must also create a custom Python build that has TSAN enabled. This is because nanobind internally builds on Python locks. If the implementation of the locks is not instrumented by TSAN, the tool will detect a large volume of false positives.

To make a TSAN-instrumented Python build, download a Python source release and to pass the following options to its configure script:

$ ./configure --disable-gil --with-thread-sanitizer <.. other options ..>