Remove python-gil requirement and enable free-threaded Python#2250
Remove python-gil requirement and enable free-threaded Python#2250ndgrigorian wants to merge 27 commits into
python-gil requirement and enable free-threaded Python#2250Conversation
dcac1f6 to
e450664
Compare
085aece to
5dda977
Compare
|
View rendered docs @ https://intelpython.github.io/dpctl/pulls/2250/index.html |
82e9a5e to
a1d36ce
Compare
|
@antonwolfy @vlad-perevezentsev |
6a5046b to
7d8bbc8
Compare
b610ee3 to
dd74214
Compare
1b70ce6 to
1bc9c3d
Compare
732f37d to
df6f452
Compare
vlad-perevezentsev
left a comment
There was a problem hiding this comment.
No more comments from my side
LGTM
Thank you @ndgrigorian
| // acquire gil to safely call into Python C API | ||
| py::gil_scoped_acquire acquire; | ||
|
|
||
| capi_ptr = new dpctl_capi(); |
There was a problem hiding this comment.
Does switching from a function-local static to a heap pointer means ~dpctl_capi() and the whole Deleter finalization guard (lines 170-186) are now dead code — the held Python objects leak?
I guess leaking a process singleton is a defensible no-GIL trade-off, but it bypasses machinery that was deliberately written for interpreter finalization.
Do we need to add an explicit comment stating it's intentional?
| self._state = _OrderManager(16) | ||
|
|
||
| def __dealloc__(self): | ||
| def __del__(self): |
There was a problem hiding this comment.
__dealloc__ was dead on a pure-Python class; renaming to __del__ makes it actually call SyclEvent.wait_for(...).
Under the free-threading, finalizers run on arbitrary threads and possibly during interpreter shutdown — combined with the existing weakref.finalize path, two finalizers now wait on events.
Confirm this is intended and that it's guarded against shutdown.
| self.__device_map__[key] = dev | ||
| return dev | ||
|
|
||
| def _update_map(self, dev_map): |
There was a problem hiding this comment.
_update_map is public + unlocked — safe today (only called on the not-yet-shared _copy), but a foot-gun if ever called on the live global. We probably might need to consider locking it internally or clear documenting.
There was a problem hiding this comment.
Or alternatively to revert back to cdef at least to make it inaccessible from the python code, i.e. to have more control when the method is called.
| def _update_map(self, dev_map): | ||
| self.__device_map__.update(dev_map) | ||
|
|
||
| def __copy__(self): |
There was a problem hiding this comment.
Seems not covered by any test
| self.__device_queue_map__[ctx_dev] = q | ||
| return q | ||
|
|
||
| def _update_map(self, dev_queue_map): |
There was a problem hiding this comment.
Same comment to _update_map
819f48a to
e536e7b
Compare
free-threaded builds use a new GC that skips PyGC_Head, and this seems to cause some objects to change in size by ~16 bytes
dpctl_capi singleton intiailization could cause deadlocks with updated order manager
e536e7b to
61e7697
Compare
This PR removes the required
python-gildependency from the dpctl conda package workflow and enables free-threaded Python in extension modulesAdjustments are made to the
SequentialOrderManagerclass such that the class is safe in free-threaded, including mutexes in the C++ class as a fall-back in case of (not recommended) simultaneous access to its members and methodsSequentialOrderManager now maintains thread-local storage for individual queue-to-manager-maps, such that each thread has its own manager per queue.
Queue and device caching, meanwhile, are global, to create a concept of default queues and devices that allows operations in extensions like dpnp (which rely on queues being the same for compute follows data) to operate on data passed between threads without copy.
In the futue, per-thread-queues and devices may prove more efficient, in which case, extensions will be asked to be made more robust (checking that context and device are the same, not using queue as a shortcut).
This PR builds on top of work already done removing the tensor submodule, which is pending migration to dpnp