"Cannot deserialize object" error

Hello!

I am seeing this error in the driver log for Dask on Ray workloads on a Ray cluster. Any idea what it means? It happens on a subset of the dataset in our PB data workloads. I could try to put together a reproducible example if necessary, but I wanted to get a general sense of what the error means first. Thanks!


(pid=936, ip=10.0.209.100) 2021-06-16 22:31:22,368      ERROR serialization.py:251 -- Can't deserialize object: ObjectRef(4de37b2f36cebec8ffffffffffffffffffffffff0100000001000000), metadata: b'\xc6PYTHO'
(pid=936, ip=10.0.209.100) Traceback (most recent call last):
(pid=936, ip=10.0.209.100)   File "/usr/local/lib/python3.7/site-packages/ray/serialization.py", line 204, in _deserialize_object
(pid=936, ip=10.0.209.100)     error_type = int(metadata_fields[0])
(pid=936, ip=10.0.209.100) ValueError: invalid literal for int() with base 10: b'\xc6PYTHO'
(pid=936, ip=10.0.209.100)
(pid=936, ip=10.0.209.100) During handling of the above exception, another exception occurred:
(pid=936, ip=10.0.209.100)
(pid=936, ip=10.0.209.100) Traceback (most recent call last):
(pid=936, ip=10.0.209.100)   File "/usr/local/lib/python3.7/site-packages/ray/serialization.py", line 249, in deserialize_objects
(pid=936, ip=10.0.209.100)     obj = self._deserialize_object(data, metadata, object_ref)
(pid=936, ip=10.0.209.100)   File "/usr/local/lib/python3.7/site-packages/ray/serialization.py", line 206, in _deserialize_object
(pid=936, ip=10.0.209.100)     raise Exception(f"Can't deserialize object: {object_ref}, "
(pid=936, ip=10.0.209.100) Exception: Can't deserialize object: ObjectRef(4de37b2f36cebec8ffffffffffffffffffffffff0100000001000000), metadata: b'\xc6PYTHO'
(pid=21761, ip=10.0.219.77) 2021-06-16 22:38:02,608     ERROR serialization.py:251 -- Can't deserialize object: ObjectRef(20a01508c02b6a05ffffffffffffffffffffffff0100000001000000), metadata: b'\x1e\x15\xbf\x00\x04\x15'
(pid=21761, ip=10.0.219.77) Traceback (most recent call last):
(pid=21761, ip=10.0.219.77)   File "/usr/local/lib/python3.7/site-packages/ray/serialization.py", line 204, in _deserialize_object
(pid=21761, ip=10.0.219.77)     error_type = int(metadata_fields[0])
(pid=21761, ip=10.0.219.77) ValueError: invalid literal for int() with base 10: b'\x1e\x15\xbf\x00\x04\x15'
(pid=21761, ip=10.0.219.77)
(pid=21761, ip=10.0.219.77) During handling of the above exception, another exception occurred:
(pid=21761, ip=10.0.219.77)
(pid=21761, ip=10.0.219.77) Traceback (most recent call last):
(pid=21761, ip=10.0.219.77)   File "/usr/local/lib/python3.7/site-packages/ray/serialization.py", line 249, in deserialize_objects
(pid=21761, ip=10.0.219.77)     obj = self._deserialize_object(data, metadata, object_ref)
(pid=21761, ip=10.0.219.77)   File "/usr/local/lib/python3.7/site-packages/ray/serialization.py", line 206, in _deserialize_object
(pid=21761, ip=10.0.219.77)     raise Exception(f"Can't deserialize object: {object_ref}, "
(pid=21761, ip=10.0.219.77) Exception: Can't deserialize object: ObjectRef(20a01508c02b6a05ffffffffffffffffffffffff0100000001000000), metadata: b'\x1e\x15\xbf\x00\x04\x15'

It seems like there was some corruption when serializing the original object reference. I am not aware of any known issue. cc @suquark have you seen this error before?

Yes, there is some memory corruption happening (the object metadata was corrupted, which means some threads are writing to the wrong memory addresses). I think I have seen this before, but I cannot recall where. A simple reproducible script would be helpful.
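
To illustrate why corrupted metadata surfaces as the ValueError in the traceback above, here is a minimal sketch. It is not Ray's actual code: `parse_error_type` and `try_parse` are hypothetical helpers, and splitting the metadata on a comma is an assumption. The only part taken from the log is the `int(metadata_fields[0])` call.

```python
def parse_error_type(metadata: bytes) -> int:
    # Hypothetical stand-in for the step shown in the traceback:
    #     error_type = int(metadata_fields[0])
    # Assumption: the fields are comma-separated.
    metadata_fields = metadata.split(b",")
    return int(metadata_fields[0])


def try_parse(metadata: bytes) -> str:
    # Report either the parsed value or the error, without crashing.
    try:
        return f"ok: {parse_error_type(metadata)}"
    except ValueError as exc:
        return f"ValueError: {exc}"


# Well-formed metadata whose first field is an ASCII integer parses fine.
print(try_parse(b"3"))

# The two byte strings from the logs are not ASCII integers, so int() raises.
print(try_parse(b"\xc6PYTHO"))
print(try_parse(b"\x1e\x15\xbf\x00\x04\x15"))
```

In other words, the deserializer is being handed bytes that were never a valid integer field, which is consistent with the metadata buffer having been overwritten.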

Hey all,

I have to recreate some weird data on our side to reproduce this issue, so I’m working on that.

With the latest commit on Ray master, though, I’m getting a slightly different error message:

[2021-06-23 16:19:13,039 C 416 448] object_buffer_pool.cc:74: Check failed: data_size == static_cast<uint64_t>(object_buffer.data->Size() + object_buffer.metadata->Size())

Do you know what this error might indicate?

I think this is the same kind of issue (it seems the metadata size and the data size were corrupted). The error was probably just caught in a different place.
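
For reference, the invariant behind the `Check failed` line can be sketched as follows. This is only an illustration in Python, not the C++ code in `object_buffer_pool.cc`: the `ObjectBuffer` class and `sizes_consistent` function are hypothetical names, and the sketch assumes the check simply compares an expected total size with the sum of the data and metadata buffer sizes, as the logged expression suggests.

```python
from dataclasses import dataclass


@dataclass
class ObjectBuffer:
    # Hypothetical container mirroring the two buffers named in the log line.
    data: bytes
    metadata: bytes


def sizes_consistent(data_size: int, object_buffer: ObjectBuffer) -> bool:
    # Mirrors the logged condition:
    #     data_size == data->Size() + metadata->Size()
    return data_size == len(object_buffer.data) + len(object_buffer.metadata)


buf = ObjectBuffer(data=b"payload", metadata=b"meta")

# The expected total matches the two buffers, so the check passes.
print(sizes_consistent(11, buf))

# If either recorded size is corrupted, the totals disagree and the check fails.
print(sizes_consistent(12, buf))
```

A failure here would mean the recorded sizes no longer agree with each other, which fits the corruption theory above.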