Using Modin with Ray: trying to save to and load from Parquet without success. Losing my mind

Hi guys and gals.

This question got long and full of desperation; while writing it I tried more things and found out a bunch more, so I hid the original wall of text in case someone wants to read it.

I am currently trying to load a large .ply, store it as Parquet, and load it again.
When loading the .ply I read it in batches of 1 million rows, build a Modin dataframe from each batch, and concatenate them all together.
At the end I save the result with Modin's to_parquet().
(After tears and confusion I got that working, see the old post below.)
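Roughly what that pipeline looks like (a minimal sketch; `read_ply_chunks` and the paths are placeholders, not my real code):

```python
import numpy as np
import ray

ray.init()                  # I start Ray myself with default parameters...
import modin.pandas as pd   # ...and only then import Modin


def read_ply_chunks(path, chunk_rows=1_000_000):
    # Stand-in for my real .ply reader: yields dicts of column arrays,
    # chunk_rows rows at a time (here it just fabricates a few chunks).
    for _ in range(3):
        yield {c: np.random.rand(chunk_rows) for c in ("x", "y", "z")}


frames = [pd.DataFrame(chunk) for chunk in read_ply_chunks("cloud.ply")]
df = pd.concat(frames, ignore_index=True)   # ~1200 frames in the real run
df.to_parquet("cloud.parquet")              # ends up as a folder of part files
```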

But now I fail to load it again on the same machine where I just saved it.
Modin's read_parquet() seems to just have issues (it fills up RAM until the PC dies), so I attempted to use Ray's read_parquet() and then convert the result into a Modin dataframe.
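What I am attempting now, more or less (the path is a placeholder):

```python
import ray

ray.init()

ds = ray.data.read_parquet("cloud.parquet")  # the folder Modin's to_parquet() wrote
df = ds.to_modin()                           # this is where the object store fills up
```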
But I also run into issues here:

raylet.out tail

========== Plasma store: =================
Current usage: 14.1228 / 14.4678 GB

  • num bytes created total: 14122782072
    1 pending objects of total size 13468MB

  • objects spillable: 0

  • bytes spillable: 0

  • objects unsealed: 1

  • bytes unsealed: 14122667401

  • objects in use: 22

  • bytes in use: 14122781888

  • objects evictable: 0

  • bytes evictable: 0

  • objects created by worker: 22

  • bytes created by worker: 14122781888

  • objects restored: 0

  • bytes restored: 0

  • objects received: 0

  • bytes received: 0

  • objects errored: 0

  • bytes errored: 0

[2024-08-06 17:36:44,465 I 27020 27024] (raylet.exe) node_manager.cc:656: Sending Python GC request to 23 local workers to clean up Python cyclic references.
[2024-08-06 17:36:46,464 I 27020 27388] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 14122667401
[2024-08-06 17:36:48,757 I 27020 27388] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 14122667401
[2024-08-06 17:36:48,764 I 27020 27388] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 14122667401
[2024-08-06 17:36:50,306 C 27020 27388] (raylet.exe) dlmalloc.cc:129: Check failed: *handle != nullptr CreateFileMapping() failed. GetLastError() = 1450
*** StackTrace Information ***
unknown
unknown
unknown

I understand that the object store is full, and that is okay; I would like Ray to just spill whatever doesn't fit, but it does not seem to spill anything. Why?

The Parquet dataset has just 20 parts, whereas when I originally loaded the .ply I cut it into about 1200 parts and concatenated them.
Would it help to split it into more Parquet files and do the same?
Or to load them one by one and combine them manually (roughly sketched below)?
What is the right approach here?
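By "manually" I mean something along these lines (paths are placeholders):

```python
import glob

import ray
ray.init()
import modin.pandas as pd

parts = sorted(glob.glob("cloud.parquet/*.parquet"))   # the ~20 part files
df = pd.concat((pd.read_parquet(p) for p in parts), ignore_index=True)
```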

Old post

I am here out of pure desperation.
I am also unsure whether my issues are Modin or Ray related, or both.

My issues are already posted on the Modin GitHub:

[RAY] to_parquet() fails when spilled objects reach 64gig… Also my data is just 40gig · Issue #7360 · modin-project/modin (github.com)
BUG: Behaviour differs if ray is imported and or initialized manually. Without manual import: it fails · Issue #7359 · modin-project/modin (github.com)
and more… as comments on other issues

I am running on a single machine with 20 logical processors and 64 GB of RAM.

What I am trying to do:
I try to load a .ply (43 GB), save it as .parquet file(s), and load them again.

I import ray manually and call init() with no parameters; only after that do I import modin.

I currently load my .ply in chunks of 1 million rows, put them into separate Modin dataframes, and concatenate them all at the end. This works perfectly, and reasonably fast.
In the end I have a Modin dataframe on the Ray backend with some 12 billion rows.

Then I try to save it with Modin's .to_parquet().
This step fails if I don't import and init Ray manually. But when I do, it… generally works.
But when it does not, I get

(raylet.exe) dlmalloc.cc:129:  Check failed: *handle != nullptr CreateFileMapping() failed. GetLastError() = 1455

I had this error before and noticed that it happens exactly when the spill folder reaches 64 GB. I could remedy it by setting os.environ['MODIN_MEMORY'] to something higher.
Even though, if I am not mistaken, it makes no sense that this helps: because I am calling init() myself, Modin does not initialize Ray at all and therefore should not be using that environment variable.
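The workaround was literally just this, set before Ray or Modin get imported (the value is only an example, not what I settled on):

```python
import os

# Raise Modin's memory setting; the value is in bytes and purely illustrative.
os.environ["MODIN_MEMORY"] = str(200 * 1024**3)
```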

But after waking up today it stopped working again, with the same error.
So I guess that was just a coincidence. Possibly it worked one time and not the other because the memory limit is calculated from the size of the virtual memory / paging file, which is not static by default on Windows?

When initializing Ray, what is the difference between _memory and object_store_memory? Does it make sense to set them to the same value? Why or why not?
If I let Modin initialize Ray it sets both to the same value (0.6 of my virtual memory). Does this lead to my RAM filling up and my PC dying?
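For reference, this is roughly how I would pass both explicitly (the values are purely illustrative; whether they should match is exactly what I am asking):

```python
import ray

GiB = 1024 ** 3
ray.init(
    object_store_memory=16 * GiB,  # size of the plasma / shared-memory object store
    _memory=64 * GiB,              # the "memory" resource advertised to tasks (my understanding)
)
```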

If I understood correctly, when the object store is full it should just spill to disk. But apparently this memory is capped somewhere as well. Where? How? Why?
If I understand correctly, the spilling is attempted but fails (raylet.out excerpt at the bottom).
Is this the _memory?
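If the spill destination is what is capped, I would have expected to be able to point it at a drive with plenty of room via the documented object_spilling_config (the directory below is just a placeholder):

```python
import json

import ray

ray.init(
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "D:/ray_spill"}}
        )
    }
)
```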

I managed to load the Parquet files by manually iterating over the .parquet files in the .parquet folder and concatenating the sub-dataframes, similar to my initial load.

Now it fails at the first operation, which is a df.sort().
Totally fair, that seems like a tough thing to do out of core, but then again I fail to understand the logs.

raylet.out tail

[2024-08-06 17:53:40,919 I 18208 17048] (raylet.exe) node_manager.cc:525: [state-dump] NodeManager:
[state-dump] Node ID: 11c5bc1e90bcd260a4d9cd7a677060b61cbbe24898aa765e8491f8f8
[state-dump] Node name: 127.0.0.1
[state-dump] InitialConfigResources: {memory: 2684354560000000, object_store_memory: 130524131320000, GPU: 10000, accelerator_type:G: 10000, node:internal_head: 10000, node:127.0.0.1: 10000, CPU: 200000}
[state-dump] ClusterTaskManager:
[state-dump] ========== Node: 11c5bc1e90bcd260a4d9cd7a677060b61cbbe24898aa765e8491f8f8 =================
[state-dump] Infeasible queue length: 0
[state-dump] Schedule queue length: 0
[state-dump] Dispatch queue length: 6
[state-dump] num_waiting_for_resource: 0
[state-dump] num_waiting_for_plasma_memory: 5
[state-dump] num_waiting_for_remote_node_resources: 0
[state-dump] num_worker_not_started_by_job_config_not_exist: 0
[state-dump] num_worker_not_started_by_registration_timeout: 0
[state-dump] num_tasks_waiting_for_workers: 1
[state-dump] num_cancelled_tasks: 0
[state-dump] cluster_resource_scheduler state:
[state-dump] Local id: 6699133724476589756 Local resources: {“total”:{memory: [2684354560000000], node:internal_head: [10000], object_store_memory: [130524131320000], GPU: [10000], node:127.0.0.1: [10000], accelerator_type:G: [10000], CPU: [200000]}}, “available”: {CPU: [160000], node:internal_head: [10000], GPU: [10000], object_store_memory: [41676085680000], node:127.0.0.1: [10000], accelerator_type:G: [10000], memory: [2684354560000000]}}, “labels”:{“ray.io/node_id":"11c5bc1e90bcd260a4d9cd7a677060b61cbbe24898aa765e8491f8f8”,} is_draining: 0 is_idle: 0 Cluster resources: node id: 6699133724476589756{“total”:{accelerator_type:G: 10000, GPU: 10000, object_store_memory: 130524131320000, node:127.0.0.1: 10000, memory: 2684354560000000, node:internal_head: 10000, CPU: 200000}}, “available”: {GPU: 10000, accelerator_type:G: 10000, memory: 2684354560000000, object_store_memory: 41676085680000, node:127.0.0.1: 10000, CPU: 160000, node:internal_head: 10000}}, “labels”:{“ray.io/node_id":"11c5bc1e90bcd260a4d9cd7a677060b61cbbe24898aa765e8491f8f8”,}, “is_draining”: 0, “draining_deadline_timestamp_ms”: -1} { “placment group locations”: , “node to bundles”: }
[state-dump] Waiting tasks size: 11
[state-dump] Number of executing tasks: 5
[state-dump] Number of pinned task arguments: 81
[state-dump] Number of total spilled tasks: 0
[state-dump] Number of spilled waiting tasks: 0
[state-dump] Number of spilled unschedulable tasks: 0
[state-dump] Resource usage {
[state-dump] - (language=PYTHON actor_or_task=_deploy_ray_func pid=34212 worker_id=c335baa0647805237ea166ae5612bf3fb4972e18f08183fbe50b482b): {CPU: 10000}
[state-dump] - (language=PYTHON actor_or_task=_deploy_ray_func pid=25720 worker_id=4d6a02e2097c2759b8a6afef63d8c0d2f29595df401467c692389b7d): {CPU: 10000}
[state-dump] - (language=PYTHON actor_or_task=_deploy_ray_func pid=32744 worker_id=ca71e19b7f9b29f14a488c8b669b33d2408e2806169147e07c971863): {CPU: 10000}
[state-dump] - (language=PYTHON actor_or_task=_deploy_ray_func pid=34960 worker_id=f1c1bc3b54e5cef6d9812d0460bc9a12d4bec70805678f7ee0184884): {CPU: 10000}
[state-dump] }
[state-dump] Running tasks by scheduling class:
[state-dump] - {depth=1 function_descriptor={type=PythonFunctionDescriptor, module_name=ray.data._internal.stats, class_name=_StatsActor, function_name=init, function_hash=d0b7803d915a49409aff6b327f9190ef} scheduling_strategy=node_affinity_scheduling_strategy {
[state-dump] node_id: “\021\305\274\036\220\274\322\244\331\315zgp\266\034\273\342H\230\252v^\204\221\370\370”
[state-dump] }
[state-dump] resource_set={}}: 1/18446744073709551615
[state-dump] - {depth=1 function_descriptor={type=PythonFunctionDescriptor, module_name=modin.core.execution.ray.implementations.pandas_on_ray.partitioning.virtual_partition, class_name=, function_name=_deploy_ray_func, function_hash=bcc424c2f49e4803b4dbb629bc272513} scheduling_strategy=default_scheduling_strategy {
[state-dump] }
[state-dump] resource_set={CPU : 1, }}: 4/20
[state-dump] ==================================================
[state-dump]
[state-dump] ClusterResources:
[state-dump] LocalObjectManager:
[state-dump] - num pinned objects: 0
[state-dump] - pinned objects size: 0
[state-dump] - num objects pending restore: 0
[state-dump] - num objects pending spill: 4
[state-dump] - num bytes pending spill: 8884804564
[state-dump] - num bytes currently spilled: 71035560890
[state-dump] - cumulative spill requests: 816
[state-dump] - cumulative restore requests: 633
[state-dump] - spilled objects pending delete: 0
[state-dump]
[state-dump] ObjectManager:
[state-dump] - num local objects: 88
[state-dump] - num unfulfilled push requests: 0
[state-dump] - num object pull requests: 18
[state-dump] - num chunks received total: 0
[state-dump] - num chunks received failed (all): 0
[state-dump] - num chunks received failed / cancelled: 0
[state-dump] - num chunks received failed / plasma error: 0
[state-dump] Event stats:
[state-dump] Global stats: 66 total (0 active)
[state-dump] Queueing time: mean = 1.943 ms, max = 25.783 ms, min = 8.232 us, total = 128.212 ms
[state-dump] Execution time: mean = 9.728 us, total = 642.072 us
[state-dump] Event stats:
[state-dump] ObjectManager.FreeObjects - 66 total (0 active), Execution time: mean = 9.728 us, total = 642.072 us, Queueing time: mean = 1.943 ms, max = 25.783 ms, min = 8.232 us, total = 128.212 ms
[state-dump] PushManager:
[state-dump] - num pushes in flight: 0
[state-dump] - num chunks in flight: 0
[state-dump] - num chunks remaining: 0
[state-dump] - max chunks allowed: 409
[state-dump] OwnershipBasedObjectDirectory:
[state-dump] - num listeners: 18
[state-dump] - cumulative location updates: 15788333336426
[state-dump] - num location updates per second: 0.000
[state-dump] - num location lookups per second: 0.000
[state-dump] - num locations added per second: 0.000
[state-dump] - num locations removed per second: 0.000
[state-dump] BufferPool:
[state-dump] - create buffer state map size: 0
[state-dump] PullManager:
[state-dump] - num bytes available for pulled objects: 0
[state-dump] - num bytes being pulled (all): 2221208186
[state-dump] - num bytes being pulled / pinned: 2221208186
[state-dump] - get request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - wait request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - task request bundles: BundlePullRequestQueue{16 total, 1 active, 15 inactive, 0 unpullable}
[state-dump] - first get request bundle: N/A
[state-dump] - first wait request bundle: N/A
[state-dump] - first task request bundle: 3 objects, 2221208186 bytes (inactive, waiting for capacity)
[state-dump] - num objects queued: 18
[state-dump] - num objects actively pulled (all): 3
[state-dump] - num objects actively pulled / pinned: 3
[state-dump] - num bundles being pulled: 1
[state-dump] - num pull retries: 0
[state-dump] - max timeout seconds: 10
[state-dump] - max timeout request is already processed. No entry.
[state-dump]
[state-dump] WorkerPool:
[state-dump] - registered jobs: 2
[state-dump] - process_failed_job_config_missing: 0
[state-dump] - process_failed_rate_limited: 0
[state-dump] - process_failed_pending_registration: 0
[state-dump] - process_failed_runtime_env_setup_failed: 0
[state-dump] - num PYTHON workers: 28
[state-dump] - num PYTHON drivers: 2
[state-dump] - num object spill callbacks queued: 0
[state-dump] - num object restore queued: 0
[state-dump] - num util functions queued: 0
[state-dump] - num idle workers: 16
[state-dump] TaskDependencyManager:
[state-dump] - task deps map size: 16
[state-dump] - get req map size: 0
[state-dump] - wait req map size: 0
[state-dump] - local objects map size: 88
[state-dump] WaitManager:
[state-dump] - num active wait requests: 0
[state-dump] Subscriber:
[state-dump] Channel WORKER_OBJECT_EVICTION
[state-dump] - cumulative subscribe requests: 860
[state-dump] - cumulative unsubscribe requests: 440
[state-dump] - active subscribed publishers: 1
[state-dump] - cumulative published messages: 440
[state-dump] - cumulative processed messages: 440
[state-dump] Channel WORKER_OBJECT_LOCATIONS_CHANNEL
[state-dump] - cumulative subscribe requests: 2866
[state-dump] - cumulative unsubscribe requests: 2848
[state-dump] - active subscribed publishers: 1
[state-dump] - cumulative published messages: 3144
[state-dump] - cumulative processed messages: 1965
[state-dump] Channel WORKER_REF_REMOVED_CHANNEL
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] num async plasma notifications: 0
[state-dump] Remote node managers:
[state-dump] Event stats:
[state-dump] Global stats: 81938 total (69 active)
[state-dump] Queueing time: mean = 25.725 ms, max = 40.485 s, min = -0.001 s, total = 2107.849 s
[state-dump] Execution time: mean = 138.049 ms, total = 11311.475 s
[state-dump] Event stats:
[state-dump] NodeManager.SpillObjects - 8539 total (1 active), Execution time: mean = 13.282 us, total = 113.412 ms, Queueing time: mean = 285.694 us, max = 54.399 ms, min = 1.879 us, total = 2.440 s
[state-dump] NodeManager.GlobalGC - 8539 total (1 active), Execution time: mean = 685.499 ns, total = 5.853 ms, Queueing time: mean = 286.598 us, max = 54.395 ms, min = 1.566 us, total = 2.447 s
[state-dump] NodeManagerService.grpc_server.ReportWorkerBacklog.HandleRequestImpl - 6544 total (1 active), Execution time: mean = 40.975 us, total = 268.137 ms, Queueing time: mean = 1.122 ms, max = 280.977 ms, min = 2.629 us, total = 7.346 s
[state-dump] NodeManagerService.grpc_server.ReportWorkerBacklog - 6544 total (1 active), Execution time: mean = 1.436 ms, total = 9.395 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] CoreWorkerService.grpc_client.PubsubCommandBatch.OnReplyReceived - 3058 total (0 active), Execution time: mean = 180.878 us, total = 553.125 ms, Queueing time: mean = 333.780 us, max = 314.878 ms, min = 3.847 us, total = 1.021 s
[state-dump] CoreWorkerService.grpc_client.PubsubCommandBatch - 3058 total (0 active), Execution time: mean = 1.292 ms, total = 3.950 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ClientConnection.async_read.ProcessMessageHeader - 2632 total (30 active), Execution time: mean = 8.745 us, total = 23.017 ms, Queueing time: mean = 761.395 ms, max = 40.485 s, min = 2.673 us, total = 2003.992 s
[state-dump] ClientConnection.async_read.ProcessMessage - 2602 total (0 active), Execution time: mean = 227.944 us, total = 593.110 ms, Queueing time: mean = 220.327 us, max = 305.922 ms, min = 1.923 us, total = 573.291 ms
[state-dump] RaySyncer.OnDemandBroadcasting - 2157 total (1 active), Execution time: mean = 120.681 us, total = 260.309 ms, Queueing time: mean = 11.201 ms, max = 327.811 ms, min = -0.001 s, total = 24.160 s
[state-dump] NodeManager.CheckGC - 2157 total (1 active), Execution time: mean = 181.869 us, total = 392.291 ms, Queueing time: mean = 11.140 ms, max = 327.810 ms, min = -0.001 s, total = 24.030 s
[state-dump] ObjectManager.UpdateAvailableMemory - 2157 total (0 active), Execution time: mean = 143.997 us, total = 310.602 ms, Queueing time: mean = 802.706 us, max = 95.055 ms, min = 1.531 us, total = 1.731 s
[state-dump] CoreWorkerService.grpc_client.PubsubLongPolling - 2144 total (1 active), Execution time: mean = 105.431 ms, total = 226.044 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] CoreWorkerService.grpc_client.PubsubLongPolling.OnReplyReceived - 2143 total (0 active), Execution time: mean = 303.204 us, total = 649.765 ms, Queueing time: mean = 403.038 us, max = 294.286 ms, min = 3.293 us, total = 863.711 ms
[state-dump] Subscriber.HandlePublishedMessage_WORKER_OBJECT_LOCATIONS_CHANNEL - 1965 total (0 active), Execution time: mean = 28.264 us, total = 55.538 ms, Queueing time: mean = 398.147 us, max = 3.422 ms, min = 26.231 us, total = 782.359 ms
[state-dump] RaySyncer.BroadcastMessage - 1933 total (0 active), Execution time: mean = 188.863 us, total = 365.072 ms, Queueing time: mean = 1.919 us, max = 728.936 us, min = 164.000 ns, total = 3.710 ms
[state-dump] - 1933 total (0 active), Execution time: mean = 27.158 us, total = 52.497 ms, Queueing time: mean = 313.880 us, max = 47.176 ms, min = 2.263 us, total = 606.731 ms
[state-dump] CoreWorkerService.grpc_client.UpdateObjectLocationBatch.OnReplyReceived - 1881 total (0 active), Execution time: mean = 68.430 us, total = 128.717 ms, Queueing time: mean = 905.268 us, max = 315.011 ms, min = 2.957 us, total = 1.703 s
[state-dump] CoreWorkerService.grpc_client.UpdateObjectLocationBatch - 1881 total (0 active), Execution time: mean = 1.702 ms, total = 3.202 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ObjectManager.ObjectAdded - 1515 total (0 active), Execution time: mean = 599.431 us, total = 908.138 ms, Queueing time: mean = 1.234 ms, max = 288.828 ms, min = 5.604 us, total = 1.870 s
[state-dump] ObjectManager.ObjectDeleted - 1427 total (0 active), Execution time: mean = 86.146 us, total = 122.930 ms, Queueing time: mean = 1.090 ms, max = 40.857 ms, min = 4.481 us, total = 1.556 s
[state-dump] NodeManagerService.grpc_server.RequestWorkerLease.HandleRequestImpl - 1277 total (0 active), Execution time: mean = 181.013 us, total = 231.153 ms, Queueing time: mean = 704.484 us, max = 28.314 ms, min = 3.489 us, total = 899.626 ms
[state-dump] NodeManagerService.grpc_server.RequestWorkerLease - 1277 total (17 active), Execution time: mean = 7.261 s, total = 9272.839 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] WorkerPool.PopWorkerCallback - 1260 total (0 active), Execution time: mean = 443.819 us, total = 559.212 ms, Queueing time: mean = 249.888 us, max = 21.548 ms, min = 8.531 us, total = 314.859 ms
[state-dump] NodeManagerService.grpc_server.ReturnWorker.HandleRequestImpl - 1256 total (0 active), Execution time: mean = 200.006 us, total = 251.207 ms, Queueing time: mean = 912.102 us, max = 294.327 ms, min = 4.262 us, total = 1.146 s
[state-dump] NodeManagerService.grpc_server.ReturnWorker - 1256 total (0 active), Execution time: mean = 1.374 ms, total = 1.726 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] RayletWorkerPool.deadline_timer.kill_idle_workers - 1159 total (1 active), Execution time: mean = 21.299 us, total = 24.686 ms, Queueing time: mean = 7.262 ms, max = 228.515 ms, min = -0.000 s, total = 8.417 s
[state-dump] CoreWorkerService.grpc_client.GetCoreWorkerStats - 1051 total (1 active), Execution time: mean = 353.962 ms, total = 372.014 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] CoreWorkerService.grpc_client.GetCoreWorkerStats.OnReplyReceived - 1050 total (0 active), Execution time: mean = 21.387 us, total = 22.456 ms, Queueing time: mean = 2.497 ms, max = 68.706 ms, min = 2.531 us, total = 2.622 s
[state-dump] NodeManagerService.grpc_server.PinObjectIDs - 860 total (0 active), Execution time: mean = 2.732 ms, total = 2.349 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] NodeManagerService.grpc_server.PinObjectIDs.HandleRequestImpl - 860 total (0 active), Execution time: mean = 1.187 ms, total = 1.021 s, Queueing time: mean = 1.276 ms, max = 253.244 ms, min = 3.190 us, total = 1.097 s
[state-dump] CoreWorkerService.grpc_client.LocalGC - 678 total (1 active), Execution time: mean = 546.409 ms, total = 370.465 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] CoreWorkerService.grpc_client.LocalGC.OnReplyReceived - 677 total (0 active), Execution time: mean = 38.906 us, total = 26.340 ms, Queueing time: mean = 6.700 ms, max = 118.298 ms, min = 7.753 us, total = 4.536 s
[state-dump] CoreWorkerService.grpc_client.RestoreSpilledObjects.OnReplyReceived - 633 total (0 active), Execution time: mean = 217.989 us, total = 137.987 ms, Queueing time: mean = 1.953 ms, max = 288.106 ms, min = 5.128 us, total = 1.236 s
[state-dump] CoreWorkerService.grpc_client.RestoreSpilledObjects - 633 total (0 active), Execution time: mean = 411.531 ms, total = 260.499 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] CoreWorkerService.grpc_client.SpillObjects - 495 total (1 active), Execution time: mean = 896.493 ms, total = 443.764 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] CoreWorkerService.grpc_client.SpillObjects.OnReplyReceived - 494 total (0 active), Execution time: mean = 1.022 ms, total = 505.089 ms, Queueing time: mean = 1.549 ms, max = 253.483 ms, min = 6.983 us, total = 765.321 ms
[state-dump] Subscriber.HandlePublishedMessage_WORKER_OBJECT_EVICTION - 440 total (0 active), Execution time: mean = 105.952 us, total = 46.619 ms, Queueing time: mean = 361.962 us, max = 6.662 ms, min = 63.738 us, total = 159.263 ms
[state-dump] NodeManager.deadline_timer.spill_objects_when_over_threshold - 238 total (1 active), Execution time: mean = 45.317 us, total = 10.785 ms, Queueing time: mean = 12.556 ms, max = 69.352 ms, min = -0.000 s, total = 2.988 s
[state-dump] NodeManager.ScheduleAndDispatchTasks - 238 total (1 active), Execution time: mean = 67.709 us, total = 16.115 ms, Queueing time: mean = 12.557 ms, max = 68.726 ms, min = -0.000 s, total = 2.988 s
[state-dump] NodeManager.deadline_timer.flush_free_objects - 237 total (1 active), Execution time: mean = 532.508 us, total = 126.204 ms, Queueing time: mean = 12.544 ms, max = 68.907 ms, min = -0.000 s, total = 2.973 s
[state-dump] NodeManagerService.grpc_server.GetResourceLoad.HandleRequestImpl - 236 total (0 active), Execution time: mean = 125.813 us, total = 29.692 ms, Queueing time: mean = 762.928 us, max = 26.054 ms, min = 3.744 us, total = 180.051 ms
[state-dump] NodeManagerService.grpc_server.GetResourceLoad - 236 total (0 active), Execution time: mean = 1.210 ms, total = 285.581 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ClusterResourceManager.ResetRemoteNodeView - 80 total (1 active), Execution time: mean = 10.205 us, total = 816.398 us, Queueing time: mean = 9.133 ms, max = 50.038 ms, min = -0.000 s, total = 730.670 ms
[state-dump] NodeManager.GcsCheckAlive - 48 total (1 active), Execution time: mean = 305.942 us, total = 14.685 ms, Queueing time: mean = 8.769 ms, max = 38.017 ms, min = 1.066 ms, total = 420.932 ms
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.CheckAlive - 48 total (0 active), Execution time: mean = 1.943 ms, total = 93.286 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.record_metrics - 48 total (1 active), Execution time: mean = 431.747 us, total = 20.724 ms, Queueing time: mean = 8.663 ms, max = 37.872 ms, min = 518.322 us, total = 415.831 ms
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.CheckAlive.OnReplyReceived - 48 total (0 active), Execution time: mean = 37.073 us, total = 1.780 ms, Queueing time: mean = 599.873 us, max = 15.899 ms, min = 7.093 us, total = 28.794 ms
[state-dump] CoreWorkerService.grpc_client.DeleteSpilledObjects.OnReplyReceived - 42 total (0 active), Execution time: mean = 276.005 us, total = 11.592 ms, Queueing time: mean = 1.795 ms, max = 73.470 ms, min = 8.881 us, total = 75.380 ms
[state-dump] CoreWorkerService.grpc_client.DeleteSpilledObjects - 42 total (0 active), Execution time: mean = 196.171 ms, total = 8.239 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] NodeManagerService.grpc_server.GetNodeStats - 37 total (1 active), Execution time: mean = 8.765 s, total = 324.321 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] NodeManagerService.grpc_server.GetNodeStats.HandleRequestImpl - 37 total (0 active), Execution time: mean = 5.530 ms, total = 204.627 ms, Queueing time: mean = 1.003 ms, max = 14.421 ms, min = 7.057 us, total = 37.095 ms
[state-dump] ClientConnection.async_write.DoAsyncWrites - 32 total (0 active), Execution time: mean = 1.688 us, total = 54.003 us, Queueing time: mean = 111.398 us, max = 278.758 us, min = 55.195 us, total = 3.565 ms
[state-dump] NodeManagerService.grpc_server.GetSystemConfig.HandleRequestImpl - 30 total (0 active), Execution time: mean = 45.369 us, total = 1.361 ms, Queueing time: mean = 108.800 us, max = 1.117 ms, min = 10.800 us, total = 3.264 ms
[state-dump] NodeManagerService.grpc_server.GetSystemConfig - 30 total (0 active), Execution time: mean = 421.800 us, total = 12.654 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.debug_state_dump - 24 total (1 active), Execution time: mean = 4.164 ms, total = 99.941 ms, Queueing time: mean = 9.379 ms, max = 15.528 ms, min = 517.894 us, total = 225.103 ms
[state-dump] PeriodicalRunner.RunFnPeriodically - 12 total (0 active), Execution time: mean = 433.758 us, total = 5.205 ms, Queueing time: mean = 35.882 ms, max = 186.548 ms, min = 95.400 us, total = 430.589 ms
[state-dump] NodeManager.deadline_timer.print_event_loop_stats - 4 total (1 active, 1 running), Execution time: mean = 2.143 ms, total = 8.570 ms, Queueing time: mean = 5.806 ms, max = 14.542 ms, min = 742.547 us, total = 23.224 ms
[state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 3 total (1 active), Execution time: mean = 1.290 s, total = 3.871 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ray::rpc::JobInfoGcsService.grpc_client.AddJob.OnReplyReceived - 2 total (0 active), Execution time: mean = 107.071 us, total = 214.142 us, Queueing time: mean = 271.801 us, max = 322.534 us, min = 221.068 us, total = 543.602 us
[state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberPoll.OnReplyReceived - 2 total (0 active), Execution time: mean = 283.927 us, total = 567.854 us, Queueing time: mean = 18.974 us, max = 22.825 us, min = 15.122 us, total = 37.947 us
[state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), Execution time: mean = 637.200 us, total = 1.274 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] RaySyncerRegister - 2 total (0 active), Execution time: mean = 4.050 us, total = 8.100 us, Queueing time: mean = 1.200 us, max = 2.100 us, min = 300.000 ns, total = 2.400 us
[state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch.OnReplyReceived - 2 total (0 active), Execution time: mean = 198.750 us, total = 397.500 us, Queueing time: mean = 2.561 ms, max = 5.013 ms, min = 108.600 us, total = 5.122 ms
[state-dump] ray::rpc::JobInfoGcsService.grpc_client.AddJob - 2 total (0 active), Execution time: mean = 2.054 ms, total = 4.107 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] Subscriber.HandlePublishedMessage_GCS_JOB_CHANNEL - 2 total (0 active), Execution time: mean = 72.809 us, total = 145.619 us, Queueing time: mean = 322.529 us, max = 428.056 us, min = 217.003 us, total = 645.059 us
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), Execution time: mean = 1.270 ms, total = 1.270 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode.OnReplyReceived - 1 total (0 active), Execution time: mean = 842.400 us, total = 842.400 us, Queueing time: mean = 53.300 us, max = 53.300 us, min = 53.300 us, total = 53.300 us
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), Execution time: mean = 1.125 ms, total = 1.125 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 213.763 ms, total = 213.763 ms, Queueing time: mean = 35.800 us, max = 35.800 us, min = 35.800 us, total = 35.800 us
[state-dump] NodeManager.GCTaskFailureReason - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), Execution time: mean = 464.100 us, total = 464.100 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ray::rpc::JobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), Execution time: mean = 445.000 us, total = 445.000 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetAllNodeInfo.OnReplyReceived - 1 total (0 active), Execution time: mean = 233.200 us, total = 233.200 us, Queueing time: mean = 17.800 us, max = 17.800 us, min = 17.800 us, total = 17.800 us
[state-dump] ray::rpc::JobInfoGcsService.grpc_client.GetAllJobInfo.OnReplyReceived - 1 total (0 active), Execution time: mean = 25.400 us, total = 25.400 us, Queueing time: mean = 16.300 us, max = 16.300 us, min = 16.300 us, total = 16.300 us
[state-dump] DebugString() time ms: 1
[state-dump]
[state-dump]
[2024-08-06 17:53:48,475 I 18208 17048] (raylet.exe) node_manager.cc:656: Sending Python GC request to 30 local workers to clean up Python cyclic references.
[2024-08-06 17:53:54,561 I 18208 17048] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 118563 MiB, 820 objects, write throughput 622 MiB/s.
[2024-08-06 17:53:54,565 I 18208 35500] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000023FA4A90000, 4294967304)
[2024-08-06 17:53:54,612 I 18208 17048] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-06 17:53:54,761 I 18208 35500] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000023EE4A80000, 3221225480)
[2024-08-06 17:53:54,932 I 18208 35500] (raylet.exe) dlmalloc.cc:288: fake_munmap(00000240A4AA0000, 4294967304)
[2024-08-06 17:53:55,124 I 18208 35500] (raylet.exe) dlmalloc.cc:288: fake_munmap(00000241A4AB0000, 8589934600)
[2024-08-06 17:53:57,310 I 18208 35500] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2221201141
[2024-08-06 17:53:57,310 I 18208 35500] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2221201141
[2024-08-06 17:53:57,678 C 18208 35500] (raylet.exe) dlmalloc.cc:129: Check failed: *handle != nullptr CreateFileMapping() failed. GetLastError() = 1450
*** StackTrace Information ***
unknown
unknown
unknown
unknown

Again the same error. It tried to spill to the filesystem. I checked the dashboard for memory and there was more than enough space (I set _memory to 250 GB); the dashboard said 0/250.

Why is this happening?
Why can't it spill?

I managed to load the Parquet data by manually iterating over all the sub-Parquet files, loading them, and concatenating them.

I also managed to write with a partition column and then load with ray.data.read_parquet(), but that was on a much smaller file. I am currently trying with a bigger one.
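Roughly what that experiment looked like (column name and paths made up for the example):

```python
import ray
ray.init()
import modin.pandas as pd

df = pd.DataFrame({"x": range(10), "bucket": [i % 2 for i in range(10)]})
df.to_parquet("small.parquet", partition_cols=["bucket"])  # hive-style folder per bucket value

ds = ray.data.read_parquet("small.parquet")
df2 = ds.to_modin()
```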

Generally I am running into the same issue A LOT, and I just don't understand what it means. Why is it not spilling successfully?

========== Plasma store: =================
Current usage: 4.93738 / 15.8323 GB

  • num bytes created total: 137208419981
    1 pending objects of total size 2589MB

  • objects spillable: 2

  • bytes spillable: 2714807731

  • objects unsealed: 0

  • bytes unsealed: 0

  • objects in use: 1237

  • bytes in use: 4937380878

  • objects evictable: 0

  • bytes evictable: 0

  • objects created by worker: 2

  • bytes created by worker: 2714807731

  • objects restored: 1235

  • bytes restored: 2222573147

  • objects received: 0

  • bytes received: 0

  • objects errored: 0

  • bytes errored: 0

[2024-08-06 19:55:02,970 I 13688 25900] (raylet.exe) local_object_manager.cc:490: Restored 39314 MiB, 22889 objects, read throughput 176 MiB/s
[2024-08-06 19:55:08,763 I 13688 25900] (raylet.exe) node_manager.cc:656: Sending Python GC request to 29 local workers to clean up Python cyclic references.
[2024-08-06 19:55:09,310 I 13688 25900] (raylet.exe) local_object_manager.cc:245: :info_message:Spilled 91537 MiB, 24700 objects, write throughput 465 MiB/s.
[2024-08-06 19:55:09,321 I 13688 39132] (raylet.exe) dlmalloc.cc:288: fake_munmap(0000020D3E520000, 68719476744)
[2024-08-06 19:55:09,415 I 13688 25900] (raylet.exe) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-06 19:55:11,540 I 13688 39132] (raylet.exe) object_lifecycle_manager.cc:206: Shared memory store full, falling back to allocating from filesystem: 2714801119
[2024-08-06 19:55:11,560 C 13688 39132] (raylet.exe) dlmalloc.cc:129: Check failed: *handle != nullptr CreateFileMapping() failed. GetLastError() = 1455

Resource stats look like this:
[screenshot of the Ray dashboard resource stats]