SIGSEGV storing JSON in trial_runner

Hello,

I’m running a Tuner job with Ray 2.5.1 on Ubuntu. I have been getting the following error fairly regularly (maybe 1 in 4 trials). I am saving checkpoints every 10 iterations, plus a final checkpoint. Since there are JSON files in the checkpoint directories, and I am not doing anything else with JSON, I have to suspect this is a checkpointing problem. Any advice on how to investigate further, or how to deal with it? For the time being, I’m reducing the checkpoint activity to lower the odds of hitting it. Thank you.
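For context, this is roughly how my checkpointing is set up (a sketch only, not my exact code; `trainable` and `config` are placeholders for my RLlib algorithm and my modified SAC config):

```python
from ray import air, tune

# Sketch of my Tuner setup (illustrative only; `trainable` and `config`
# stand in for my RLlib algorithm class and modified SAC config dict).
tuner = tune.Tuner(
    trainable,
    param_space=config,
    run_config=air.RunConfig(
        checkpoint_config=air.CheckpointConfig(
            checkpoint_frequency=10,  # save every 10 iterations
            checkpoint_at_end=True,   # plus the final checkpoint
        ),
    ),
)
results = tuner.fit()
```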

*** SIGSEGV received at time=1697670253 on cpu 6 ***
PC: @           0x50d247  (unknown)  list_iter
    @     0x7fdaf7e3a420  (unknown)  (unknown)
[2023-10-18 19:04:13,049 E 7073 7073] logging.cc:361: *** SIGSEGV received at time=1697670253 on cpu 6 ***
[2023-10-18 19:04:13,049 E 7073 7073] logging.cc:361: PC: @           0x50d247  (unknown)  list_iter
[2023-10-18 19:04:13,049 E 7073 7073] logging.cc:361:     @     0x7fdaf7e3a420  (unknown)  (unknown)
Fatal Python error: Segmentation fault

Stack (most recent call first):
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/collections/__init__.py", line 981 in __getitem__
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 733 in dump
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 88 in dumps
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/site-packages/ray/tune/utils/serialization.py", line 28 in _to_cloudpickle
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/site-packages/ray/tune/utils/serialization.py", line 23 in default
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/json/encoder.py", line 438 in _iterencode
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/json/encoder.py", line 405 in _iterencode_dict
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/json/encoder.py", line 405 in _iterencode_dict
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/json/encoder.py", line 431 in _iterencode
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/json/__init__.py", line 179 in dump
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py", line 376 in save_to_dir
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/site-packages/ray/tune/execution/experiment_state.py", line 232 in checkpoint
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py", line 491 in checkpoint
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/site-packages/ray/tune/execution/tune_controller.py", line 269 in step
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/site-packages/ray/tune/tune.py", line 1070 in run
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/site-packages/ray/tune/impl/tuner_internal.py", line 712 in _fit_internal
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/site-packages/ray/tune/impl/tuner_internal.py", line 588 in fit
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/site-packages/ray/tune/tuner.py", line 347 in fit
  File "/home/starkj/projects/cda1/staging/tune.py", line 182 in main
  File "/home/starkj/projects/cda1/staging/tune.py", line 195 in <module>

Extension modules: msgpack._cmsgpack, setproctitle, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, yaml._yaml, grpc._cython.cygrpc, ray._raylet, mkl._mklinit, mkl._py_mkl_service, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pyarrow._hdfsio, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.hashing, pandas._libs.tslib, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.index, pandas._libs.internals, pandas._libs.join, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, _cffi_backend, tensorflow.python.framework.fast_tensor_util, h5py._errors, h5py.defs, h5py._objects, h5py.h5, h5py.h5r, h5py.utils, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5t, h5py._conv, h5py.h5z, h5py._proxy, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5o, h5py.h5l, 
h5py._selector, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._isolve._iterative, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.linalg._flinalg, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, PIL._imaging, scipy.ndimage._nd_image, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, _ni_label, scipy.ndimage._ni_label, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, gmpy2.gmpy2, skimage._shared.geometry, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, skimage.draw._draw, skimage.transform._hough_transform, scipy.interpolate._fitpack, scipy.interpolate.dfitpack, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, numpy.linalg.lapack_lite, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize.__nnls, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, 
scipy.optimize._lsap, scipy.optimize._direct, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy._lib._uarray._uarray, skimage.transform._warps_cy, skimage.measure._find_contours_cy, skimage.measure._marching_cubes_lewiner_cy, skimage.measure._moments_cy, scipy.signal._sigtools, scipy.signal._max_len_seq_inner, scipy.signal._upfirdn_apply, scipy.signal._spline, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.signal._sosfilt, scipy.signal._spectral, scipy.special.cython_special, scipy.stats._stats, scipy.stats.beta_ufunc, scipy.stats._boost.beta_ufunc, scipy.stats.binom_ufunc, scipy.stats._boost.binom_ufunc, scipy.stats.nbinom_ufunc, scipy.stats._boost.nbinom_ufunc, scipy.stats.hypergeom_ufunc, scipy.stats._boost.hypergeom_ufunc, scipy.stats.ncf_ufunc, scipy.stats._boost.ncf_ufunc, scipy.stats.ncx2_ufunc, scipy.stats._boost.ncx2_ufunc, scipy.stats.nct_ufunc, scipy.stats._boost.nct_ufunc, scipy.stats.skewnorm_ufunc, scipy.stats._boost.skewnorm_ufunc, scipy.stats.invgauss_ufunc, scipy.stats._boost.invgauss_ufunc, scipy.stats._biasedurn, scipy.stats._levy_stable.levyst, scipy.stats._stats_pythran, scipy.stats._statlib, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._mvn, scipy.stats._rcont.rcont, scipy.signal._peak_finding_utils, skimage.measure._pnpoly, skimage.measure._ccomp, skimage.transform._radon_transform, lz4._version, lz4.frame._frame, pyarrow._json (total: 228)
./train.sh: line 10:  7073 Segmentation fault      (core dumped) python -u tune.py $1 > >(tee ~/tmp/log) 2> >(tee -a ~/tmp/log) 1>&2

Update: my latest run ended with a slightly different stack dump. Still can’t get a run to go longer than ~15 hours.

Stack (most recent call first):
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/site-packages/ray/tune/utils/serialization.py", line 28 in _to_cloudpickle
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/site-packages/ray/tune/utils/serialization.py", line 23 in default
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/json/encoder.py", line 438 in _iterencode
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/json/encoder.py", line 405 in _iterencode_dict
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/json/encoder.py", line 405 in _iterencode_dict
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/json/encoder.py", line 431 in _iterencode
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/json/__init__.py", line 179 in dump
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py", line 376 in save_to_dir
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/site-packages/ray/tune/execution/experiment_state.py", line 232 in checkpoint
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py", line 491 in checkpoint
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/site-packages/ray/tune/execution/tune_controller.py", line 269 in step
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/site-packages/ray/tune/tune.py", line 1070 in run
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/site-packages/ray/tune/impl/tuner_internal.py", line 712 in _fit_internal
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/site-packages/ray/tune/impl/tuner_internal.py", line 588 in fit
  File "/home/starkj/miniconda3/envs/cda0/lib/python3.10/site-packages/ray/tune/tuner.py", line 347 in fit
  File "/home/starkj/projects/cda1/staging/tune.py", line 182 in main
  File "/home/starkj/projects/cda1/staging/tune.py", line 195 in <module>

Hi @starkj, is this on the latest version of Ray?

Also, one thing to look out for: what is in your Tuner param_space? It gets pickled as part of the trial metadata, which could be the source of this serialization error.
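One quick way to check is to round-trip the param_space through cloudpickle yourself before handing it to the Tuner. A hedged sketch (the `param_space` dict and `MyCallbacks` class below are stand-ins for your objects, and the stdlib `pickle` fallback is only for illustration):

```python
try:
    from ray import cloudpickle as pickle  # the serializer Tune uses internally
except ImportError:
    import pickle  # stdlib stand-in, just for illustration

class MyCallbacks:  # placeholder for e.g. cda_callbacks.CdaCallbacks
    pass

# Stand-in for your param_space / algo config dict.
param_space = {"lr": 0.001, "callbacks": MyCallbacks, "env_config": {"scenario": 0}}

blob = pickle.dumps(param_space)  # roughly what the experiment-state save does
restored = pickle.loads(blob)
print(restored["env_config"])     # → {'scenario': 0}
```

If this raises (or crashes) outside of Tune, the offending object in the param_space is easier to spot.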

Hi @justinvyu. I am running Ray 2.5.1. I’m stuck on that version for the moment because 2.6 took away my ability to run a Tuner job on my local laptop, which is all I have to work with for now. But that’s another story.

My param_space is set to the algorithm config as I have modified it; the resulting config dict is below. Is there anything in particular here that looks suspicious?

///// SAC training params are:

_deterministic_loss: false
_disable_action_flattening: false
_disable_execution_plan_api: true
_disable_initialize_loss_from_dummy_batch: false
_disable_preprocessor_api: false
_enable_learner_api: false
_enable_rl_module_api: false
_fake_gpus: false
_tf_policy_handles_more_than_one_loss: false
_use_beta_distribution: false
actions_in_input_normalized: false
always_attach_evaluation_results: false
auto_wrap_old_gym_envs: true
batch_mode: complete_episodes
callbacks: <class 'cda_callbacks.CdaCallbacks'>
checkpoint_trainable_policies_only: false
clip_actions: false
compress_observations: false
create_env_on_driver: false
custom_resources_per_worker: {}
delay_between_worker_restarts_s: 60.0
disable_env_checking: false
eager_max_retraces: 20
eager_tracing: false
enable_async_evaluation: false
enable_connectors: true
enable_tf1_exec_eagerly: false
env: <class 'highway_env_wrapper.HighwayEnvWrapper'>
env_config:
  debug: 0
  episode_length: 80
  ignore_neighbor_crashes: true
  scenario: 0
  time_step_size: 0.2
  training: true
  vehicle_file: /home/starkj/projects/cda1/vehicle_config.yaml
  verify_obs: false
evaluation_duration: 10
evaluation_duration_unit: episodes
evaluation_num_workers: 0
evaluation_parallel_to_training: false
evaluation_sample_timeout_s: 180.0
exploration_config:
  final_scale: 0.1
  initial_scale: 1.0
  random_timesteps: 10000
  scale_timesteps: 12000000
  stddev: 0.25
  type: GaussianNoise
explore: true
export_native_model_files: false
extra_python_environs_for_driver: {}
extra_python_environs_for_worker: {}
fake_sampler: false
framework: torch
gamma: 0.995
grad_clip: 1.0
grad_clip_by: global_norm
horizon: -1
ignore_worker_failures: false
in_evaluation: false
initial_alpha: 0.2
input: sampler
input_config: {}
keep_per_episode_custom_metrics: false
local_gpu_idx: 0
local_tf_session_args:
  inter_op_parallelism_threads: 8
  intra_op_parallelism_threads: 8
log_level: WARN
log_sys_usage: true
lr: 0.001
max_num_worker_restarts: 1000
max_requests_in_flight_per_sampler_worker: 2
metrics_episode_collection_timeout_s: 60.0
metrics_num_episodes_for_smoothing: 100
min_sample_timesteps_per_iteration: 100
min_time_s_per_iteration: 1
min_train_timesteps_per_iteration: 0
model:
  _disable_action_flattening: false
  _disable_preprocessor_api: false
  _time_major: false
  _use_default_native_models: -1
  always_check_shapes: false
  attention_dim: 64
  attention_head_dim: 32
  attention_init_gru_gate_bias: 2.0
  attention_memory_inference: 50
  attention_memory_training: 50
  attention_num_heads: 1
  attention_num_transformer_units: 1
  attention_position_wise_mlp_dim: 32
  attention_use_n_prev_actions: 0
  attention_use_n_prev_rewards: 0
  conv_activation: relu
  conv_filters: null
  custom_action_dist: null
  custom_model: null
  custom_model_config: {}
  custom_preprocessor: null
  dim: 84
  encoder_latent_dim: null
  fcnet_activation: tanh
  fcnet_hiddens:
  - 256
  - 256
  framestack: true
  free_log_std: false
  grayscale: false
  lstm_cell_size: 256
  lstm_use_prev_action: false
  lstm_use_prev_action_reward: -1
  lstm_use_prev_reward: false
  max_seq_len: 20
  no_final_linear: false
  post_fcnet_activation: relu
  post_fcnet_hiddens: []
  use_attention: false
  use_lstm: false
  vf_share_layers: true
  zero_mean: true
multiagent:
  count_steps_by: env_steps
  observation_fn: null
  policies:
    default_policy: [null, null, null, null]
  policies_to_train: null
  policy_map_cache: -1
  policy_map_capacity: 100
  policy_mapping_fn: <function AlgorithmConfig.DEFAULT_POLICY_MAPPING_FN at 0x7f9ac1bf3250>
n_step: 1
no_done_at_end: -1
normalize_actions: true
num_consecutive_worker_failures_tolerance: 100
num_cpus_for_driver: 2
num_cpus_per_learner_worker: 1
num_cpus_per_worker: 2
num_envs_per_worker: 1
num_gpus: 0.5
num_gpus_per_learner_worker: 0
num_gpus_per_worker: 0
num_learner_workers: 0
num_steps_sampled_before_learning_starts: 1500
num_workers: 0
observation_filter: NoFilter
off_policy_estimation_methods: {}
offline_sampling: false
ope_split_batch_by_episode: true
optimization:
  actor_learning_rate: <ray.tune.search.sample.Float object at 0x7f9ac18b1840>
  critic_learning_rate: <ray.tune.search.sample.Float object at 0x7f9ac18b0040>
  entropy_learning_rate: <ray.tune.search.sample.Float object at 0x7f9ac18b3bb0>
optimizer: {}
output_compress_columns:
- obs
- new_obs
output_config: {}
output_max_file_size: 67108864
placement_strategy: PACK
policy_model_config:
  custom_model: null
  custom_model_config: {}
  fcnet_activation: relu
  fcnet_hiddens:
  - 600
  - 256
  - 128
  post_fcnet_activation: null
  post_fcnet_hiddens: []
policy_states_are_swappable: false
postprocess_inputs: false
preprocessor_pref: deepmind
q_model_config:
  custom_model: null
  custom_model_config: {}
  fcnet_activation: relu
  fcnet_hiddens:
  - 600
  - 256
  - 128
  post_fcnet_activation: null
  post_fcnet_hiddens: []
recreate_failed_workers: true
remote_env_batch_wait_ms: 0
remote_worker_envs: false
render_env: false
replay_buffer_config:
  _enable_replay_buffer_api: true
  capacity: 1000000
  prioritized_replay: true
  prioritized_replay_alpha: 0.6
  prioritized_replay_beta: 0.4
  prioritized_replay_eps: 1.0e-06
  type: MultiAgentPrioritizedReplayBuffer
  worker_side_prioritization: false
restart_failed_sub_environments: false
rollout_fragment_length: 80
sample_async: false
sample_collector: <class 'ray.rllib.evaluation.collectors.simple_list_collector.SimpleListCollector'>
seed: 17
shuffle_buffer_size: 0
simple_optimizer: -1
soft_horizon: -1
store_buffer_in_checkpoints: false
sync_filters_on_rollout_workers_timeout_s: 60.0
synchronize_filters: true
target_entropy: auto
target_network_update_freq: 0
tau: 0.005
tf_session_args:
  allow_soft_placement: true
  device_count:
    CPU: 1
  gpu_options:
    allow_growth: true
  inter_op_parallelism_threads: 2
  intra_op_parallelism_threads: 2
  log_device_placement: false
train_batch_size: 1040
twin_q: true
use_state_preprocessor: -1
validate_workers_after_construction: true
worker_health_probe_timeout_s: 60
worker_restore_timeout_s: 1800
worker_side_prioritization: -1
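To try to narrow it down on my end, I can pickle each top-level key of the config separately. A sketch (the two-key `config_dict` below is a stand-in for the full dict above; note that a true segfault in the C encoder would not be caught by try/except, but an ordinary pickling error would be):

```python
try:
    from ray import cloudpickle as pickle  # same serializer Tune uses
except ImportError:
    import pickle  # stdlib stand-in for illustration

# Stand-in for the full config dict dumped above.
config_dict = {"gamma": 0.995, "optimization": {"actor_learning_rate": 0.001}}

bad_keys = []
for key, value in config_dict.items():
    try:
        pickle.dumps(value)
    except Exception as exc:  # a hard segfault would NOT land here
        bad_keys.append((key, repr(exc)))

print(bad_keys or "all top-level keys pickled cleanly")
```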

Thanks for your consideration.