The following output is from the run with two workers.
/home/ray/anaconda3/lib/python3.8/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
file-----------
main--------------
2022-06-04 13:44:57,034 INFO packaging.py:323 -- Pushing file package 'gcs://_ray_pkg_e1ea93f0c19f60918b7912b43b338386.zip' (0.09MiB) to Ray cluster...
2022-06-04 13:44:57,036 INFO packaging.py:332 -- Successfully pushed file package 'gcs://_ray_pkg_e1ea93f0c19f60918b7912b43b338386.zip'.
2022-06-04 13:44:57,051 INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/train_2022-06-04_13-44-57
(BaseWorkerMixin pid=4875) 2022-06-04 13:45:00,748 INFO torch.py:349 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=4874) 2022-06-04 13:45:00,749 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=2]
2022-06-04 13:45:01,817 INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/train_2022-06-04_13-44-57/run_001
(BaseWorkerMixin pid=4875) 1000
(BaseWorkerMixin pid=4874) 1000
(BaseWorkerMixin pid=4875) 2022-06-04 13:45:02,086 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=4875) 2022-06-04 13:45:02,087 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=4874) 2022-06-04 13:45:02,087 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=4874) 2022-06-04 13:45:02,087 INFO torch.py:135 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=4875) loss: 2.304449 [ 0/30000]
(BaseWorkerMixin pid=4874) loss: 2.304529 [ 0/30000]
(BaseWorkerMixin pid=4875) Test Error:
(BaseWorkerMixin pid=4875) Accuracy: 14.8%, Avg loss: 2.301992
(BaseWorkerMixin pid=4875)
(BaseWorkerMixin pid=4874) Test Error:
(BaseWorkerMixin pid=4874) Accuracy: 16.1%, Avg loss: 2.301196
(BaseWorkerMixin pid=4874)
(BaseWorkerMixin pid=4875) loss: 2.302403 [ 0/30000]
(BaseWorkerMixin pid=4874) loss: 2.303127 [ 0/30000]
(BaseWorkerMixin pid=4875) Test Error:
(BaseWorkerMixin pid=4875) Accuracy: 16.5%, Avg loss: 2.300077
(BaseWorkerMixin pid=4875)
(BaseWorkerMixin pid=4874) Test Error:
(BaseWorkerMixin pid=4874) Accuracy: 17.9%, Avg loss: 2.299322
(BaseWorkerMixin pid=4874)
(BaseWorkerMixin pid=4875) loss: 2.300217 [ 0/30000]
(BaseWorkerMixin pid=4874) loss: 2.301674 [ 0/30000]
(BaseWorkerMixin pid=4875) Test Error:
(BaseWorkerMixin pid=4875) Accuracy: 18.2%, Avg loss: 2.297997
(BaseWorkerMixin pid=4875)
(BaseWorkerMixin pid=4874) Test Error:
(BaseWorkerMixin pid=4874) Accuracy: 19.6%, Avg loss: 2.297340
(BaseWorkerMixin pid=4874)
(BaseWorkerMixin pid=4875) loss: 2.297815 [ 0/30000]
(BaseWorkerMixin pid=4874) loss: 2.300117 [ 0/30000]
(BaseWorkerMixin pid=4875) Test Error:
(BaseWorkerMixin pid=4875) Accuracy: 20.1%, Avg loss: 2.295787
(BaseWorkerMixin pid=4875)
(BaseWorkerMixin pid=4874) Test Error:
(BaseWorkerMixin pid=4874) Accuracy: 21.5%, Avg loss: 2.295250
(BaseWorkerMixin pid=4874)
(BaseWorkerMixin pid=4875) loss: 2.295287 [ 0/30000]
(BaseWorkerMixin pid=4874) loss: 2.298449 [ 0/30000]
(BaseWorkerMixin pid=4875) Test Error:
(BaseWorkerMixin pid=4875) Accuracy: 22.6%, Avg loss: 2.293510
(BaseWorkerMixin pid=4875)
(BaseWorkerMixin pid=4874) Test Error:
(BaseWorkerMixin pid=4874) Accuracy: 23.6%, Avg loss: 2.293094
(BaseWorkerMixin pid=4874)
(BaseWorkerMixin pid=4875) loss: 2.292696 [ 0/30000]
(BaseWorkerMixin pid=4874) loss: 2.296699 [ 0/30000]
(BaseWorkerMixin pid=4875) Test Error:
(BaseWorkerMixin pid=4875) Accuracy: 25.1%, Avg loss: 2.291205
(BaseWorkerMixin pid=4875)
(BaseWorkerMixin pid=4874) Test Error:
(BaseWorkerMixin pid=4874) Accuracy: 25.7%, Avg loss: 2.290898
(BaseWorkerMixin pid=4874)
(BaseWorkerMixin pid=4875) loss: 2.290082 [ 0/30000]
(BaseWorkerMixin pid=4874) loss: 2.294896 [ 0/30000]
(BaseWorkerMixin pid=4875) Test Error:
(BaseWorkerMixin pid=4875) Accuracy: 27.9%, Avg loss: 2.288895
(BaseWorkerMixin pid=4875)
(BaseWorkerMixin pid=4874) Test Error:
(BaseWorkerMixin pid=4874) Accuracy: 28.1%, Avg loss: 2.288690
(BaseWorkerMixin pid=4874)
(BaseWorkerMixin pid=4875) loss: 2.287466 [ 0/30000]
(BaseWorkerMixin pid=4874) loss: 2.293067 [ 0/30000]
(BaseWorkerMixin pid=4875) Test Error:
(BaseWorkerMixin pid=4875) Accuracy: 30.0%, Avg loss: 2.286576
(BaseWorkerMixin pid=4875)
(BaseWorkerMixin pid=4874) Test Error:
(BaseWorkerMixin pid=4874) Accuracy: 30.0%, Avg loss: 2.286469
(BaseWorkerMixin pid=4874)
(BaseWorkerMixin pid=4875) loss: 2.284843 [ 0/30000]
(BaseWorkerMixin pid=4874) loss: 2.291225 [ 0/30000]
(BaseWorkerMixin pid=4875) Test Error:
(BaseWorkerMixin pid=4875) Accuracy: 32.2%, Avg loss: 2.284245
(BaseWorkerMixin pid=4875)
(BaseWorkerMixin pid=4874) Test Error:
(BaseWorkerMixin pid=4874) Accuracy: 32.5%, Avg loss: 2.284235
(BaseWorkerMixin pid=4874)
(BaseWorkerMixin pid=4875) loss: 2.282205 [ 0/30000]
(BaseWorkerMixin pid=4874) loss: 2.289369 [ 0/30000]
train time ---------------- 96.40072703361511
(BaseWorkerMixin pid=4875) Test Error:
(BaseWorkerMixin pid=4875) Accuracy: 33.9%, Avg loss: 2.281895
(BaseWorkerMixin pid=4875)
(BaseWorkerMixin pid=4874) Test Error:
(BaseWorkerMixin pid=4874) Accuracy: 34.2%, Avg loss: 2.281981
(BaseWorkerMixin pid=4874)
Loss results: [[2.3011959075927733, 2.299322414398193, 2.297340440750122, 2.2952504634857176, 2.2930936336517336, 2.290898323059082, 2.288690376281738, 2.2864691734313967, 2.2842350959777833, 2.281980800628662], [2.301992082595825, 2.300076627731323, 2.29799747467041, 2.2957868576049805, 2.293509531021118, 2.291204643249512, 2.288894844055176, 2.2865764141082763, 2.284245252609253, 2.281894826889038]]
The following output is from the run with one worker.
file-----------
main--------------
2022-06-04 13:48:27,765 INFO worker.py:862 -- Using address localhost:9031 set in the environment variable RAY_ADDRESS
2022-06-04 13:48:27,806 INFO worker.py:964 -- Connecting to existing Ray cluster at address: 172.31.75.49:9031
2022-06-04 13:48:27,808 INFO worker.py:981 -- Calling ray.init() again after it has already been called.
2022-06-04 13:48:27,811 INFO trainer.py:243 -- Trainer logs will be logged in: /home/ray/ray_results/train_2022-06-04_13-48-27
(bundle_reservation_check_func pid=4826) E0604 13:48:29.277654244 4873 chttp2_transport.cc:1103] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
2022-06-04 13:48:31,765 INFO trainer.py:249 -- Run results will be logged in: /home/ray/ray_results/train_2022-06-04_13-48-27/run_001
(BaseWorkerMixin pid=5654) 2022-06-04 13:48:31,728 INFO torch.py:349 -- Setting up process group for: env:// [rank=0, world_size=1]
(BaseWorkerMixin pid=5654) 2000
(BaseWorkerMixin pid=5654) 2022-06-04 13:48:32,029 INFO torch.py:97 -- Moving model to device: cpu
(BaseWorkerMixin pid=5654) loss: 2.305619 [ 0/60000]
(BaseWorkerMixin pid=5654) Test Error:
(BaseWorkerMixin pid=5654) Accuracy: 10.8%, Avg loss: 2.304033
(BaseWorkerMixin pid=5654)
(BaseWorkerMixin pid=5654) loss: 2.303440 [ 0/60000]
(BaseWorkerMixin pid=5654) Test Error:
(BaseWorkerMixin pid=5654) Accuracy: 12.2%, Avg loss: 2.301719
(BaseWorkerMixin pid=5654)
(BaseWorkerMixin pid=5654) loss: 2.301149 [ 0/60000]
(BaseWorkerMixin pid=5654) Test Error:
(BaseWorkerMixin pid=5654) Accuracy: 13.6%, Avg loss: 2.299365
(BaseWorkerMixin pid=5654)
(BaseWorkerMixin pid=5654) loss: 2.298802 [ 0/60000]
(BaseWorkerMixin pid=5654) Test Error:
(BaseWorkerMixin pid=5654) Accuracy: 14.9%, Avg loss: 2.297001
(BaseWorkerMixin pid=5654)
(BaseWorkerMixin pid=5654) loss: 2.296452 [ 0/60000]
(BaseWorkerMixin pid=5654) Test Error:
(BaseWorkerMixin pid=5654) Accuracy: 16.1%, Avg loss: 2.294642
(BaseWorkerMixin pid=5654)
(BaseWorkerMixin pid=5654) loss: 2.294113 [ 0/60000]
(BaseWorkerMixin pid=5654) Test Error:
(BaseWorkerMixin pid=5654) Accuracy: 17.1%, Avg loss: 2.292277
(BaseWorkerMixin pid=5654)
(BaseWorkerMixin pid=5654) loss: 2.291781 [ 0/60000]
(BaseWorkerMixin pid=5654) Test Error:
(BaseWorkerMixin pid=5654) Accuracy: 18.6%, Avg loss: 2.289894
(BaseWorkerMixin pid=5654)
(BaseWorkerMixin pid=5654) loss: 2.289440 [ 0/60000]
(BaseWorkerMixin pid=5654) Test Error:
(BaseWorkerMixin pid=5654) Accuracy: 20.3%, Avg loss: 2.287486
(BaseWorkerMixin pid=5654)
(BaseWorkerMixin pid=5654) loss: 2.287072 [ 0/60000]
(BaseWorkerMixin pid=5654) Test Error:
(BaseWorkerMixin pid=5654) Accuracy: 22.5%, Avg loss: 2.285050
(BaseWorkerMixin pid=5654)
(BaseWorkerMixin pid=5654) loss: 2.284687 [ 0/60000]
train time ---------------- 178.2441258430481
(BaseWorkerMixin pid=5654) Test Error:
(BaseWorkerMixin pid=5654) Accuracy: 24.9%, Avg loss: 2.282580
(BaseWorkerMixin pid=5654)
Loss results: [[2.304033136367798, 2.301719379425049, 2.299365425109863, 2.2970014095306395, 2.2946418285369874, 2.292277193069458, 2.289893627166748, 2.2874859809875487, 2.2850500106811524, 2.282579851150513]]
The training time is 96.40072703361511 with two workers vs 178.2441258430481 with one worker.
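To put the two training times in perspective, the speedup and parallel efficiency can be worked out directly from the `train time` values printed in the logs above. This is only a rough sketch: the variable names are illustrative and not part of the original script, and it assumes both timings measure the same amount of work (10 epochs over the same dataset, split 30000/30000 across two workers versus 60000 on one worker, as the logs suggest).

```python
# Training times copied from the "train time" lines in the logs above.
# Variable names are hypothetical; they do not come from the original script.
time_one_worker = 178.2441258430481
time_two_workers = 96.40072703361511
num_workers = 2

# Speedup: how many times faster the two-worker run finished.
speedup = time_one_worker / time_two_workers

# Parallel efficiency: speedup divided by the number of workers (1.0 is ideal).
efficiency = speedup / num_workers

print(f"speedup: {speedup:.3f}x")
print(f"parallel efficiency: {efficiency:.1%}")
```

Under these assumptions the two-worker run is roughly 1.85x faster, which is about 92% of the ideal 2x scaling.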