%matplotlib inline
import pandas as pd
import socket
host = socket.getfqdn()
from core import load, zoom, calc, save, plots, monitor
#reload funcs after updating ./core/*.py
import importlib
importlib.reload(load)
importlib.reload(zoom)
importlib.reload(calc)
importlib.reload(save)
importlib.reload(plots)
importlib.reload(monitor)
<module 'core.monitor' from '/ccc/work/cont003/gen7420/odakatin/monitor-sedna/notebook/core/monitor.py'>
If you submit the job through a job scheduler, the following environment variables can be passed:
local: if 'True', run a Dask local cluster; otherwise, set 'local' to the number of workers to use. If 'local' is not given, it is automatically set to 'True'.
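As a rough illustration of the 'local' convention described above, the sketch below interprets the variable's string value. The helper name `parse_local` and its return shape are assumptions made for this example; `load.set_control` may well handle the variable differently.

```python
import os


def parse_local(value=None):
    """Interpret the 'local' setting (hypothetical helper, not part of core/).

    Returns (use_local_cluster, n_workers). n_workers is None when a local
    cluster is requested and the worker count is left to the defaults.
    """
    if value is None:
        # 'local' not given: fall back to 'True', as described above
        value = os.environ.get('local', 'True')
    # Tolerate surrounding whitespace or quotes in the raw string
    value = value.strip().strip("'\"")
    if value == 'True':
        return True, None
    # Any other value is read as the requested number of workers
    return False, int(value)


print(parse_local('True'))
print(parse_local('20'))
```

Returning a pair keeps the two meanings of the variable (a switch and a worker count) apart, so downstream code does not have to re-inspect the raw string.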
%env ychunk='2', #%env tchunk='2'
ychunk and tchunk control the chunking. 'False' leaves the chunking of the original netCDF files unmodified.
ychunk=10 groups the original netCDF chunks along y ten at a time.
tchunk=1 chunks the time coordinate one time step at a time.
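To make the effect of ychunk concrete, here is a minimal sketch of grouping consecutive chunk sizes along y. The function `group_chunks` is a hypothetical stand-in written for this note, not the notebook's own rechunking routine, and the chunk sizes are assumed values chosen to add up to a y dimension of 6540.

```python
def group_chunks(chunks, n):
    """Merge every n consecutive chunk sizes into one larger chunk."""
    return tuple(sum(chunks[i:i + n]) for i in range(0, len(chunks), n))


# Assumed per-file chunk sizes along y: 12 chunks of 13 rows, then 532 of 12
original = (13,) * 12 + (12,) * 532
grouped = group_chunks(original, 10)

print(sum(original))                            # total number of rows along y
print(grouped[:3], grouped[-1], len(grouped))   # first groups, last group, count
```

With xarray, a tuple like `grouped` could then be passed as `ds.chunk({'y': grouped})`. The last group is smaller than the others whenever the number of original chunks is not a multiple of ychunk.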
%env file_exp=
'file_exp': the name of the experiment. This corresponds to the intake catalog name without the path and the .yaml extension.
#%env year=
For validation, this corresponds to the year in path/year/month.
For monitoring, this corresponds to 'date'; it can be set so that all files in the monitoring directory are processed. By setting it to 0[0-9], 1[0-9] and [2-3][0-9] in turn, the job can be separated into three lots.
For the DELTA experiment, year is the actual year.
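The three patterns mentioned above can be checked with a small sketch. It assumes that 'date' is a two-digit day of the month and that the patterns behave like shell-style globs; both are assumptions made for illustration, not details confirmed by the notebook.

```python
from fnmatch import fnmatchcase

patterns = ['0[0-9]', '1[0-9]', '[2-3][0-9]']
# Assumption: 'date' is a two-digit day of the month, 01..31
days = [f'{d:02d}' for d in range(1, 32)]

# Collect the days matched by each pattern
lots = {p: [d for d in days if fnmatchcase(d, p)] for p in patterns}

for pattern, matched in lots.items():
    print(pattern, len(matched), matched[0], '...', matched[-1])
```

Under these assumptions every day falls into exactly one lot, so three jobs submitted with the three patterns cover the whole directory without overlap.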
%env month=
For monitoring, this corresponds to the file path path-XIOS.{month}/.
For the DELTA experiment, month is the actual month.
Proceed with saving? True or False; the default is True.
Proceed with plotting? True or False; the default is True.
Proceed with the computation, or just load the already computed result? True or False; the default is True.
Save the output file used for plotting.
Using a kerchunked file -> False; not using kerchunk -> True.
Name of the control file to be used for computation, plotting and saving. There are a number of M_xxx.csv files.
Monitor.sh calls M_MLD_2D
and AWTD.sh, Fluxnet.sh, Siconc.sh, IceClim.sh, FWC_SSH.sh, Integrals.sh , Sections.sh
M_AWTMD
M_Fluxnet
M_Ice_quantities
M_IceClim M_IceConce M_IceThick
M_FWC_2D M_FWC_integrals M_FWC_SSH M_SSH_anomaly
M_Mean_temp_velo M_Mooring
M_Sectionx M_Sectiony
%%time
# 'savefig': whether to save the output as HTML. Keep it True.
savefig=True
client,cluster,control,catalog_url,month,year,daskreport,outputpath = load.set_control(host)
!mkdir -p $outputpath
!mkdir -p $daskreport
client
local True using host= irene4188.c-irene.mg1.tgcc.ccc.cea.fr starting dask cluster on local= True workers 16 10000000000 rome local cluster starting This code is running on irene4188.c-irene.mg1.tgcc.ccc.cea.fr using SEDNA_DELTA_MONITOR file experiment, read from ../lib/SEDNA_DELTA_MONITOR.yaml on year= 2012 on month= 04 outputpath= ../results/SEDNA_DELTA_MONITOR/ daskreport= ../results/dask/6475988irene4188.c-irene.mg1.tgcc.ccc.cea.fr_SEDNA_DELTA_MONITOR_04M_FWC_2D/ CPU times: user 593 ms, sys: 143 ms, total: 737 ms Wall time: 21.7 s
Client-901dbf83-196e-11ed-8f41-080038b9321d

Connection method | Cluster type | Dashboard |
---|---|---|
Cluster object | distributed.LocalCluster | http://127.0.0.1:8787/status |

LocalCluster c0212c36

Dashboard | Workers | Total threads | Total memory | Status | Using processes |
---|---|---|---|---|---|
http://127.0.0.1:8787/status | 16 | 128 | 251.06 GiB | running | True |

Scheduler-9a7506ae-6a0c-4573-9e18-35d2c59a4084

Comm | Dashboard | Workers | Total threads | Total memory | Started |
---|---|---|---|---|---|
tcp://127.0.0.1:45810 | http://127.0.0.1:8787/status | 16 | 128 | 251.06 GiB | Just now |

Comm | Dashboard | Nanny | Local directory | Total threads | Memory |
---|---|---|---|---|---|
tcp://127.0.0.1:45644 | http://127.0.0.1:40137/status | tcp://127.0.0.1:45849 | `/tmp/dask-worker-space/worker-128oplkc` | 8 | 15.69 GiB |
tcp://127.0.0.1:43026 | http://127.0.0.1:34254/status | tcp://127.0.0.1:36352 | `/tmp/dask-worker-space/worker-47hn__ya` | 8 | 15.69 GiB |
tcp://127.0.0.1:40294 | http://127.0.0.1:33175/status | tcp://127.0.0.1:40831 | `/tmp/dask-worker-space/worker-4nb8yzbs` | 8 | 15.69 GiB |
tcp://127.0.0.1:44301 | http://127.0.0.1:36260/status | tcp://127.0.0.1:41866 | `/tmp/dask-worker-space/worker-ag8nfu_h` | 8 | 15.69 GiB |
tcp://127.0.0.1:36768 | http://127.0.0.1:42745/status | tcp://127.0.0.1:36336 | `/tmp/dask-worker-space/worker-2vantib1` | 8 | 15.69 GiB |
tcp://127.0.0.1:36506 | http://127.0.0.1:39084/status | tcp://127.0.0.1:39841 | `/tmp/dask-worker-space/worker-_rs5mr4a` | 8 | 15.69 GiB |
tcp://127.0.0.1:38411 | http://127.0.0.1:39716/status | tcp://127.0.0.1:44025 | `/tmp/dask-worker-space/worker-ohzmyu2t` | 8 | 15.69 GiB |
tcp://127.0.0.1:43394 | http://127.0.0.1:43963/status | tcp://127.0.0.1:36128 | `/tmp/dask-worker-space/worker-dtlxxt_n` | 8 | 15.69 GiB |
tcp://127.0.0.1:37095 | http://127.0.0.1:33995/status | tcp://127.0.0.1:44147 | `/tmp/dask-worker-space/worker-65uk9_0p` | 8 | 15.69 GiB |
tcp://127.0.0.1:42872 | http://127.0.0.1:40785/status | tcp://127.0.0.1:32981 | `/tmp/dask-worker-space/worker-ac_kxeuf` | 8 | 15.69 GiB |
tcp://127.0.0.1:39929 | http://127.0.0.1:45427/status | tcp://127.0.0.1:36895 | `/tmp/dask-worker-space/worker-157e9quh` | 8 | 15.69 GiB |
tcp://127.0.0.1:43124 | http://127.0.0.1:36466/status | tcp://127.0.0.1:45513 | `/tmp/dask-worker-space/worker-1_dx9itn` | 8 | 15.69 GiB |
tcp://127.0.0.1:32919 | http://127.0.0.1:36405/status | tcp://127.0.0.1:41753 | `/tmp/dask-worker-space/worker-8nmmflqc` | 8 | 15.69 GiB |
tcp://127.0.0.1:39315 | http://127.0.0.1:39158/status | tcp://127.0.0.1:35461 | `/tmp/dask-worker-space/worker-_r_rdf9a` | 8 | 15.69 GiB |
tcp://127.0.0.1:36041 | http://127.0.0.1:37882/status | tcp://127.0.0.1:42876 | `/tmp/dask-worker-space/worker-e9t1770o` | 8 | 15.69 GiB |
tcp://127.0.0.1:44210 | http://127.0.0.1:40233/status | tcp://127.0.0.1:39461 | `/tmp/dask-worker-space/worker-e_muelge` | 8 | 15.69 GiB |

df=load.controlfile(control)
#Take out 'later' tagged computations
#df=df[~df['Value'].str.contains('later')]
df
Value | Inputs | Equation | Zone | Plot | Colourmap | MinMax | Unit | Oldname | Unnamed: 10 | |
---|---|---|---|---|---|---|---|---|---|---|
FWC_2D | gridS.vosaline,param.mask,param.e3t,param.e1te2t | calc.FWC2D_UFUNC(data) | BBFG | maps | Spectral_r | (0,24) | m | S-1 |
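
The Inputs column packs several `source.variable` entries into one comma-separated string. The sketch below shows one way such a string could be split by source; `split_inputs` is a hypothetical helper written for illustration and is not taken from `core/load.py`.

```python
from collections import defaultdict


def split_inputs(inputs):
    """Group 'source.variable' entries by source (hypothetical helper)."""
    grouped = defaultdict(list)
    for item in inputs.split(','):
        # Split only on the first dot: the left part is the data source
        source, variable = item.strip().split('.', 1)
        grouped[source].append(variable)
    return dict(grouped)


print(split_inputs('gridS.vosaline,param.mask,param.e3t,param.e1te2t'))
```

For the row above, this separates the single model field read from gridS from the three grid parameters read from param.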
Each computation consists of
%%time
import os
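# Switches arrive from environment variables as strings, so compare with 'True'
calcswitch = os.environ.get('calc', 'True')
lazy = os.environ.get('lazy', 'False')
# True if at least one row of the control file lists input data
loaddata = (df.Inputs != '').any()
print('calcswitch=', calcswitch, 'df.Inputs != nothing', loaddata, 'lazy=', lazy)
# Load data only when computation is requested and inputs are listed
data = load.datas(catalog_url, df.Inputs, month, year, daskreport, lazy=lazy) if (calcswitch == 'True' and loaddata) else 0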
data
calcswitch= True df.Inputs != nothing True lazy= False ../lib/SEDNA_DELTA_MONITOR.yaml using param_xios reading ../lib/SEDNA_DELTA_MONITOR.yaml using param_xios reading <bound method DataSourceBase.describe of sources: param_xios: args: combine: nested concat_dim: y urlpath: /ccc/work/cont003/gen7420/odakatin/CONFIGS/SEDNA/SEDNA-I/SEDNA_Domain_cfg_Tgt_20210423_tsh10m_L1/param_f32/x_*.nc xarray_kwargs: compat: override coords: minimal data_vars: minimal parallel: true description: SEDNA NEMO parameters from MPI output nav_lon lat fails driver: intake_xarray.netcdf.NetCDFSource metadata: catalog_dir: /ccc/work/cont003/gen7420/odakatin/monitor-sedna/notebook/../lib/ > {'name': 'param_xios', 'container': 'xarray', 'plugin': ['netcdf'], 'driver': ['netcdf'], 'description': 'SEDNA NEMO parameters from MPI output nav_lon lat fails', 'direct_access': 'forbid', 'user_parameters': [{'name': 'path', 'description': 'file coordinate', 'type': 'str', 'default': '/ccc/work/cont003/gen7420/odakatin/CONFIGS/SEDNA/MESH/SEDNA_mesh_mask_Tgt_20210423_tsh10m_L1/param'}], 'metadata': {}, 'args': {'urlpath': '/ccc/work/cont003/gen7420/odakatin/CONFIGS/SEDNA/SEDNA-I/SEDNA_Domain_cfg_Tgt_20210423_tsh10m_L1/param_f32/x_*.nc', 'combine': 'nested', 'concat_dim': 'y'}} 0 read gridS ['vosaline'] lazy= False using load_data_xios_kerchunk reading gridS using load_data_xios_kerchunk reading <bound method DataSourceBase.describe of sources: data_xios_kerchunk: args: consolidated: false storage_options: fo: file:////ccc/cont003/home/ra5563/ra5563/catalogue/DELTA/201204/gridS_0[0-5][0-9][0-9].json target_protocol: file urlpath: reference:// description: CREG025 NEMO outputs from different xios server in kerchunk format driver: intake_xarray.xzarr.ZarrSource metadata: catalog_dir: /ccc/work/cont003/gen7420/odakatin/monitor-sedna/notebook/../lib/ > took 38.9096245765686 seconds 0 merging gridS ['vosaline'] param mask will be included in data param nav_lat will be included in data param e3t will be 
included in data param nav_lon will be included in data param mask2d will be included in data param e1te2t will be included in data ychunk= 10 calldatas_y_rechunk sum_num (13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12) start rechunking with (130, 122, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 48) end of y_rechunk CPU times: user 21.2 s, sys: 3.25 s, total: 24.4 s Wall time: 1min 2s
<xarray.Dataset> Dimensions: (t: 30, z: 150, y: 6540, x: 6560) Coordinates: * t (t) object 2012-04-01 12:00:00 ... 2012-04-30 12:00:00 * y (y) int64 1 2 3 4 5 6 7 8 ... 6534 6535 6536 6537 6538 6539 6540 * x (x) int64 1 2 3 4 5 6 7 8 ... 6554 6555 6556 6557 6558 6559 6560 * z (z) int64 1 2 3 4 5 6 7 8 9 ... 143 144 145 146 147 148 149 150 mask (z, y, x) bool dask.array<chunksize=(150, 130, 6560), meta=np.ndarray> nav_lat (y, x) float32 dask.array<chunksize=(130, 6560), meta=np.ndarray> e3t (z, y, x) float64 dask.array<chunksize=(150, 130, 6560), meta=np.ndarray> nav_lon (y, x) float32 dask.array<chunksize=(130, 6560), meta=np.ndarray> mask2d (y, x) bool dask.array<chunksize=(130, 6560), meta=np.ndarray> e1te2t (y, x) float64 dask.array<chunksize=(130, 6560), meta=np.ndarray> Data variables: vosaline (t, z, y, x) float32 dask.array<chunksize=(1, 150, 130, 6560), meta=np.ndarray> Attributes: (12/26) CASE: DELTA CONFIG: SEDNA Conventions: CF-1.6 DOMAIN_dimensions_ids: [2, 3] DOMAIN_halo_size_end: [0, 0] DOMAIN_halo_size_start: [0, 0] ... ... nj: 13 output_frequency: 1d start_date: 20090101 timeStamp: 2022-Jan-21 08:38:37 GMT title: ocean T grid variables uuid: d277f069-4681-4bdc-a897-fbf6d4f734e8
%%time
monitor.auto(df,data,savefig,daskreport,outputpath,file_exp='SEDNA'
)
#calc= True #save= True #plot= False Value='FWC_2D' Zone='BBFG' Plot='maps' cmap='Spectral_r' clabel='m' clim= (0, 24) outputpath='../results/SEDNA_DELTA_MONITOR/' nc_outputpath='../nc_results/SEDNA_DELTA_MONITOR/' filename='SEDNA_maps_BBFG_FWC_2D' data=monitor.optimize_dataset(data) #2 Zooming Data data= zoom.BBFG(data) data=monitor.optimize_dataset(data)
<xarray.Dataset> Dimensions: (t: 30, z: 150, y: 5264, x: 6560) Coordinates: * t (t) object 2012-04-01 12:00:00 ... 2012-04-30 12:00:00 * y (y) int64 1277 1278 1279 1280 1281 ... 6536 6537 6538 6539 6540 * x (x) int64 1 2 3 4 5 6 7 8 ... 6554 6555 6556 6557 6558 6559 6560 * z (z) int64 1 2 3 4 5 6 7 8 9 ... 143 144 145 146 147 148 149 150 mask (z, y, x) bool dask.array<chunksize=(150, 56, 6560), meta=np.ndarray> nav_lat (y, x) float32 dask.array<chunksize=(56, 6560), meta=np.ndarray> e3t (z, y, x) float64 dask.array<chunksize=(150, 56, 6560), meta=np.ndarray> nav_lon (y, x) float32 dask.array<chunksize=(56, 6560), meta=np.ndarray> mask2d (y, x) bool dask.array<chunksize=(56, 6560), meta=np.ndarray> e1te2t (y, x) float64 dask.array<chunksize=(56, 6560), meta=np.ndarray> Data variables: vosaline (t, z, y, x) float32 dask.array<chunksize=(1, 150, 56, 6560), meta=np.ndarray> Attributes: (12/26) CASE: DELTA CONFIG: SEDNA Conventions: CF-1.6 DOMAIN_dimensions_ids: [2, 3] DOMAIN_halo_size_end: [0, 0] DOMAIN_halo_size_start: [0, 0] ... ... nj: 13 output_frequency: 1d start_date: 20090101 timeStamp: 2022-Jan-21 08:38:37 GMT title: ocean T grid variables uuid: d277f069-4681-4bdc-a897-fbf6d4f734e8
#3 Start computing data= calc.FWC2D_UFUNC(data) monitor.optimize_dataset(data) add optimise here once otimise can recognise
<xarray.Dataset> Dimensions: (t: 30, y: 5264, x: 6560) Coordinates: * t (t) object 2012-04-01 12:00:00 ... 2012-04-30 12:00:00 * y (y) int64 1277 1278 1279 1280 1281 ... 6536 6537 6538 6539 6540 * x (x) int64 1 2 3 4 5 6 7 8 ... 6554 6555 6556 6557 6558 6559 6560 nav_lat (y, x) float32 dask.array<chunksize=(56, 6560), meta=np.ndarray> nav_lon (y, x) float32 dask.array<chunksize=(56, 6560), meta=np.ndarray> mask2d (y, x) bool dask.array<chunksize=(56, 6560), meta=np.ndarray> e1te2t (y, x) float64 dask.array<chunksize=(56, 6560), meta=np.ndarray> Data variables: FWC2D (t, y, x) float32 dask.array<chunksize=(1, 56, 6560), meta=np.ndarray>
#4 Saving SEDNA_maps_BBFG_FWC_2D data=save.datas(data,plot=Plot,path=nc_outputpath,filename=filename) start saving data saving data in a file t (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 slice(0, 1, None)
2022-08-11 14:13:06,931 - distributed.worker_memory - WARNING - Worker tcp://127.0.0.1:43394 (pid=249772) exceeded 99% memory budget. Restarting... 2022-08-11 14:13:08,103 - distributed.nanny - WARNING - Restarting worker 2022-08-11 14:13:08,193 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:43394 Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 233, in read n = await stream.read_into(chunk) tornado.iostream.StreamClosedError: Stream is closed The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1983, in gather_dep response = await get_data_from_worker( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2725, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 383, in retry_operation return await retry( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 368, in retry return await coro() File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2705, in _get_data response = await send_recv( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 918, in send_recv response = await comm.read(deserializers=deserializers) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read convert_stream_closed_error(self, e) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error raise 
CommClosedError(f"in {obj}: {exc}") from exc distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:42336 remote=tcp://127.0.0.1:43394>: Stream is closed
slice(1, 2, None)
2022-08-11 14:15:29,560 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:33234 Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 264, in write async def write(self, msg, serializers=None, on_error="message"): asyncio.exceptions.CancelledError During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/asyncio/tasks.py", line 418, in wait_for return fut.result() asyncio.exceptions.CancelledError The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/core.py", line 329, in connect await asyncio.wait_for(comm.write(local_info), time_left()) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/asyncio/tasks.py", line 420, in wait_for raise exceptions.TimeoutError() from exc asyncio.exceptions.TimeoutError The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1983, in gather_dep response = await get_data_from_worker( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2725, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 383, in retry_operation return await retry( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 368, in retry return await coro() File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2702, in _get_data comm = await 
rpc.connect(worker) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 1371, in connect return await connect_attempt File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 1307, in _connect comm = await connect( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/core.py", line 333, in connect raise OSError( OSError: Timed out during handshake while connecting to tcp://127.0.0.1:33234 after 30 s 2022-08-11 14:15:31,234 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:43026 Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 264, in write async def write(self, msg, serializers=None, on_error="message"): asyncio.exceptions.CancelledError During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/asyncio/tasks.py", line 418, in wait_for return fut.result() asyncio.exceptions.CancelledError The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/core.py", line 329, in connect await asyncio.wait_for(comm.write(local_info), time_left()) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/asyncio/tasks.py", line 420, in wait_for raise exceptions.TimeoutError() from exc asyncio.exceptions.TimeoutError The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1983, in gather_dep response = await get_data_from_worker( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", 
line 2725, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 383, in retry_operation return await retry( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 368, in retry return await coro() File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2702, in _get_data comm = await rpc.connect(worker) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 1371, in connect return await connect_attempt File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 1307, in _connect comm = await connect( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/core.py", line 333, in connect raise OSError( OSError: Timed out during handshake while connecting to tcp://127.0.0.1:43026 after 30 s
slice(2, 3, None) slice(3, 4, None) slice(4, 5, None) slice(5, 6, None) slice(6, 7, None) slice(7, 8, None) slice(8, 9, None) slice(9, 10, None) slice(10, 11, None) slice(11, 12, None) slice(12, 13, None) slice(13, 14, None) slice(14, 15, None) slice(15, 16, None) slice(16, 17, None)
2022-08-11 14:36:09,629 - distributed.worker_memory - WARNING - Worker tcp://127.0.0.1:32919 (pid=249759) exceeded 99% memory budget. Restarting... 2022-08-11 14:36:10,519 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:32919 Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 233, in read n = await stream.read_into(chunk) tornado.iostream.StreamClosedError: Stream is closed The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1983, in gather_dep response = await get_data_from_worker( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2725, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 383, in retry_operation return await retry( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 368, in retry return await coro() File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2705, in _get_data response = await send_recv( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 918, in send_recv response = await comm.read(deserializers=deserializers) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read convert_stream_closed_error(self, e) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc}") from exc distributed.comm.core.CommClosedError: in <TCP 
(closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:45808 remote=tcp://127.0.0.1:32919>: Stream is closed 2022-08-11 14:36:10,597 - distributed.nanny - WARNING - Restarting worker 2022-08-11 14:36:10,617 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:32919 Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 233, in read n = await stream.read_into(chunk) tornado.iostream.StreamClosedError: Stream is closed The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1983, in gather_dep response = await get_data_from_worker( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2725, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 383, in retry_operation return await retry( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 368, in retry return await coro() File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2705, in _get_data response = await send_recv( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 918, in send_recv response = await comm.read(deserializers=deserializers) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read convert_stream_closed_error(self, e) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc}") from exc 
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:45680 remote=tcp://127.0.0.1:32919>: Stream is closed
slice(17, 18, None) slice(18, 19, None) slice(19, 20, None) slice(20, 21, None)
2022-08-11 14:42:11,130 - distributed.worker_memory - WARNING - Worker tcp://127.0.0.1:44301 (pid=249765) exceeded 99% memory budget. Restarting... 2022-08-11 14:42:11,990 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:42872 -> tcp://127.0.0.1:44301 Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 223, in read frames_nbytes = await stream.read_bytes(fmt_size) tornado.iostream.StreamClosedError: Stream is closed The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1692, in get_data response = await comm.read(deserializers=serializers) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read convert_stream_closed_error(self, e) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc}") from exc distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:42872 remote=tcp://127.0.0.1:36358>: Stream is closed 2022-08-11 14:42:11,990 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:44301 Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 233, in read n = await stream.read_into(chunk) tornado.iostream.StreamClosedError: Stream is closed The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1983, in gather_dep response = await get_data_from_worker( File 
"/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2725, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 383, in retry_operation return await retry( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 368, in retry return await coro() File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2705, in _get_data response = await send_recv( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 918, in send_recv response = await comm.read(deserializers=deserializers) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read convert_stream_closed_error(self, e) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc}") from exc distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:44332 remote=tcp://127.0.0.1:44301>: Stream is closed 2022-08-11 14:42:11,990 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:44301 Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 233, in read n = await stream.read_into(chunk) tornado.iostream.StreamClosedError: Stream is closed The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1983, in gather_dep response = await get_data_from_worker( File 
"/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2725, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 383, in retry_operation return await retry( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 368, in retry return await coro() File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2705, in _get_data response = await send_recv( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 918, in send_recv response = await comm.read(deserializers=deserializers) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read convert_stream_closed_error(self, e) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc}") from exc distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:44192 remote=tcp://127.0.0.1:44301>: Stream is closed 2022-08-11 14:42:11,990 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:44301 Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer bytes_read = self.read_from_fd(buf) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd return self.socket.recv_into(buf, len(buf)) ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File 
"/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1983, in gather_dep response = await get_data_from_worker( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2725, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 383, in retry_operation return await retry( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 368, in retry return await coro() File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2705, in _get_data response = await send_recv( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 918, in send_recv response = await comm.read(deserializers=deserializers) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read convert_stream_closed_error(self, e) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:44064 remote=tcp://127.0.0.1:44301>: ConnectionResetError: [Errno 104] Connection reset by peer 2022-08-11 14:42:12,025 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:39929 -> tcp://127.0.0.1:44301 Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer bytes_read = self.read_from_fd(buf) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/tornado/iostream.py", 
line 1140, in read_from_fd return self.socket.recv_into(buf, len(buf)) ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1692, in get_data response = await comm.read(deserializers=serializers) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read convert_stream_closed_error(self, e) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:39929 remote=tcp://127.0.0.1:57116>: ConnectionResetError: [Errno 104] Connection reset by peer 2022-08-11 14:42:12,026 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:44301 Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 233, in read n = await stream.read_into(chunk) tornado.iostream.StreamClosedError: Stream is closed The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1983, in gather_dep response = await get_data_from_worker( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2725, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 383, in retry_operation return await retry( File 
"/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 368, in retry return await coro() File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2705, in _get_data response = await send_recv( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 918, in send_recv response = await comm.read(deserializers=deserializers) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read convert_stream_closed_error(self, e) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc}") from exc distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:44498 remote=tcp://127.0.0.1:44301>: Stream is closed 2022-08-11 14:42:12,054 - distributed.nanny - WARNING - Restarting worker 2022-08-11 14:42:12,456 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:39315 -> tcp://127.0.0.1:44301 Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/tornado/iostream.py", line 971, in _handle_write num_bytes = self.write_to_fd(self._write_buffer.peek(size)) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/tornado/iostream.py", line 1148, in write_to_fd return self.socket.send(data) # type: ignore ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1692, in get_data response = await comm.read(deserializers=serializers) File 
"/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read convert_stream_closed_error(self, e) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:39315 remote=tcp://127.0.0.1:42062>: ConnectionResetError: [Errno 104] Connection reset by peer 2022-08-11 14:42:14,170 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:33234 -> tcp://127.0.0.1:44301 Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer bytes_read = self.read_from_fd(buf) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd return self.socket.recv_into(buf, len(buf)) ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1692, in get_data response = await comm.read(deserializers=serializers) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read convert_stream_closed_error(self, e) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:33234 remote=tcp://127.0.0.1:49238>: ConnectionResetError: [Errno 104] Connection reset by peer 2022-08-11 14:42:14,170 - distributed.worker - ERROR - Worker stream 
died during communication: tcp://127.0.0.1:44301 Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 233, in read n = await stream.read_into(chunk) tornado.iostream.StreamClosedError: Stream is closed The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1983, in gather_dep response = await get_data_from_worker( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2725, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 383, in retry_operation return await retry( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 368, in retry return await coro() File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2705, in _get_data response = await send_recv( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 918, in send_recv response = await comm.read(deserializers=deserializers) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read convert_stream_closed_error(self, e) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc}") from exc distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:44360 remote=tcp://127.0.0.1:44301>: Stream is closed 2022-08-11 14:42:13,877 - distributed.worker - ERROR - failed during get data with 
tcp://127.0.0.1:43124 -> tcp://127.0.0.1:44301 Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/tornado/iostream.py", line 971, in _handle_write num_bytes = self.write_to_fd(self._write_buffer.peek(size)) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/tornado/iostream.py", line 1148, in write_to_fd return self.socket.send(data) # type: ignore BrokenPipeError: [Errno 32] Broken pipe The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1692, in get_data response = await comm.read(deserializers=serializers) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read convert_stream_closed_error(self, e) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:43124 remote=tcp://127.0.0.1:49278>: BrokenPipeError: [Errno 32] Broken pipe 2022-08-11 14:42:13,805 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:44301 Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 233, in read n = await stream.read_into(chunk) tornado.iostream.StreamClosedError: Stream is closed The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1983, in gather_dep response = await get_data_from_worker( File 
"/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2725, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 383, in retry_operation return await retry( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 368, in retry return await coro() File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2705, in _get_data response = await send_recv( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 918, in send_recv response = await comm.read(deserializers=deserializers) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read convert_stream_closed_error(self, e) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc}") from exc distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:44134 remote=tcp://127.0.0.1:44301>: Stream is closed 2022-08-11 14:42:16,265 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:44210 -> tcp://127.0.0.1:44301 Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer bytes_read = self.read_from_fd(buf) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd return self.socket.recv_into(buf, len(buf)) ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): 
File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1692, in get_data response = await comm.read(deserializers=serializers) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read convert_stream_closed_error(self, e) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:44210 remote=tcp://127.0.0.1:56292>: ConnectionResetError: [Errno 104] Connection reset by peer 2022-08-11 14:42:16,616 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:40294 -> tcp://127.0.0.1:44301 Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer bytes_read = self.read_from_fd(buf) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd return self.socket.recv_into(buf, len(buf)) ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1692, in get_data response = await comm.read(deserializers=serializers) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read convert_stream_closed_error(self, e) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc distributed.comm.core.CommClosedError: in <TCP (closed) 
local=tcp://127.0.0.1:40294 remote=tcp://127.0.0.1:40656>: ConnectionResetError: [Errno 104] Connection reset by peer
slice(21, 22, None)
slice(22, 23, None)
slice(23, 24, None)
2022-08-11 14:47:19,106 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:42872
Traceback (most recent call last):
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 223, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/core.py", line 328, in connect
    handshake = await asyncio.wait_for(comm.read(), time_left())
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1983, in gather_dep
    response = await get_data_from_worker(
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2725, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 383, in retry_operation
    return await retry(
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 368, in retry
    return await coro()
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2702, in _get_data
    comm = await rpc.connect(worker)
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 1371, in connect
    return await connect_attempt
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 1307, in _connect
    comm = await connect(
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/core.py", line 333, in connect
    raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:42872 after 30 s
slice(24, 25, None)
slice(25, 26, None)
slice(26, 27, None)
slice(27, 28, None)
slice(28, 29, None)
2022-08-11 14:53:51,829 - distributed.worker_memory - WARNING - Worker tcp://127.0.0.1:39929 (pid=249739) exceeded 99% memory budget. Restarting... 2022-08-11 14:53:52,684 - distributed.nanny - WARNING - Restarting worker 2022-08-11 14:53:52,715 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:44210 -> tcp://127.0.0.1:39929 Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer bytes_read = self.read_from_fd(buf) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd return self.socket.recv_into(buf, len(buf)) ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1692, in get_data response = await comm.read(deserializers=serializers) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read convert_stream_closed_error(self, e) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:44210 remote=tcp://127.0.0.1:56632>: ConnectionResetError: [Errno 104] Connection reset by peer 2022-08-11 14:53:56,822 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:39929 Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 233, in read n = await stream.read_into(chunk) tornado.iostream.StreamClosedError: Stream is closed The above exception was the direct 
cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1983, in gather_dep response = await get_data_from_worker( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2725, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 383, in retry_operation return await retry( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 368, in retry return await coro() File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2705, in _get_data response = await send_recv( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 918, in send_recv response = await comm.read(deserializers=deserializers) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read convert_stream_closed_error(self, e) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc}") from exc distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:57820 remote=tcp://127.0.0.1:39929>: Stream is closed 2022-08-11 14:53:57,848 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:39929 Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer bytes_read = self.read_from_fd(buf) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/tornado/iostream.py", line 
1140, in read_from_fd return self.socket.recv_into(buf, len(buf)) ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1983, in gather_dep response = await get_data_from_worker( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2725, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 383, in retry_operation return await retry( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 368, in retry return await coro() File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2705, in _get_data response = await send_recv( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 918, in send_recv response = await comm.read(deserializers=deserializers) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read convert_stream_closed_error(self, e) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:57520 remote=tcp://127.0.0.1:39929>: ConnectionResetError: [Errno 104] Connection reset by peer 2022-08-11 14:54:06,447 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:39929 Traceback (most recent call last): File 
"/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 233, in read n = await stream.read_into(chunk) tornado.iostream.StreamClosedError: Stream is closed The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1983, in gather_dep response = await get_data_from_worker( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2725, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 383, in retry_operation return await retry( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 368, in retry return await coro() File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2705, in _get_data response = await send_recv( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 918, in send_recv response = await comm.read(deserializers=deserializers) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read convert_stream_closed_error(self, e) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc}") from exc distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:57578 remote=tcp://127.0.0.1:39929>: Stream is closed 2022-08-11 14:54:06,159 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:39929 Traceback (most recent call last): File 
"/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 233, in read n = await stream.read_into(chunk) tornado.iostream.StreamClosedError: Stream is closed The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1983, in gather_dep response = await get_data_from_worker( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2725, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 383, in retry_operation return await retry( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 368, in retry return await coro() File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2705, in _get_data response = await send_recv( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 918, in send_recv response = await comm.read(deserializers=deserializers) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read convert_stream_closed_error(self, e) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc}") from exc distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:57352 remote=tcp://127.0.0.1:39929>: Stream is closed 2022-08-11 14:54:30,081 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:39929 Traceback (most recent call last): File 
"/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 449, in connect stream = await self.client.connect( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/tornado/tcpclient.py", line 275, in connect af, addr, stream = await connector.start(connect_timeout=timeout) asyncio.exceptions.CancelledError During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/asyncio/tasks.py", line 456, in wait_for return fut.result() asyncio.exceptions.CancelledError The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/core.py", line 291, in connect comm = await asyncio.wait_for( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/asyncio/tasks.py", line 458, in wait_for raise exceptions.TimeoutError() from exc asyncio.exceptions.TimeoutError The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1983, in gather_dep response = await get_data_from_worker( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2725, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 383, in retry_operation return await retry( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 368, in retry return await coro() File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2702, in _get_data comm = await rpc.connect(worker) File 
"/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 1371, in connect return await connect_attempt File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 1307, in _connect comm = await connect( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/core.py", line 317, in connect raise OSError( OSError: Timed out trying to connect to tcp://127.0.0.1:39929 after 30 s 2022-08-11 14:54:30,951 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:39929 ConnectionRefusedError: [Errno 111] Connection refused The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/core.py", line 291, in connect comm = await asyncio.wait_for( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/asyncio/tasks.py", line 445, in wait_for return fut.result() File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 461, in connect convert_stream_closed_error(self, e) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc distributed.comm.core.CommClosedError: in <distributed.comm.tcp.TCPConnector object at 0x2b983925afb0>: ConnectionRefusedError: [Errno 111] Connection refused The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1983, in gather_dep response = await get_data_from_worker( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2725, in get_data_from_worker return await 
retry_operation(_get_data, operation="get_data_from_worker") File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 383, in retry_operation return await retry( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 368, in retry return await coro() File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2702, in _get_data comm = await rpc.connect(worker) File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 1371, in connect return await connect_attempt File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 1307, in _connect comm = await connect( File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/core.py", line 317, in connect raise OSError( OSError: Timed out trying to connect to tcp://127.0.0.1:39929 after 30 s
slice(29, 30, None)
2022-08-11 14:56:05,008 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:42872
Traceback (most recent call last):
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 264, in write
    async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/asyncio/tasks.py", line 418, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/core.py", line 329, in connect
    await asyncio.wait_for(comm.write(local_info), time_left())
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/asyncio/tasks.py", line 420, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1983, in gather_dep
    response = await get_data_from_worker(
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2725, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 383, in retry_operation
    return await retry(
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 368, in retry
    return await coro()
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2702, in _get_data
    comm = await rpc.connect(worker)
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 1371, in connect
    return await connect_attempt
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 1307, in _connect
    comm = await connect(
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/core.py", line 333, in connect
    raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:42872 after 30 s
2022-08-11 14:56:05,016 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:45644
Traceback (most recent call last):
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 264, in write
    async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/asyncio/tasks.py", line 418, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/core.py", line 329, in connect
    await asyncio.wait_for(comm.write(local_info), time_left())
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/asyncio/tasks.py", line 420, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1983, in gather_dep
    response = await get_data_from_worker(
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2725, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 383, in retry_operation
    return await retry(
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 368, in retry
    return await coro()
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2702, in _get_data
    comm = await rpc.connect(worker)
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 1371, in connect
    return await connect_attempt
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 1307, in _connect
    comm = await connect(
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/core.py", line 333, in connect
    raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:45644 after 30 s
2022-08-11 14:56:05,016 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:36041
Traceback (most recent call last):
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/tcp.py", line 264, in write
    async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/asyncio/tasks.py", line 418, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/core.py", line 329, in connect
    await asyncio.wait_for(comm.write(local_info), time_left())
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/asyncio/tasks.py", line 420, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 1983, in gather_dep
    response = await get_data_from_worker(
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2725, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 383, in retry_operation
    return await retry(
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/utils_comm.py", line 368, in retry
    return await coro()
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/worker.py", line 2702, in _get_data
    comm = await rpc.connect(worker)
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 1371, in connect
    return await connect_attempt
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/core.py", line 1307, in _connect
    comm = await connect(
  File "/ccc/cont003/home/ra5563/ra5563/monitor/lib/python3.10/site-packages/distributed/comm/core.py", line 333, in connect
    raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:36041 after 30 s
CPU times: user 7min 35s, sys: 1min 33s, total: 9min 9s
Wall time: 45min 2s
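The worker log above repeats the same long traceback for each peer that could not be reached, so it can be hard to see at a glance which addresses failed and how. The following is a minimal sketch, not part of the `core` package, that condenses such a log with two regular expressions. The function name `summarize_worker_errors` and the `sample` text are hypothetical; the patterns are taken only from the message wording that appears in the output above and may need adjusting for other versions of `distributed`.

```python
import re
from collections import Counter

# Header line that distributed.worker logs for each peer it lost contact with.
DIED = re.compile(r"Worker stream died during communication: (tcp://[\d.]+:\d+)")

# Final OSError line of each traceback; the wording differs between a plain
# connection timeout and a timeout during the handshake.
TIMEOUT = re.compile(
    r"OSError: Timed out (trying to connect|during handshake while connecting) "
    r"to (tcp://[\d.]+:\d+) after \d+ s"
)


def summarize_worker_errors(log_text):
    """Count failed peers and timeout kinds in a captured worker log (hypothetical helper)."""
    died = Counter(DIED.findall(log_text))
    timeouts = Counter(
        (m.group(2), "handshake" if "handshake" in m.group(1) else "connect")
        for m in TIMEOUT.finditer(log_text)
    )
    return died, timeouts


# Hypothetical sample assembled from message lines of the kind shown above.
sample = "\n".join([
    "2022-08-11 14:54:30,951 - distributed.worker - ERROR - "
    "Worker stream died during communication: tcp://127.0.0.1:39929",
    "OSError: Timed out trying to connect to tcp://127.0.0.1:39929 after 30 s",
    "2022-08-11 14:56:05,008 - distributed.worker - ERROR - "
    "Worker stream died during communication: tcp://127.0.0.1:42872",
    "OSError: Timed out during handshake while connecting to tcp://127.0.0.1:42872 after 30 s",
])

died, timeouts = summarize_worker_errors(sample)
for address, count in died.items():
    print(address, count)
for (address, kind), count in timeouts.items():
    print(address, kind, count)
```

The 30 s limit in these messages appears to correspond to the `distributed.comm.timeouts.connect` configuration value of Dask; whether raising it would help in this run is an open question, since the peers may simply have been restarted or overloaded.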