Parsl jobs lost on larger workflows: parsl.executors.high_throughput.interchange.ManagerLost #51


Closed
jluethi opened this issue Sep 14, 2022 · 4 comments


@jluethi
Collaborator

jluethi commented Sep 14, 2022

I've been trying to run a large example through the current Fractal architecture and ran into this issue:

Jobs start (up to 4 jobs for the 10-well case). After some time, some jobs finish and other jobs start, running for a few minutes each. After about 10 minutes, all jobs have stopped.

The run created the zarr structure and the ROI tables within it, but no image data ever gets written to the zarr file, and the server contains this error message:

ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/data/homes/jluethi/.conda/envs/fractal-dev/lib/python3.8/site-packages/parsl/dataflow/dflow.py", line 797, in _unwrap_futures
    new_inputs.extend([dep.result()])
  File "/data/homes/jluethi/.conda/envs/fractal-dev/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/data/homes/jluethi/.conda/envs/fractal-dev/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/data/homes/jluethi/.conda/envs/fractal-dev/lib/python3.8/site-packages/parsl/dataflow/dflow.py", line 288, in handle_exec_update
    res = self._unwrap_remote_exception_wrapper(future)
  File "/data/homes/jluethi/.conda/envs/fractal-dev/lib/python3.8/site-packages/parsl/dataflow/dflow.py", line 481, in _unwrap_remote_exception_wrapper
    result = future.result()
  File "/data/homes/jluethi/.conda/envs/fractal-dev/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/data/homes/jluethi/.conda/envs/fractal-dev/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/data/homes/jluethi/.conda/envs/fractal-dev/lib/python3.8/site-packages/parsl/executors/high_throughput/executor.py", line 438, in _queue_management_worker
    s.reraise()
  File "/data/homes/jluethi/.conda/envs/fractal-dev/lib/python3.8/site-packages/parsl/app/errors.py", line 138, in reraise
    reraise(t, v, v.__traceback__)
  File "/data/homes/jluethi/.conda/envs/fractal-dev/lib/python3.8/site-packages/six.py", line 719, in reraise
    raise value
  File "/data/homes/jluethi/.conda/envs/fractal-dev/lib/python3.8/site-packages/parsl/executors/high_throughput/interchange.py", line 559, in start
    raise ManagerLost(manager_id, m['hostname'])
parsl.executors.high_throughput.interchange.ManagerLost: Task failure due to loss of manager 4b73cc5f7614 on host pelkmanslab-slurm-worker-011

I currently can't use parsl visualize, because my installation approach doesn't seem to make that easy. I'm a bit worried that if we're losing the manager, a job may get shut down suddenly. Where would we find more relevant logs?
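For reference, parsl-visualize only has data to show when monitoring is enabled in the parsl Config. A minimal sketch of what that typically looks like (a plain HighThroughputExecutor with placeholder settings, not the actual Fractal server configuration):

# Sketch only: enable parsl monitoring so that `parsl-visualize` has a
# monitoring database to read. Executor details are placeholders.
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.monitoring.monitoring import MonitoringHub
from parsl.addresses import address_by_hostname

config = Config(
    executors=[HighThroughputExecutor(label="cpu")],
    monitoring=MonitoringHub(
        hub_address=address_by_hostname(),
        resource_monitoring_interval=10,  # seconds between resource samples
    ),
)

With this enabled, a monitoring.db SQLite file is written for the run, which parsl-visualize then reads; note that parsl needs to be installed with its monitoring extra (pip install parsl[monitoring]) for both pieces to be available.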

This is my run script for the example:

# Register user (this step will change in the future)
http POST localhost:8000/auth/register [email protected] password=test

# Define/initialize empty folder for project-related info
# (and also for the output dataset -- see below)
TMPDIR=`pwd`/tmp-proj-1
rm -r $TMPDIR
mkdir $TMPDIR

# Set useful variables
PROJECT_NAME="project_10x10"
DATASET_IN_NAME="input-ds-1"
DATASET_OUT_NAME="output-ds-1"
WORKFLOW_NAME="My workflow 1"

# Create project
fractal project new $PROJECT_NAME $TMPDIR

INPUT_PATH=/data/active/fractal/3D/PelkmansLab/CardiacMultiplexing/Cycle1_5x5_10wells_constantZ
OUTPUT_PATH=/data/active/jluethi/Fractal/20220914_10well_5x5

# Update dataset info
fractal dataset modify-dataset $PROJECT_NAME "default" --new_dataset_name $DATASET_IN_NAME --type image --read_only true

# Add resource to dataset
fractal dataset add-resource $PROJECT_NAME $DATASET_IN_NAME ${INPUT_PATH} --glob_pattern "*.png"

# Add output dataset
fractal project add-dataset $PROJECT_NAME $DATASET_OUT_NAME --type zarr
fractal dataset add-resource $PROJECT_NAME $DATASET_OUT_NAME ${OUTPUT_PATH} --glob_pattern "*.zarr"


# Create workflow
fractal task new "$WORKFLOW_NAME" workflow image zarr

# Add subtasks (with args, if needed)
echo "{\"num_levels\": 5, \"coarsening_xy\": 2, \"channel_parameters\": {\"A01_C01\": {\"label\": \"DAPI\",\"colormap\": \"00FFFF\",\"start\": 110,\"end\": 800 }, \"A01_C02\": {\"label\": \"nanog\",\"colormap\": \"FF00FF\",\"start\": 110,\"end\": 290 }, \"A02_C03\": {\"label\": \"Lamin B1\",\"colormap\": \"FFFF00\",\"start\": 110,\"end\": 1600 }}}" > ${TMPDIR}/args_create.json
fractal task add-subtask "$WORKFLOW_NAME" "Create OME-ZARR structure" --args_json ${TMPDIR}/args_create.json

echo "{\"parallelization_level\" : \"well\", \"executor\": \"cpu\"}" > ${TMPDIR}/args_yoko.json
fractal task add-subtask "$WORKFLOW_NAME" "Yokogawa to Zarr" --args_json ${TMPDIR}/args_yoko.json

echo "{\"parallelization_level\" : \"well\", \"labeling_level\": 1, \"labeling_channel\": \"A01_C01\", \"executor\": \"gpu\"}" > ${TMPDIR}/args_labeling.json
fractal task add-subtask "$WORKFLOW_NAME" "Per-FOV image labeling" --args_json ${TMPDIR}/args_labeling.json

fractal task add-subtask "$WORKFLOW_NAME" "Replicate Zarr structure"
echo "{\"parallelization_level\" : \"well\", \"executor\": \"cpu\"}" > ${TMPDIR}/args_mip.json
fractal task add-subtask "$WORKFLOW_NAME" "Maximum Intensity Projection" --args_json ${TMPDIR}/args_mip.json

echo "{\"parallelization_level\" : \"well\", \"labeling_level\": 2, \"labeling_channel\": \"A01_C01\", \"executor\": \"gpu\"}" > ${TMPDIR}/args_whole_well_labeling.json
fractal task add-subtask "$WORKFLOW_NAME" "Whole-well image labeling" --args_json ${TMPDIR}/args_whole_well_labeling.json

echo "{\"parallelization_level\" : \"well\", \"level\": 0, \"table_name\": \"nuclei\", \"executor\": \"cpu\", \"workflow_file\": \"/data/homes/jluethi/fractal_3repo/fractal/examples/05_10x10_test_constant_z/regionprops_from_existing_labels_feature.yaml\"}" > ${TMPDIR}/args_measurement.json
fractal task add-subtask "$WORKFLOW_NAME" "Measurement" --args_json ${TMPDIR}/args_measurement.json

# Apply workflow
fractal workflow apply $PROJECT_NAME $DATASET_IN_NAME "$WORKFLOW_NAME" --output_dataset_name $DATASET_OUT_NAME

@tcompa
Collaborator

tcompa commented Sep 15, 2022

Thanks for reporting this.

I'm rerunning your example, up to the yokogawa_to_zarr task, and I can reproduce the error.
It's likely a memory error, as I see in two places:

  1. In the monitoring (see graph below)
  2. In some of the (admittedly hidden) parsl logs:
$ grep memory server/runinfo/000/submit_scripts/*stderr
server/runinfo/000/submit_scripts/parsl.slurm.1663225286.5581121.submit.stderr:0: slurmstepd: Step 9321508.0 exceeded virtual memory limit (71780116 > 64674129), being killed
server/runinfo/000/submit_scripts/parsl.slurm.1663225286.5581121.submit.stderr:slurmstepd: Exceeded job memory limit

[Screenshot (2022-09-15 09:26): monitoring graph showing memory usage during the run]
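If the memory limit is indeed the culprit, the most direct knob is the per-node memory requested through the SlurmProvider. A sketch with placeholder values (partition, memory and worker counts are assumptions, not the actual Fractal config):

# Hypothetical sketch: raise the SLURM memory request in the parsl config.
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label="cpu",
            max_workers=4,            # cap on concurrent tasks per node
            provider=SlurmProvider(
                partition="main",     # placeholder partition name
                nodes_per_block=1,
                mem_per_node=128,     # GB per node; the killed step hit a ~64 GB limit
                walltime="02:00:00",
            ),
        )
    ],
)

Lowering max_workers (fewer concurrent tasks per node) is the other side of the same trade-off.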

Work on this issue is moved to the tasks repo: fractal-analytics-platform/fractal-tasks-core#72

@jluethi
Collaborator Author

jluethi commented Sep 15, 2022

Ah ok. So then, on the server side, it would be important that these types of messages go to some log where the user knows to look. Was this in parsl.log?
And the actual issue then is on the task side. Thanks for opening it there! :)

@tcompa
Collaborator

tcompa commented Sep 15, 2022

It's not in parsl.log, but in the SLURM logs (as in fractal->parsl->slurm). Parsl then fails "badly" because of a SLURM error, and the ManagerLost exception is not very informative.

Those logs are located in paths like

server/runinfo/000/submit_scripts/parsl.slurm.1663225286.5581121.submit.stderr
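A quick way to scan all of them at once (a small helper that assumes the default runinfo layout shown above):

# Scan the latest parsl run directory for SLURM kill/memory messages.
import glob
import os

run_dirs = sorted(glob.glob("server/runinfo/[0-9]*"))
if run_dirs:
    latest = run_dirs[-1]
    for path in glob.glob(os.path.join(latest, "submit_scripts", "*.stderr")):
        with open(path) as f:
            for line in f:
                if "memory" in line.lower() or "killed" in line.lower():
                    print(f"{os.path.basename(path)}: {line.rstrip()}")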

@tcompa
Collaborator

tcompa commented Sep 15, 2022

The issue of harmonizing logs is broader than this single case, so I'm closing this issue; we should come back to this topic later with a more organized plan.

But I think we first need to consolidate our choices on parsl executors before moving on to organizing monitoring/logs. The current choice seems to work (up to task errors, of course), but we still need to make sure that we are happy with it. Tests (this one included) will help us say so (or not).
