Parsl jobs lost on larger workflows: parsl.executors.high_throughput.interchange.ManagerLost #51


Closed
jluethi opened this issue Sep 14, 2022 · 4 comments


@jluethi
Collaborator

jluethi commented Sep 14, 2022

I've been trying to run a large example through the current Fractal architecture and ran into this issue:

Jobs start (up to 4 jobs for the 10-well case). After some time, some jobs finish and other jobs start, running for a few minutes each. After about 10 minutes, all jobs have stopped.

The run created the zarr structure and the ROI tables within it, but no image data ever gets written to the zarr file, and the server contains this error message:

ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/data/homes/jluethi/.conda/envs/fractal-dev/lib/python3.8/site-packages/parsl/dataflow/dflow.py", line 797, in _unwrap_futures
    new_inputs.extend([dep.result()])
  File "/data/homes/jluethi/.conda/envs/fractal-dev/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/data/homes/jluethi/.conda/envs/fractal-dev/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/data/homes/jluethi/.conda/envs/fractal-dev/lib/python3.8/site-packages/parsl/dataflow/dflow.py", line 288, in handle_exec_update
    res = self._unwrap_remote_exception_wrapper(future)
  File "/data/homes/jluethi/.conda/envs/fractal-dev/lib/python3.8/site-packages/parsl/dataflow/dflow.py", line 481, in _unwrap_remote_exception_wrapper
    result = future.result()
  File "/data/homes/jluethi/.conda/envs/fractal-dev/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/data/homes/jluethi/.conda/envs/fractal-dev/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/data/homes/jluethi/.conda/envs/fractal-dev/lib/python3.8/site-packages/parsl/executors/high_throughput/executor.py", line 438, in _queue_management_worker
    s.reraise()
  File "/data/homes/jluethi/.conda/envs/fractal-dev/lib/python3.8/site-packages/parsl/app/errors.py", line 138, in reraise
    reraise(t, v, v.__traceback__)
  File "/data/homes/jluethi/.conda/envs/fractal-dev/lib/python3.8/site-packages/six.py", line 719, in reraise
    raise value
  File "/data/homes/jluethi/.conda/envs/fractal-dev/lib/python3.8/site-packages/parsl/executors/high_throughput/interchange.py", line 559, in start
    raise ManagerLost(manager_id, m['hostname'])
parsl.executors.high_throughput.interchange.ManagerLost: Task failure due to loss of manager 4b73cc5f7614 on host pelkmanslab-slurm-worker-011

I currently can't use parsl visualize, because my installation approach doesn't seem to make that easy. I'm a bit worried that if we're losing the manager, a job may get shut down suddenly. Where would we find more relevant logs?
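For reference, parsl-visualize only has data to show when monitoring is enabled in the parsl Config. A minimal sketch of what that typically looks like (a plain HighThroughputExecutor with placeholder settings, not the actual Fractal server configuration):

# Sketch only: enable parsl monitoring so that `parsl-visualize` has a
# monitoring database to read. Executor details are placeholders.
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.monitoring.monitoring import MonitoringHub
from parsl.addresses import address_by_hostname

config = Config(
    executors=[HighThroughputExecutor(label="cpu")],
    monitoring=MonitoringHub(
        hub_address=address_by_hostname(),
        resource_monitoring_interval=10,  # seconds between resource samples
    ),
)

With this enabled, a monitoring.db SQLite file is written for the run, which parsl-visualize then reads; note that parsl needs to be installed with its monitoring extra (pip install parsl[monitoring]) for both pieces to be available.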

This is my run script for the example:

# Register user (this step will change in the future)
http POST localhost:8000/auth/register [email protected] password=test

# Define/initialize empty folder for project-related info
# (and also for the output dataset -- see below)
TMPDIR=`pwd`/tmp-proj-1
rm -r $TMPDIR
mkdir $TMPDIR

# Set useful variables
PROJECT_NAME="project_10x10"
DATASET_IN_NAME="input-ds-1"
DATASET_OUT_NAME="output-ds-1"
WORKFLOW_NAME="My workflow 1"

# Create project
fractal project new $PROJECT_NAME $TMPDIR

INPUT_PATH=/data/active/fractal/3D/PelkmansLab/CardiacMultiplexing/Cycle1_5x5_10wells_constantZ
OUTPUT_PATH=/data/active/jluethi/Fractal/20220914_10well_5x5

# Update dataset info
fractal dataset modify-dataset $PROJECT_NAME "default" --new_dataset_name $DATASET_IN_NAME --type image --read_only true

# Add resource to dataset
fractal dataset add-resource $PROJECT_NAME $DATASET_IN_NAME ${INPUT_PATH} --glob_pattern "*.png"

# Add output dataset
fractal project add-dataset $PROJECT_NAME $DATASET_OUT_NAME --type zarr
fractal dataset add-resource $PROJECT_NAME $DATASET_OUT_NAME ${OUTPUT_PATH} --glob_pattern "*.zarr"


# Create workflow
fractal task new "$WORKFLOW_NAME" workflow image zarr

# Add subtasks (with args, if needed)
echo "{\"num_levels\": 5, \"coarsening_xy\": 2, \"channel_parameters\": {\"A01_C01\": {\"label\": \"DAPI\",\"colormap\": \"00FFFF\",\"start\": 110,\"end\": 800 }, \"A01_C02\": {\"label\": \"nanog\",\"colormap\": \"FF00FF\",\"start\": 110,\"end\": 290 }, \"A02_C03\": {\"label\": \"Lamin B1\",\"colormap\": \"FFFF00\",\"start\": 110,\"end\": 1600 }}}" > ${TMPDIR}/args_create.json
fractal task add-subtask "$WORKFLOW_NAME" "Create OME-ZARR structure" --args_json ${TMPDIR}/args_create.json

echo "{\"parallelization_level\" : \"well\", \"executor\": \"cpu\"}" > ${TMPDIR}/args_yoko.json
fractal task add-subtask "$WORKFLOW_NAME" "Yokogawa to Zarr" --args_json ${TMPDIR}/args_yoko.json

echo "{\"parallelization_level\" : \"well\", \"labeling_level\": 1, \"labeling_channel\": \"A01_C01\", \"executor\": \"gpu\"}" > ${TMPDIR}/args_labeling.json
fractal task add-subtask "$WORKFLOW_NAME" "Per-FOV image labeling" --args_json ${TMPDIR}/args_labeling.json

fractal task add-subtask "$WORKFLOW_NAME" "Replicate Zarr structure"
echo "{\"parallelization_level\" : \"well\", \"executor\": \"cpu\"}" > ${TMPDIR}/args_mip.json
fractal task add-subtask "$WORKFLOW_NAME" "Maximum Intensity Projection" --args_json ${TMPDIR}/args_mip.json

echo "{\"parallelization_level\" : \"well\", \"labeling_level\": 2, \"labeling_channel\": \"A01_C01\", \"executor\": \"gpu\"}" > ${TMPDIR}/args_whole_well_labeling.json
fractal task add-subtask "$WORKFLOW_NAME" "Whole-well image labeling" --args_json ${TMPDIR}/args_whole_well_labeling.json

echo "{\"parallelization_level\" : \"well\", \"level\": 0, \"table_name\": \"nuclei\", \"executor\": \"cpu\", \"workflow_file\": \"/data/homes/jluethi/fractal_3repo/fractal/examples/05_10x10_test_constant_z/regionprops_from_existing_labels_feature.yaml\"}" > ${TMPDIR}/args_measurement.json
fractal task add-subtask "$WORKFLOW_NAME" "Measurement" --args_json ${TMPDIR}/args_measurement.json

# Apply workflow
fractal workflow apply $PROJECT_NAME $DATASET_IN_NAME "$WORKFLOW_NAME" --output_dataset_name $DATASET_OUT_NAME

@tcompa
Collaborator

tcompa commented Sep 15, 2022

Thanks for reporting this.

I'm rerunning your example, up to the yokogawa_to_zarr task, and I can reproduce the error.
It's likely a memory error, as I see in two places:

  1. In the monitoring (see graph below)
  2. In some of the (admittedly hidden) parsl logs:
$ grep memory server/runinfo/000/submit_scripts/*stderr
server/runinfo/000/submit_scripts/parsl.slurm.1663225286.5581121.submit.stderr:0: slurmstepd: Step 9321508.0 exceeded virtual memory limit (71780116 > 64674129), being killed
server/runinfo/000/submit_scripts/parsl.slurm.1663225286.5581121.submit.stderr:slurmstepd: Exceeded job memory limit

[Screenshot (2022-09-15 09:26): monitoring graph showing memory usage during the run]
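If the memory limit is indeed the culprit, the most direct knob is the per-node memory requested through the SlurmProvider. A sketch with placeholder values (partition, memory and worker counts are assumptions, not the actual Fractal config):

# Hypothetical sketch: raise the SLURM memory request in the parsl config.
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label="cpu",
            max_workers=4,            # cap on concurrent tasks per node
            provider=SlurmProvider(
                partition="main",     # placeholder partition name
                nodes_per_block=1,
                mem_per_node=128,     # GB per node; the killed step hit a ~64 GB limit
                walltime="02:00:00",
            ),
        )
    ],
)

Lowering max_workers (fewer concurrent tasks per node) is the other side of the same trade-off.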

Work on this issue is moved to the tasks repo: fractal-analytics-platform/fractal-tasks-core#72

@jluethi
Collaborator Author

jluethi commented Sep 15, 2022

Ah ok. So then, on the server side, it would be important that these types of messages go to some log where the user knows to look. Was this in parsl.log?
And the actual issue then is on the task side. Thanks for opening it there! :)

@tcompa
Collaborator

tcompa commented Sep 15, 2022

It's not in parsl.log, but in the SLURM logs (as in fractal->parsl->slurm). Parsl then fails "badly" because of a SLURM error, and the ManagerLost exception is not very informative.

Those logs are located in paths like

server/runinfo/000/submit_scripts/parsl.slurm.1663225286.5581121.submit.stderr
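A quick way to scan all of them at once (a small helper that assumes the default runinfo layout shown above):

# Scan the latest parsl run directory for SLURM kill/memory messages.
import glob
import os

run_dirs = sorted(glob.glob("server/runinfo/[0-9]*"))
if run_dirs:
    latest = run_dirs[-1]
    for path in glob.glob(os.path.join(latest, "submit_scripts", "*.stderr")):
        with open(path) as f:
            for line in f:
                if "memory" in line.lower() or "killed" in line.lower():
                    print(f"{os.path.basename(path)}: {line.rstrip()}")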

@tcompa
Collaborator

tcompa commented Sep 15, 2022

The issue of harmonizing logs is broader than this single case, so I'm closing this issue; we should come back to this topic later with a more organized plan.

But I think we first need to consolidate our choices on parsl executors before moving on to organizing monitoring/logs. The current choice seems to work (up to task errors, of course), but we still need to make sure that we are happy with it. Tests (this one included) will help us say so (or not).
