Transient Errors During Job Runs #27032

joshua-janicas · 2025-01-10T20:36:53Z


Hi all,

I've been dealing with an odd situation in our self-hosted Dagster deployment. Our container (running on Debian 12) currently uses Dagster 1.8.13. Ever since we went live in production, we've seen a transient error that prevents one of our ops from running, and it is generally a different op that fails each time. The error is in the collapsed block below. Note that in the image the failed op is shown in grey and never even started.

Stack Trace

dagster._core.errors.DagsterSubprocessError: During multiprocess execution errors occurred in child processes:
In process 10200: orjson.JSONDecodeError: unexpected end of data: line 1 column 13894255 (char 13894254)

Stack Trace:
  File "/project/.meltano/utilities/dagster/venv/lib/python3.11/site-packages/dagster/_core/executor/child_process_executor.py", line 80, in _execute_command_in_child_process
    for step_event in command.execute():
  File "/project/.meltano/utilities/dagster/venv/lib/python3.11/site-packages/dagster/_core/executor/multiprocess.py", line 89, in execute
    execution_plan = create_execution_plan(
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/project/.meltano/utilities/dagster/venv/lib/python3.11/site-packages/dagster/_core/execution/api.py", line 696, in create_execution_plan
    job_def = job.get_definition()
              ^^^^^^^^^^^^^^^^^^^^
  File "/project/.meltano/utilities/dagster/venv/lib/python3.11/site-packages/dagster/_core/definitions/reconstruct.py", line 268, in get_definition
    return check.not_none(self.get_repository_definition()).get_maybe_subset_job_def(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/.meltano/utilities/dagster/venv/lib/python3.11/site-packages/dagster/_core/definitions/reconstruct.py", line 262, in get_repository_definition
    return self.repository.get_definition()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/.meltano/utilities/dagster/venv/lib/python3.11/site-packages/dagster/_core/definitions/reconstruct.py", line 120, in get_definition
    return repository_def_from_pointer(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/.meltano/utilities/dagster/venv/lib/python3.11/site-packages/dagster/_core/definitions/reconstruct.py", line 778, in repository_def_from_pointer
    target = def_from_pointer(pointer)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/.meltano/utilities/dagster/venv/lib/python3.11/site-packages/dagster/_core/definitions/reconstruct.py", line 647, in def_from_pointer
    target = pointer.load_target()
             ^^^^^^^^^^^^^^^^^^^^^
  File "/project/.meltano/utilities/dagster/venv/lib/python3.11/site-packages/dagster/_core/code_pointer.py", line 174, in load_target
    module = load_python_file(self.python_file, self.working_directory)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/.meltano/utilities/dagster/venv/lib/python3.11/site-packages/dagster/_core/code_pointer.py", line 83, in load_python_file
    return import_module_from_path(module_name, python_file)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/.meltano/utilities/dagster/venv/lib/python3.11/site-packages/dagster/_seven/__init__.py", line 46, in import_module_from_path
    spec.loader.exec_module(module)
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/project/orchestrate/dagster/repository.py", line 116, in <module>
    @dbt_assets(
     ^^^^^^^^^^^
  File "/project/.meltano/utilities/dagster/venv/lib/python3.11/site-packages/dagster/_core/decorator_utils.py", line 223, in wrapped_with_context_manager_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/project/.meltano/utilities/dagster/venv/lib/python3.11/site-packages/dagster_dbt/asset_decorator.py", line 303, in dbt_assets
    manifest = validate_manifest(manifest)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/.meltano/utilities/dagster/venv/lib/python3.11/site-packages/dagster_dbt/dbt_manifest.py", line 37, in validate_manifest
    manifest = read_manifest_path(manifest.resolve())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/.meltano/utilities/dagster/venv/lib/python3.11/site-packages/dagster_dbt/dbt_manifest.py", line 26, in read_manifest_path
    return cast(Mapping[str, Any], orjson.loads(manifest_path.read_bytes()))
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/project/.meltano/utilities/dagster/venv/lib/python3.11/site-packages/dagster/_core/execution/api.py", line 748, in job_execution_iterator
    for event in job_context.executor.execute(job_context, execution_plan):
  File "/project/.meltano/utilities/dagster/venv/lib/python3.11/site-packages/dagster/_core/executor/multiprocess.py", line 340, in execute
    raise DagsterSubprocessError(

But when I run the job again, the whole flow works without issue.
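
The trace ends in `read_manifest_path`, where `orjson.loads` hits an unexpected end of data partway through the dbt manifest. One possible explanation, not confirmed here, is that a child process reads the manifest while something else is still writing it. As a minimal sketch under that assumption, a reader could retry when parsing fails. `load_json_with_retry` and its parameters are hypothetical names, and the standard-library `json` module stands in for `orjson`:

```python
import json
import time
from pathlib import Path


def load_json_with_retry(path: Path, attempts: int = 3, delay_s: float = 0.5):
    """Read and parse a JSON file, retrying if the contents do not parse.

    Hypothetical helper for illustration; not part of Dagster or dagster-dbt.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            # json.loads accepts bytes, so the file is read in one call.
            return json.loads(path.read_bytes())
        except json.JSONDecodeError as err:
            # A truncated file fails to parse; wait, then re-read in case
            # a writer was still in the middle of producing it.
            last_error = err
            if attempt < attempts - 1:
                time.sleep(delay_s)
    # Every attempt failed, so surface the last parse error.
    raise last_error
```

A retry like this would only mask a read that races with a concurrent writer. It would not help if the file on disk is left permanently incomplete, and whether the parsed result can be handed to `@dbt_assets` directly is worth confirming against the dagster-dbt docs for the version in use.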

Our job flow is shown below. We are doing ELT: we collect all the ops needed to extract data from our SQL database, prioritize them, and then fan them out. The fanned-out ops then run in parallel, at most 4 at a time.

After all of the fanned-out extraction jobs are done, we then run dbt (sending a Slack webhook whether the run is manual or automatic).
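
The fan-out step described above (prioritize, then run at most 4 at a time) can be sketched outside Dagster as follows. This only illustrates the shape of the flow: it uses threads from the standard library rather than Dagster's multiprocess executor, and `run_extractions`, `extract`, and the `(priority, name)` pairs are hypothetical, with lower numbers assumed to mean higher priority:

```python
from concurrent.futures import ThreadPoolExecutor


def run_extractions(tables, extract, max_concurrent=4):
    """Run extract(name) for each table, highest priority first.

    tables: list of (priority, table_name) pairs; a lower number is assumed
    to mean higher priority. At most max_concurrent calls run at once.
    Returns {table_name: result} in priority order.
    """
    # Sort by priority, then by name, and keep only the names.
    ordered = [name for _, name in sorted(tables)]
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # map() submits tasks in priority order and yields results in that order.
        results = list(pool.map(extract, ordered))
    return dict(zip(ordered, results))
```

Only after `run_extractions` returns, meaning every extraction has finished, would the downstream dbt step start.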

We run on Azure's consumption plan (2 cores, 4 GB), which is enough power for our incremental load during a typical run. The first block of activity is the scheduled job; the second is me manually running it a few hours later.

We are upgrading to Dagster 1.9.7 soon, and with that I am starting to use automatic job retries, which should help, but I expect the same transient errors to continue.

Any thoughts on this would be appreciated!
