Transient Errors During Job Runs #27032
joshua-janicas asked this question in Troubleshooting FAQs
Hi all,
I've been dealing with an odd situation in our self-hosted Dagster deployment. Our container (running on Debian 12) currently uses Dagster 1.8.13. Ever since we went live in production, we've noticed a transient error that prevents one of our ops from running, and it is generally a different op that fails each time. The error is in the collapsed block below. Note that in the image, the errored op is shown in grey and didn't even run.
Stack Trace
But when I run the job again, the whole flow works without issue.
Our job flow looks like the below. We are doing ELT: we gather all the ops needed to extract data from our SQL database, prioritize them, and then fan them out. The fanned-out ops then run in parallel, at most 4 at a time:
After all of the fanned-out extractions are done, we then run dbt (sending a Slack webhook whether it is a manually or automatically triggered run).
We run on Azure's consumption plan (2 cores, 4 GB), which is enough power for our incremental load during a typical run. The first block of activity is the scheduled job, and the second is me manually running it a few hours later.
We are upgrading to Dagster 1.9.7 soon, and with that I will begin using the automatic job retries, which should help, but I anticipate that the same transient errors will continue.
Any thoughts on this would be appreciated!