feature request: delay restart before status FATAL #487
Comments
👍
👍 This would be really useful. Allowing us to configure how long to wait between retries would be great, and allowing for a backoff would also help. I think it would be great if we could also specify a retry count of infinity (never give up) in combination with a sane value for the delay between retries.
+1, would be useful.
👍 would be great. There's already a PR with that feature, #509, but it's not automergeable, and there's no discussion on it.
+1 |
2 similar comments
+1 |
+1 |
+1 This would be very useful for me
+1
2 similar comments
+1 |
+1
Does someone have enough knowledge to create a patch for this?
I would love to see this too, so 👍
+1
1 similar comment
👍
Would love to have this feature. 👍
👍
Why was this ticket closed? Was the feature request implemented? I can't see a code change on this issue or on #561.
+1
dear sir plz implement this alrdy k thx bai
👍 would like this feature, thanks!
+1, would be useful for me too!
+1, hoping for this!
+1
1 similar comment
+1
+1 for this.
+1
1 similar comment
+1
Several others have mentioned a "backoff" implementation, and the good news is that one already exists. The mechanism that implements `startretries` does employ a backoff strategy that increases the delay by 1 second with each attempt. In other words, you can set `startretries` to a value that is large enough to cover any "expected downtime" in your workflow without causing a self-inflicted DoS scenario. While it might be nice to have control over the backoff computation, I concur that this has been addressed, in as much as a self-inflicted DoS will not result from setting `startretries` to a large value.
Can anyone point to where in the docs the behaviour 0x20h describes ("add an increasing delay for each retry") is documented? Thanks.
@nvictor If it were documented, I doubt any of us would be here. ;) I discovered that the delay is increased by one second with each retry by examining the source.
@nvictor: @vlsd has already pointed to this documentation: http://supervisord.org/subprocess.html#process-states
I was also looking for a solution to my problem. I connect to an IMAP server and get a connection timeout. Starting the script after a little delay works great, but supervisor is too quick to restart (for this job). Therefore a delay as a configuration option would be great to prevent hacks like the 'sleep' option mentioned above (which I will try now, as there is no configurable alternative).
So it seems that only linear backoff is implemented (exponential backoff being the other common kind), and only at one hard-coded rate (1 second per retry, as per @cbj4074, but not actually documented). I no longer have a horse in this race, but it seems to me that giving users the ability to both switch between linear and exponential backoff and set the rate at which the backoff happens would be an ideal solution here. Failing that, documenting that the backoff is linear and set at a rate of 1 second per retry is also a solution. Simply mentioning that backoff happens, with no other details, is confusing.
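To make the terminology above concrete, here is a small illustrative sketch (my own, not supervisor's code); the function names, rates, and cap are arbitrary example values showing how the delay before the k-th retry would grow under the two strategies being discussed:

```python
def linear_delay(attempt, rate=1):
    """Linear backoff: 1 s, 2 s, 3 s, ... (what the thread says supervisor does today)."""
    return rate * attempt


def exponential_delay(attempt, base=1, factor=2, cap=300):
    """Exponential backoff: 1 s, 2 s, 4 s, 8 s, ... capped at `cap` seconds."""
    return min(base * factor ** (attempt - 1), cap)


for attempt in range(1, 6):
    print(attempt, linear_delay(attempt), exponential_delay(attempt))
```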
I am hitting the FATAL state issue: my setup effectively goes down because the workers stop processing jobs, even though supervisord itself is still running.
Thanks @jderusse. I wrote a new script to implement this function and then added it in front of the real command in my supervisord config:

```python
import datetime
import hashlib
import os
import sys
import time

max_backoff = 5        # cap the sleep (in seconds) between restarts
timeout = 60 * 60 * 6  # forget the backoff state if the last failure is older than this


def log(msg):
    print("[proc_wrapper %s] %s" % (str(datetime.datetime.now()), msg))


command = " ".join(sys.argv[1:])
# One state file per wrapped command, named after the command's hash.
backoff_file = hashlib.md5(command.encode("utf-8")).hexdigest()
log("Running '%s', backoff file name: %s" % (command, backoff_file))

status = os.system(command)
if status != 0:
    if not os.path.exists(backoff_file):
        seconds = -1
    else:
        with open(backoff_file) as f:
            content = f.read().split(":")
            seconds, last_timestamp = int(content[0]), int(content[1])
        if time.time() - last_timestamp > timeout:
            seconds = -1
    # Linear backoff: one extra second per consecutive failure, capped at max_backoff.
    if seconds + 1 > max_backoff:
        seconds = max_backoff
    else:
        seconds = seconds + 1
    with open(backoff_file, "w") as f:
        f.write("%d:%d" % (seconds, int(time.time())))
    log("Command '%s' exited with status %d, sleep %ds" % (command, status, seconds))
    time.sleep(seconds)
    sys.exit(1)
else:
    try:
        os.remove(backoff_file)
    except OSError:
        pass
```
+1
As explained in Supervisor/supervisor#487 (comment), `Supervisord` already applies an increasing delay between retries:

> The mechanism that implements `startretries` does employ a backoff strategy that increases the delay by 1 second with each attempt. In other words, you can set `startretries` to a value that is large enough to cover any "expected downtime" in your workflow without causing a self-inflicted DoS scenario.

I have tested this by putting a `sleep 20` in the wait-for-postgres.sh wrapper on the backend, to delay its startup. The client is still able to finally register when the backend comes up (the default value for `startretries` is 3).
+1
1 similar comment
+1
Has this problem been solved? How?
@zhaodanwjk the current behaviour is the following: on the first failure, restart after 1 second; on the second failure, restart after 2 seconds; then 3 seconds, 4 seconds, etc., until the FATAL state.
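A quick back-of-the-envelope sketch (my own illustration, not supervisor source) of what that behaviour implies: with a delay that grows by 1 second per failed start, `startretries` retries keep supervisord retrying for roughly 1 + 2 + ... + N seconds before it gives up and goes FATAL:

```python
def retry_window(startretries):
    """Approximate total seconds spent retrying before FATAL (ignores process run time)."""
    return startretries * (startretries + 1) // 2


# e.g. to cover ~100 s of expected downtime, startretries=14 would suffice,
# since retry_window(13) == 91 but retry_window(14) == 105.
for n in (3, 10, 13, 14):
    print(n, retry_window(n))
```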
Thank you very much. We can only wait until the new feature is developed, or adopt another method to solve it.
@zhaodanwjk To solve what? What do you expect? Isn't the existing `startretries` backoff enough?
@Chupaka Thank you very much, but I tried this method and it could not solve my problem. I need to wait 100 seconds before restarting automatically when the service quits abnormally.
@zhaodanwjk look at my previous comment #487 (comment)
A better workaround is to use a script that restarts fatal processes, by configuring an eventlistener that receives PROCESS_STATE_FATAL events.
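For anyone who wants to go this route, here is a minimal sketch of such a listener (my own illustration, not an official example; the script name, config entry, and 60-second delay are assumptions). It uses supervisor's `childutils` helpers and the XML-RPC `startProcess` call to bring a FATAL process back up after an extra pause:

```python
# fatal_restarter.py (hypothetical name) -- restart processes that enter FATAL.
# Registered in supervisord.conf with something like:
#   [eventlistener:fatal_restarter]
#   command=python /path/to/fatal_restarter.py
#   events=PROCESS_STATE_FATAL
import os
import sys
import time

from supervisor import childutils

EXTRA_DELAY = 60  # arbitrary example: wait a bit longer before retrying


def main():
    rpc = childutils.getRPCInterface(os.environ)
    while True:
        # wait() handshakes with supervisord (READY) and blocks for the next event.
        headers, payload = childutils.listener.wait(sys.stdin, sys.stdout)
        if headers["eventname"] == "PROCESS_STATE_FATAL":
            pheaders = childutils.get_headers(payload)
            name = "%s:%s" % (pheaders["groupname"], pheaders["processname"])
            time.sleep(EXTRA_DELAY)              # note: blocks further events meanwhile
            rpc.supervisor.startProcess(name)    # move the process out of FATAL
        childutils.listener.ok(sys.stdout)


if __name__ == "__main__":
    main()
```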
I will link the PR for this feature.
Use case:
During a software update, the server can't restart since some needed Python modules are not available yet. Supervisor retries N times (N comes from the config; AFAIK it defaults to 3).
After N failures, the FATAL state is entered.
In my use case the program entered the FATAL state, although just a few seconds later the restart would have been successful.
Can you understand this use case?
A possible solution would be a delay in seconds: after N failed retries, wait M seconds and then try again. In my case this should be repeated to infinity (an endless loop).
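To make the proposal concrete, here is a rough sketch of the requested behaviour (my own illustration, not existing supervisor functionality; the program path and the values of N and M are placeholders):

```python
import subprocess
import time

N_RETRIES = 3        # like startretries: quick attempts before backing off
DELAY_SECONDS = 60   # the requested "wait M seconds" knob (example value)

while True:
    for _ in range(N_RETRIES):
        # Hypothetical program path; a zero exit status means it ran and exited cleanly.
        if subprocess.call(["/usr/local/bin/my-server"]) == 0:
            raise SystemExit(0)
    # Instead of going FATAL after N failures, pause and keep trying forever.
    time.sleep(DELAY_SECONDS)
```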