Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Onyx never recovers from a transactor going down #23

Open
greywolve opened this issue Nov 5, 2016 · 8 comments
Open

Onyx never recovers from a transactor going down #23

greywolve opened this issue Nov 5, 2016 · 8 comments

Comments

@greywolve
Copy link
Contributor

greywolve commented Nov 5, 2016

We often have our transactor failover in production, and whenever this happens, we end up having to restart Onyx, which gets stuck. I suspect this is because the current datomic connection is never given a chance to be renewed. It's better to always call (d/connect db-uri) to get a fresh connection each time. There is no performance penalty here, because datomic caches the current connection, but should the connection drop, this is the only way to re-establish a new one. I'm happy to give you a PR for this.

@MichaelDrogalis
Copy link
Contributor

We may be better of placing an example lifecycle exception handler that restarts the job on a failed Datomic write. While it's tempting to try and re-up the connection every time, I'm a little skittish about having Onyx re-established failed connections because things like backpressure are in play here. @lbradstreet Thoughts?

@greywolve
Copy link
Contributor Author

@MichaelDrogalis could you give me an example of restarting a job from a lifecycle function?

@lbradstreet
Copy link
Member

@greywolve @MichaelDrogalis I'm on the fence. What @greywolve is suggesting is probably reasonable seeing it caches the connection anyway.

That said, my best guess is that your problem is still due to "bad things happening" under certain circumstances, and the transact never being notified whether a write has failed or not, and thus never deref'ing since there is no timeout on the transact. Could we try adding a parameter for a write timeout, at which point it'll reboot the peer, and see if that fixes your problems in production?

@MichaelDrogalis
Copy link
Contributor

@lbradstreet @greywolve I don't disagree that re-upping the connection each time is a totally reasonable thing to do -- I just believe that it doesn't truly solve the problem.

@greywolve See this section, plus the cheat sheet link, and the example below: http://www.onyxplatform.org/docs/user-guide/0.9.x/#_handle_exception

Also see this test: https://github.com/onyx-platform/onyx/blob/0.9.x/test/onyx/peer/lifecycle_exception_test.clj

Let me know if it's still not clear after reading those, we can cook up a new example.

@lbradstreet
Copy link
Member

@MichaelDrogalis Yup, the main point I was trying to make is that not having a timeout on the future that the transact returns is probably the main issue.

@greywolve I had actually created this branch a while back to implement this solution #24. Could you try out this solution with your code? I have hard coded the write timeout to 15s, after which it will reboot the peer. If you are able to confirm that it fixes the problem I will pull the timeout into the task-map and document it and cut a release. You can try a snapshot with version 0.9.12.1-20161107.135121-2, or pull the branch and do it yourself.

@greywolve
Copy link
Contributor Author

@lbradstreet sorry I have been on leave. I actually tried to kill the transactor and bring it back up again, and Onyx carried on fine once I brought it back up (locally, with transactor dev storage). So it seems the default timeout is kicking in. The only Datomic API call which doesn't do this afaik is d/sync, which you don't need to use anywhere.

That being said, I will try out your code when I can get a staging env up that I can perform a transactor failover on, which is a bit different to just killing and upping the transactor. Once I can reproduce the bug like that, then we can try out this code. Thanks though, I'll hopefully get time to test it out soon. :) (You might still be right about the timeout, maybe it doesn't time out during the special case of a transactor failover)

@lbradstreet
Copy link
Member

@greywolve thanks for the continuing investigation. Hopefully we can get a reproducible test case, because it'll be hard to get a sure-fire fix without it. I wonder if we could memory starve the transactor a bit to try to trigger such cases.

@MichaelDrogalis
Copy link
Contributor

Thanks for the update @greywolve. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants