No such file / directory exceptions being thrown creating nodes when persisting XML #644
Just to clarify my point above, I think the problem, at least in part, occurs because the node's config.xml is written to the nodes/ folder before the node has been added to the list of nodes that Jenkins knows about.
As you can see, Slave.setNodeProperties() actually calls the save machinery that persists the node's config.xml to disk.
Meanwhile, as this node isn't in the list of nodes that Jenkins knows about, the node folder and the contents within are liable to be cleaned up for being non-current nodes. I hope this helps debug the problem. I've been looking at this all day, and have gone cross-eyed looking at all the classes and exceptions, so I hope I've done a reasonable job of explaining what we've found so far - I'll update if we come up with anything else.
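To make the ordering concrete, here is a hypothetical sketch of the sequence being described (the method and flow are illustrative, not the plugin's actual code):

```java
import java.io.IOException;
import java.util.List;

import hudson.model.Slave;
import hudson.slaves.NodeProperty;
import jenkins.model.Jenkins;

// Hypothetical sketch of the ordering described above (illustrative, not the actual
// docker-plugin source): the slave's config.xml is persisted as a side effect of
// setNodeProperties(), but Jenkins only learns about the node once addNode() runs.
class ProvisioningOrderSketch {
    void provision(Slave slave, List<? extends NodeProperty<?>> nodeProperties) throws IOException {
        slave.setNodeProperties(nodeProperties);   // persists nodes/<name>/config.xml
        // <-- window: a save() over the registered nodes can prune nodes/<name>/ here,
        //     because this slave has not been registered yet
        Jenkins.getInstance().addNode(slave);      // only now is the node "known" to Jenkins
    }
}
```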
Since deploying the fix by @PeppyHare we have seen a vast improvement in our Jenkins instance's ability to get through work. Before the fix, provisioning of new templates would have a brief spurt every 5 minutes, but would then be blocked until the timer ran out. See the graphs below for the queue backing up and the throughput staying very low until we restarted with the new plugin version (the yellow and green vertical bars). Curiously, it doesn't seem like this has fixed the problem completely - as you can see from this Kibana graph of our logs, we are catching a lot of IOExceptions with the new code, but there are still a lot of NoSuchFileException errors getting through. Looking into these, it seems we are now uncovering exceptions coming out of the addNode() code in Jenkins:
These are then also bubbling up and being caught by the node provisioning code:
I'll continue investigating.
I have just tidied up the Kibana graph a bit and it seems that we have successfully eliminated the issues coming from the plugin's own code. I was not aware of the exceptions coming from addNode() in Jenkins core until now. Given that we didn't see any of this before moving up to the latest Jenkins, I'm suspicious of something there now that we have worked around the one problem this plugin could have caused - at the moment there doesn't seem to be any way that I can see that calling addNode() for a newly created slave should fail like this.
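For what it's worth, NoSuchFileException is itself a subclass of IOException, so the remaining errors are presumably thrown from a code path (the addNode() call in Jenkins core) that the plugin-side catch does not wrap. A tiny standalone check of the exception hierarchy, with an illustrative path:

```java
import java.io.IOException;
import java.nio.file.NoSuchFileException;

// Standalone check: NoSuchFileException extends FileSystemException extends IOException,
// so any catch (IOException e) in the plugin would also swallow it. The NoSuchFileException
// entries still reaching the logs must therefore originate outside the wrapped code path.
public class CatchHierarchyDemo {
    public static void main(String[] args) {
        try {
            throw new NoSuchFileException("/var/jenkins_home/nodes/docker-example/config.xml"); // illustrative path
        } catch (IOException e) {
            System.out.println("caught as IOException: " + e);
        }
    }
}
```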
Folks, just to clarify a couple of points:
As for solutions...
OK, I've been looking into this. I've talked to @fraz3alpha and we're going to test removing the call that persists the slave's configuration before the node has been registered with addNode().
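For illustration, a hypothetical sketch of the reordered flow that removing that call would leave behind - not the actual merged change:

```java
import java.io.IOException;
import java.util.List;

import hudson.model.Slave;
import hudson.slaves.NodeProperty;
import jenkins.model.Jenkins;

// Hypothetical sketch (not the actual merged change): avoid persisting the slave's
// configuration before it is registered. Once addNode() has run, the node is in the
// set that a save over all nodes keeps, so a later property update can no longer race
// with the pruning of "unknown" node folders.
class ReorderedProvisionSketch {
    void provision(Slave slave, List<? extends NodeProperty<?>> nodeProperties) throws IOException {
        Jenkins.getInstance().addNode(slave);      // register first: nodes/<name>/ is now protected
        slave.setNodeProperties(nodeProperties);   // persisting the config is safe afterwards
    }
}
```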
Code changes have been merged. Will be fixed in the next release (which will probably be 1.1.5).
Plugin version: 1.1.4
Jenkins version: 2.107.1
Docker version: Version = swarm/1.2.8, API Version = 1.22, Docker 17.09.0-ce
We have recently moved up to Jenkins version 2.107.1, and have also updated our plugin to the latest version, 1.1.4. We are being persistently hit by the feature that disables provisioning of a template for 5 minutes after an exception is thrown. We are regularly seeing the following exception, which originates down in the code that persists the slave's configuration to an XML file on disk. We followed the exception to the AtomicFileWriter code, and it is clear at that point that the temporary file that it should be trying to move doesn't exist.
We have also seen exceptions where the folder for the slave doesn't exist, e.g.:
Both of these point to the contents of the nodes/ folder not being as expected - in the first case the source file has gone, and in the second case the folder it thought it had just created is not there.
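To make the failure mode concrete, here is a minimal, self-contained sketch (plain JDK, not the Jenkins AtomicFileWriter code) showing how an atomic-rename style write fails with NoSuchFileException when the staged temporary file, or its directory, is removed before the move:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Minimal, self-contained illustration (not Jenkins code): an atomic-rename style
// write stages a temporary file and then moves it into place; if the staging file
// (or its directory) is removed in between -- analogous to the nodes/<name>/ folder
// being pruned mid-save -- the move fails with NoSuchFileException.
public class AtomicWriteRaceDemo {
    public static void main(String[] args) throws IOException {
        Path nodeDir = Files.createTempDirectory("node-");
        Path tmp = nodeDir.resolve("config.xml.tmp");
        Path target = nodeDir.resolve("config.xml");

        Files.write(tmp, "<slave/>".getBytes());  // step 1: stage the new content

        // Simulate some other code cleaning up the "unknown" node folder in between:
        Files.delete(tmp);
        Files.delete(nodeDir);

        try {
            // step 2: the atomic move now fails, like the exception reported in this issue
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (NoSuchFileException e) {
            System.out.println("Move failed because the staged file is gone: " + e);
        }
    }
}
```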
I have been looking into this since Friday morning, and have considered a few possibilities, including:
- The Slave name is not unique, as it is created using `final String uid = Long.toHexString(System.nanoTime());` (link), and `nanoTime()` is not guaranteed to increment every nanosecond. However, we cannot find duplicates of the line `LOGGER.info("Trying to run container for node {} from image: {}", uid, getImage());`, indicating that all names are unique in the window we are looking at, so it is probably not to do with that.
- The `nodes/` folder is being cleaned/reset, and Jenkins doesn't know it should keep the folder for the Slave that is in the process of being created. I think this is likely to be the root of the problem, but I haven't found what is responsible yet. For example, there is a timing window between saving the configuration for a node via `setNodeProperties()` (which is listed in the stack trace as the call that ultimately results in the exceptions we are seeing) and adding it to the list of nodes that Jenkins knows about with `addNode()`. If, at any point, `save()` is called on the Nodes class, it will remove any node folders not added to the server (sketched below), so there is a definite timing window here.
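A rough sketch of the pruning behaviour described in the second bullet - a simplified illustration, not the actual jenkins.model.Nodes source:

```java
import java.io.File;
import java.util.Set;

// Simplified illustration (NOT the actual jenkins.model.Nodes source) of the pruning
// behaviour described above: after persisting the registered nodes, any nodes/<name>/
// directory that does not belong to a registered node is deleted. A slave whose
// config.xml has been written but which has not yet been passed to addNode() falls
// into the "unknown" bucket and loses its folder.
class NodeDirPruneSketch {
    static void pruneUnknownNodeDirs(File nodesDir, Set<String> registeredNodeNames) {
        File[] dirs = nodesDir.listFiles(File::isDirectory);
        if (dirs == null) {
            return;
        }
        for (File dir : dirs) {
            if (!registeredNodeNames.contains(dir.getName())) {
                deleteRecursive(dir); // a half-provisioned slave's folder disappears here
            }
        }
    }

    private static void deleteRecursive(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File c : children) {
                deleteRecursive(c);
            }
        }
        f.delete();
    }
}
```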
Curiously, all our node folders have the same timestamp, here is an excerpt:
The `docker-` slaves are dynamic, but those two labelled `taas-runbook` are not, so they shouldn't be being modified. We periodically see the timestamp on all files and folders being updated, e.g. at 11:00, 11:05, 11:06 - not every minute, but occasionally. I feel like there must be a timing window somewhere where the folder is being pruned of things that should not exist, and this happens between crucial parts of the configuration persistence, which means that the temporary file, or node folder, doesn't exist.
Alas there does not appear to be any logger we can add to Nodes.save() to see when this gets called.