Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No such file / directory exceptions being thrown creating nodes when persisting XML #644

Closed
4 tasks done
fraz3alpha opened this issue Apr 30, 2018 · 6 comments
Closed
4 tasks done
Assignees

Comments

@fraz3alpha
Copy link
Contributor

fraz3alpha commented Apr 30, 2018

  • docker-plugin version you use - 1.1.4
  • jenkins version you use - 2.107.1
  • docker engine version you use - Version = swarm/1.2.8, API Version = 1.22, Docker 17.09.0-ce
  • stack trace / logs / any technical details that could help diagnose this issue

We have recently moved up to Jenkins version 2.107.1, and have also updated our plugin to the latest version - 1.1.4. We are being persistently hit by the feature that disables provisioning of a template for 5 minutes after an exception is thrown. We are regularly seeing the following exception which originates down in the code that persists the slave's configuration to an XML file on disk :

Apr 30, 2018 12:02:44 PM com.nirima.jenkins.plugins.docker.DockerCloud$1 run
SEVERE: Error in provisioning; template='DockerTemplate{configVersion=2, labelString='old-sl sl swarm_node sl_swarm_node sl_always_online sl_swarm_node_latest', connector=io.jenkins.docker.connector.DockerComputerSSHConnector@2432d04f, remoteFs='/home/jenkins', instanceCap=500, mode=EXCLUSIVE, retentionStrategy=com.nirima.jenkins.plugins.docker.strategy.DockerOnceRetentionStrategy@5ecbf51, dockerTemplateBase=DockerTemplateBase{image='taas-docker-public.artifactory.swg-devops.com/swarm/dk-jenkins-slave:latest', pullCredentialsId='public.shared.artifactory.docker.registry.username.password', registry=null, dockerCommand='./setup_slave.sh', hostname='', dnsHosts=[9.9.9.9], network='', volumes=[], volumesFrom2=[], environment=[], bindPorts='', bindAllPorts=false, memoryLimit=null, memorySwap=null, cpuShares=null, privileged=true, tty=false, macAddress='null', extraHosts=[]}, removeVolumes=true, pullStrategy=PULL_NEVER, nodeProperties=[], disabled=BySystem,0 ms,4 min 59 sec,Template provisioning failed.}' for cloud='taas-internal-swarm'
java.nio.file.NoSuchFileException: /var/jenkins_home/nodes/docker-22f02f15d4988b/atomic3648298142656815004tmp
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
	at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:409)
	at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
	at java.nio.file.Files.move(Files.java:1395)
	at hudson.util.AtomicFileWriter.commit(AtomicFileWriter.java:206)
	at hudson.XmlFile.write(XmlFile.java:198)
	at jenkins.model.Nodes.save(Nodes.java:274)
	at hudson.util.PersistedList.onModified(PersistedList.java:173)
	at hudson.util.PersistedList.replaceBy(PersistedList.java:85)
	at hudson.model.Slave.setNodeProperties(Slave.java:299)
	at com.nirima.jenkins.plugins.docker.DockerTemplate.doProvisionNode(DockerTemplate.java:514)
	at com.nirima.jenkins.plugins.docker.DockerTemplate.provisionNode(DockerTemplate.java:468)
	at com.nirima.jenkins.plugins.docker.DockerCloud$1.run(DockerCloud.java:351)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
	Suppressed: java.nio.file.NoSuchFileException: /var/jenkins_home/nodes/docker-22f02f15d4988b/atomic3648298142656815004tmp -> /var/jenkins_home/nodes/docker-22f02f15d4988b/config.xml
		at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
		at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
		at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:396)
		at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
		at java.nio.file.Files.move(Files.java:1395)
		at hudson.util.AtomicFileWriter.commit(AtomicFileWriter.java:191)
		... 14 more

We followed the exception to the AtomicFileWriter code, and it is clear at that point that the temporary file that it should be trying to move doesn't exist.
We have also seen exceptions where the folder for the slave doesn't exist, e.g. :

Apr 30, 2018 12:08:06 PM com.nirima.jenkins.plugins.docker.DockerCloud$1 run
SEVERE: Error in provisioning; template='DockerTemplate{configVersion=2, labelString='old-sl sl swarm_node sl_swarm_node sl_always_online sl_swarm_node_latest', connector=io.jenkins.docker.connector.DockerComputerSSHConnector@2432d04f, remoteFs='/home/jenkins', instanceCap=500, mode=EXCLUSIVE, retentionStrategy=com.nirima.jenkins.plugins.docker.strategy.DockerOnceRetentionStrategy@5ecbf51, dockerTemplateBase=DockerTemplateBase{image='taas-docker-public.artifactory.swg-devops.com/swarm/dk-jenkins-slave:latest', pullCredentialsId='public.shared.artifactory.docker.registry.username.password', registry=null, dockerCommand='./setup_slave.sh', hostname='', dnsHosts=[9.9.9.9], network='', volumes=[], volumesFrom2=[], environment=[], bindPorts='', bindAllPorts=false, memoryLimit=null, memorySwap=null, cpuShares=null, privileged=true, tty=false, macAddress='null', extraHosts=[]}, removeVolumes=true, pullStrategy=PULL_NEVER, nodeProperties=[], disabled=No}' for cloud='taas-internal-swarm'
java.io.IOException: Failed to create a temporary file in /var/jenkins_home/nodes/docker-22f07b1f653cad
	at hudson.util.AtomicFileWriter.<init>(AtomicFileWriter.java:144)
	at hudson.util.AtomicFileWriter.<init>(AtomicFileWriter.java:109)
	at hudson.util.AtomicFileWriter.<init>(AtomicFileWriter.java:84)
	at hudson.util.AtomicFileWriter.<init>(AtomicFileWriter.java:74)
	at hudson.XmlFile.write(XmlFile.java:187)
	at jenkins.model.Nodes.persistNode(Nodes.java:160)
	at jenkins.model.Nodes.addNode(Nodes.java:142)
	at jenkins.model.Jenkins.addNode(Jenkins.java:2054)
	at com.nirima.jenkins.plugins.docker.DockerCloud$1.run(DockerCloud.java:356)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: No such file or directory
	at java.io.UnixFileSystem.createFileExclusively(Native Method)
	at java.io.File.createTempFile(File.java:2024)
	at hudson.util.AtomicFileWriter.<init>(AtomicFileWriter.java:142)
	... 14 more

Both of these are pointing to the contents of the nodes folder not being as it expects - in the first case the source file has gone, and in the second case the folder it thought it had just created is not there.

I have been looking into this since Friday morning, and have considered a few possibilities, including:

  1. The Slave name is not unique, as it is created using final String uid = Long.toHexString(System.nanoTime());, link and nanoTime() is not guaranteed to increment every nanosecond - however we cannot find duplicates of the line LOGGER.info("Trying to run container for node {} from image: {}", uid, getImage());, indicating that all names are unique in the window we are looking at - so it is probably not to do with that

  2. The nodes/ folder is being cleaned/reset. and Jenkins doesn't know it should keep the folder for the Slave that is in the process of being created.
    I think this is likely to be the root of the problem, but I haven't found what is responsible yet. For example, there is a timing window between saving the configuration for a node via setNodeProperties (which is listed in the stack trace as the call that ultimately results in the exceptions we are seeing)- and adding it to the list of nodes that Jenkins knows about with addNode()

If, at any point, save() is called on the Nodes class, it will remove any node folders not added to the server, so there is a definite timing window here

Curiously, all our node folders have the same timestamp, here is an excerpt:

drwxr-xr-x 2 jenkins jenkins 4096 Apr 30 12:22 docker-22f13e1e5c6047
drwxr-xr-x 2 jenkins jenkins 4096 Apr 30 12:22 docker-22f13e1e5e5d7d
drwxr-xr-x 2 jenkins jenkins 4096 Apr 30 12:22 docker-22f13e1fa6ba28
drwxr-xr-x 2 jenkins jenkins 4096 Apr 30 12:22 docker-22f13e20d054ac
drwxr-xr-x 2 jenkins jenkins 4096 Apr 30 12:22 docker-22f13e20d1dfca
drwxr-xr-x 2 jenkins jenkins 4096 Apr 30 12:22 taas-runbook-prod
drwxr-xr-x 2 jenkins jenkins 4096 Apr 30 12:22 taas-runbook-prod-littleton
...

The docker- slaves are dynamic, but those two labelled taas-runbook are not, so they shouldn't be being modified.

We periodically see the timestamp on all files and folders being updated, e.g. 11:00, 11:05, 11:06 - not every minute, but occasionally.

I feel like there must be a timing window somewhere where the folder is being pruned of things that should not exist, and this happens between crucial parts of the configuration persistence which means that the temporarily file, or node folder, doesn't exist.

Alas there does not appear to be any logger we can add to Nodes.save() to see when this gets called.

@fraz3alpha fraz3alpha changed the title No such file / directory exceptions being thrown creating nodes when saving config No such file / directory exceptions being thrown creating nodes when persisting XML Apr 30, 2018
@fraz3alpha
Copy link
Contributor Author

fraz3alpha commented Apr 30, 2018

Just to clarify my point above, I think the problem, at least in part, occurs because the setNodeProperties() method writes data to disk, when it might not be expected to. i.e. in the following block of code it appears that we are just setting object variables, ready to return a node if all goes well, but setNodeProperties() does more than that:

            final DockerTransientNode node = new DockerTransientNode(uid, containerId, remoteFs, launcher);
            node.setNodeDescription("Docker Agent [" + getImage() + " on "+ api.getDockerHost().getUri() + " ID " + containerId + "]");
            node.setMode(mode);
            node.setLabelString(labelString);
            node.setRetentionStrategy(retentionStrategy);
            node.setNodeProperties(nodeProperties);
            node.setRemoveVolumes(removeVolumes);
            node.setDockerAPI(api);
            finallyRemoveTheContainer = false;
            return node;

As you can see, Slave.setNodeProperties() actually calls the replaceBy() method on the DescribableList<NodeProperty<?>,NodePropertyDescriptor> nodeProperties object:

    public void setNodeProperties(List<? extends NodeProperty<?>> properties) throws IOException {
        nodeProperties.replaceBy(properties);
    }

DescribableList extends PersistedList, and that is what causes the contents of the properties to be persisted to an XML file - likely in a place we didn't expect anything like this to happen.

Meanwhile, as this node isn't in the list of nodes that Jenkins knows about, the node folder and the contents within are liable to be cleaned up for being non-current nodes.

I hope this helps debug the problem. I've been looking at this all day, and have gone cross-eyed looking at all the classes and exceptions, so I hope I've done a reasonable job of explaining what we've found so far - I'll update if we come up with anything else.

Update:
Given that we actually don't care if the data is persisted to disk at this point, we could swallow this error, and allow the addNode() call to persist it to disk at the point it is added to the server nodes model.
PersistedList.replaceBy() calls PersistedList.onModified(), which could throw an IOException, but only after it has stored the data in the object - so catching the IOException would not hurt the normal operation of the docker-plugin code

@fraz3alpha
Copy link
Contributor Author

Since deploying the fix by @PeppyHare we have seen a vast improvement in our Jenkins instance's ability to get through work. Before the fix provisioning of new template would have a brief spurt every 5 minutes, but be blocked then until the timer ran out. See the graphs below to see the queue backing up, and the throughput being very low until we restarted with the new plugin version (the yellow and green vertical bars)

screen shot 2018-05-01 at 10 09 43
screen shot 2018-05-01 at 10 09 52
screen shot 2018-05-01 at 10 10 03

Curiously, it doesn't seem like this has fixed the problem completely - as you can see from this Kibana graph of our logs, we are catching a lot of IOExceptions with the new code, but there are still a lot of NoSuchFileException errors getting through:

screen shot 2018-05-01 at 10 23 12

Looking into these, it seems we are now uncovering exceptions coming out of the addNode() code in Jenkins:

May 01, 2018 5:24:12 AM hudson.util.AtomicFileWriter commit
WARNING: Unable to move atomically, falling back to non-atomic move.
java.nio.file.NoSuchFileException: /var/jenkins_home/nodes/docker-2329036a4dfbe2/atomic4822874893213685997tmp -> /var/jenkins_home/nodes/docker-2329036a4dfbe2/config.xml
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
	at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:396)
	at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
	at java.nio.file.Files.move(Files.java:1395)
	at hudson.util.AtomicFileWriter.commit(AtomicFileWriter.java:191)
	at hudson.XmlFile.write(XmlFile.java:198)
	at jenkins.model.Nodes.persistNode(Nodes.java:160)
	at jenkins.model.Nodes.addNode(Nodes.java:142)
	at jenkins.model.Jenkins.addNode(Jenkins.java:2054)
	at com.nirima.jenkins.plugins.docker.DockerCloud$1.run(DockerCloud.java:356)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

These are then also bubbling up and being caught by the node provisioning code:

May 01, 2018 5:24:12 AM com.nirima.jenkins.plugins.docker.DockerCloud$1 run
SEVERE: Error in provisioning; template='DockerTemplate{configVersion=2, labelString='old-sl sl swarm_node sl_swarm_node sl_always_online sl_swarm_node_latest', connector=io.jenkins.docker.connector.DockerComputerSSHConnector@5a7cb0f3, remoteFs='/home/jenkins', instanceCap=500, mode=EXCLUSIVE, retentionStrategy=com.nirima.jenkins.plugins.docker.strategy.DockerOnceRetentionStrategy@6a97e5a5, dockerTemplateBase=DockerTemplateBase{image='taas-docker-public.artifactory.swg-devops.com/swarm/dk-jenkins-slave:latest', pullCredentialsId='public.shared.artifactory.docker.registry.username.password', registry=null, dockerCommand='./setup_slave.sh', hostname='', dnsHosts=[9.9.9.9], network='', volumes=[], volumesFrom2=[], environment=[], bindPorts='', bindAllPorts=false, memoryLimit=null, memorySwap=null, cpuShares=null, privileged=true, tty=false, macAddress='null', extraHosts=[]}, removeVolumes=true, pullStrategy=PULL_NEVER, nodeProperties=[], disabled=No}' for cloud='taas-internal-swarm'
java.nio.file.NoSuchFileException: /var/jenkins_home/nodes/docker-2329036a4dfbe2/atomic4822874893213685997tmp
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
	at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:409)
	at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
	at java.nio.file.Files.move(Files.java:1395)
	at hudson.util.AtomicFileWriter.commit(AtomicFileWriter.java:206)
	at hudson.XmlFile.write(XmlFile.java:198)
	at jenkins.model.Nodes.persistNode(Nodes.java:160)
	at jenkins.model.Nodes.addNode(Nodes.java:142)
	at jenkins.model.Jenkins.addNode(Jenkins.java:2054)
	at com.nirima.jenkins.plugins.docker.DockerCloud$1.run(DockerCloud.java:356)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
	Suppressed: java.nio.file.NoSuchFileException: /var/jenkins_home/nodes/docker-2329036a4dfbe2/atomic4822874893213685997tmp -> /var/jenkins_home/nodes/docker-2329036a4dfbe2/config.xml
		at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
		at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
		at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:396)
		at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
		at java.nio.file.Files.move(Files.java:1395)
		at hudson.util.AtomicFileWriter.commit(AtomicFileWriter.java:191)
		... 11 more

I'll continue investigating

@fraz3alpha
Copy link
Contributor Author

I have just tidied up the Kibana graph a bit and it seems that we have successfully eliminated the issues from the setNodeProperties() IOException, but we still have a set of errors from from the addNode() method, which have remained somewhat constant throughout:

screen shot 2018-05-01 at 10 59 22

I was not aware of the exceptions coming from addNode() when I originally raised this issue, as they were swamped by the exceptions from setNodeProperties() which were the main target of the investigation, ultimately resulting in the patch by @PeppyHare .

Given that we didn't see any of this before moving up to the latest Jenkins, I'm suspicious of something there now that we have worked around the one problem this plugin could have caused - at the moment there doesn't seem to be any way that I can see that calling addNode() could result in an error from the actions of the docker plugin

@pjdarton
Copy link
Member

pjdarton commented May 8, 2018

Folks,
I was on vacation when this was raised, hence wasn't in a position to investigate/comment, but I'm back at work now and wading through the backlog of emails etc.

Just to clarify a couple of points:

  • Firstly, this exception is not being caused by recent changes made in the docker plugin. The change here is that now it's being seen because now, when the plugin encounters an exception when creating a new container from a particular template, it logs the problem against that template and disables the template for 5 minutes. It's that disabling action that's making this issue more visible.
  • It's possible that this exception is caused by changes in the latest version of Jenkins. It's equally possible that this exception has always been the case.
  • I've not noticed this issue myself, but then my primary Jenkins server runs on Windows and it's possible that this mostly affects unix-based Jenkins services.

As for solutions...

  • Adding retries around unreliable methods is one option. e.g. .setProperties().
  • Ensuring that non-thread-safe methods aren't called at unsafe times.
  • Raising JIRA issues against the Jenkins core code to get any not-thread-safe methods rendered thread-safe.

@pjdarton
Copy link
Member

pjdarton commented May 8, 2018

OK, I've been looking into this.
setNodeProperties : I think this is a bug in the Jenkins core. I've raised JENKINS-51203 for this but the long and short of it is that the "nodeProperties" list thinks that its "owner" (i.e. what needs to be re-persisted to disk every time it changes) is the Jenkins core's master list of all Nodes, whereas I think that it should just be "the slave that the nodeProperties belong to".
addNode(): I don't think we should be calling this at all. The core code's provisioning logic will automatically add any node that we return to it (once the PlannedNode future completes). I think that this call was a copy/paste from other plugins that had it as a (poor) workaround to the (more general) over-provisioning bug whereby the Jenkins core keeps on asking for more slaves even though enough are in the process of being created.

I've talked to @fraz3alpha and we're going to test removing the call to addNode() and see if that's workable. If that works then I'll change the plugin code to remove that permanently. Then we can work on putting some robustness retries around setNodeProperties. I guess we can also put in an opportunistic optimization so that setNodeProperties doesn't get called at all unless the template provides some properties to be set.

@pjdarton
Copy link
Member

Code changes have been merged. Will be fixed in the next release (which will probably be 1.1.5).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants