Add better handling of exec operations during poweroff #6787
Force-pushed 8ab0bb5 to a85da9a
Force-pushed d50d55b to f07566a
Force-pushed 3158431 to b7e0683
err = c.containerProxy.CommitContainerHandle(handleprime, id, 0)
if err != nil {
	op.Errorf("Failed to commit exec handle for container(%s) due to error(%s)", id, err)
	// FIXME: We cannot necessarily check for IsConflictError here yet since the state check also returns conflict error
This may no longer be true. I think I fixed this; I will check whenever time permits from the current issues I am working on.
lib/portlayer/task/common.go
Outdated
task := taskS
if handle.Runtime != nil && handle.Runtime.PowerState != types.VirtualMachinePowerStatePoweredOff {
	op.Debugf("Task bind configuration applies to ephemeral set")
	if okE {
Does it mean the exec task gets precedence over the session task?
Why not construct the logic like this:
```go
var task *executor.SessionConfig
if taskE, okE := handle.ExecConfig.Execs[id]; okE {
    task = taskE
} else if taskS, okS := handle.ExecConfig.Sessions[id]; okS {
    task = taskS
}
```
I have actually been fairly unsure whether crowding a branching statement with the var assignments is good practice in Go. I have made the change and the code definitely looks more concise; I avoided this originally because I was worried about readability.
lib/portlayer/task/common.go
Outdated
}

// if no task has been joined that can be manipulated in the container's current state
if task == nil {
Seems like this code makes no sense anymore.
You're absolutely right. I was focused on the case above and forgot to remove this one. There still remains a question for @hickeng as to whether we keep the old error message or take the new one.
Good catch on the empty if statement; I am surprised that `go vet` did not complain at all.
Force-pushed bf97b8f to bb2c10c
I have a real concern about how ConflictError is currently being used, as a mistaken or mismatched eid and cid could result in a loop that always returns NotFound, which gets translated into ConflictError, which triggers a retry.
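To make the concern concrete, here is a minimal self-contained sketch of that loop; `retryDo`, `conflictError`, and the lookup are illustrative stand-ins for the actual vic retry helper and error types, not their real implementations:

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative stand-ins for the real NotFound/Conflict error types.
var errNotFound = errors.New("not found")

type conflictError struct{ msg string }

func (e conflictError) Error() string { return e.msg }

func isConflictError(err error) bool {
	_, ok := err.(conflictError)
	return ok
}

// retryDo keeps retrying while the decider reports the error as retryable,
// mimicking retry.Do; capped by max only so this demo terminates.
func retryDo(op func() error, retryable func(error) bool, max int) error {
	var err error
	for i := 0; i < max; i++ {
		if err = op(); err == nil || !retryable(err) {
			return err
		}
	}
	return err
}

func main() {
	attempts := 0
	op := func() error {
		attempts++
		// A mismatched eid/cid always misses the lookup (NotFound)...
		err := errNotFound
		// ...but a blanket translation turns it into a ConflictError,
		// which the decider treats as retryable: the loop never exits
		// on its own even though the pair can never resolve.
		return conflictError{msg: err.Error()}
	}
	_ = retryDo(op, isConflictError, 5)
	fmt.Println("attempts:", attempts) // 5: every attempt was retried
}
```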
${rc} ${output}= Run And Return Rc And Output docker %{VCH-PARAMS} pull ${busybox}
Should Be Equal As Integers ${rc} 0
Should Not Contain ${output} Error
${suffix}= Evaluate '%{DRONE_BUILD_NUMBER}-' + str(random.randint(1000,9999)) modules=random
Let's have a newline above this; they're cheap and they add readability between steps.
Should Be Equal As Integers ${rc} 0
Should Not Contain ${output} Error
${suffix}= Evaluate '%{DRONE_BUILD_NUMBER}-' + str(random.randint(1000,9999)) modules=random
Set Test Variable ${ExecPowerOffContainer} I-Have-Two-Anonymous-Volumes-${suffix}
`I-Have-Two-Anonymous-Volumes` is confusing:
a. I don't think it does.
b. If using `busybox` or any other image specifically because of volume config, then tag it with a name. That way if/when the image changes on us it's easy to pick a new one that's suitable and there's no confusion as to why.
c. The volume facet seems to be completely irrelevant to this test.
If you're going to provide any sort of structured naming, then consider using `Exec-During-PowerOff-${suffix}` so it's immediately obvious which test it belongs to.
It does not; this was an accidental naming collision from when I was writing some code for the volume deletion issue I handled during vic-week. Sorry about that.
#\ ${combinedOutput}= Catenate ${combinedOutput} ${result.stderr}${\n}
#
#Should Contain ${combinedOutput} Cannot complete the operation, container ${id} has been powered off during execution
${rc} ${output}= Run And Return Rc And Output docker %{VCH-PARAMS} pull ${busybox}
This test should have an entry in the `.md` file describing what you are attempting to test and why, then the steps used to accomplish it.
This maps well to the hypothesis, test approach, test details stages I described at the engineering meeting.
Should Be Equal As Integers ${rc} 0

:FOR ${idx} IN RANGE 1 20
\ Start Process docker %{VCH-PARAMS} exec ${id} /bin/top alias=exec-%{VCH-NAME}-${idx} shell=true
You may actually get a higher likelihood of failure if this isn't a long-running process, as we'll be attempting to handle exit of exec sessions while the container is shutting down, as well as the start of new ones.
It's also possible that the short-running processes should be a separate step or test, as they may elicit distinctly different errors.
I will make a separate test for this. Then I can confirm how it performs.
lib/portlayer/task/inspect.go
Outdated
	}
	return nil, powerStateError
}
// print all of them, otherwise we will have to assemble the id list regardless of
This comment makes no sense to me. It implies that you would do something different if you could check log level - what would that be and why?
Looking at this more, I think I was trying to avoid looking through the lists. But the lists come from the handle, so they will all be associated with the correct call (or should be, barring some other failure). I am going to remove this comment entirely.
go func() {
	defer trace.End(trace.Begin(eid))

	// FIXME: NEEDS CONTAINER PROXY
I think this now has a container proxy entry - should be updated to use it.
@@ -481,8 +406,11 @@ func (c *Container) ContainerExecStart(ctx context.Context, eid string, stdin io
	defer cancel()

	// we do not return an error here if this fails. TODO: Investigate what exactly happens on error here...
	// FIXME: This being in a retry could lead to multiple writes to stdout(?)
This should not be writing an error in the goroutine - this needs properly integrating with the retry logic and context cancellation.
I'm unsure why this is in a goroutine, actually. We should check whether an `exec_start` event is generated when trying to exec a non-existent binary in regular docker - if it reflects the container behaviour, it will not generate that event on failure.
Please figure out whether this needs to be in a goroutine and comment why in the code, or adjust it appropriately.
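As a sketch of the shape that integration could take (the `startEventWait` helper is hypothetical, not the actual `ContainerExecStart` code): tie the goroutine to the attempt's context so the caller, not the goroutine, receives the error and can feed it to the retry logic instead of logging from a detached goroutine.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// startEventWait is a hypothetical stand-in for waiting on the exec_start
// event. The goroutine observes ctx, so a cancelled or retried attempt
// cannot leave it behind writing errors on its own.
func startEventWait(ctx context.Context) error {
	errc := make(chan error, 1)
	go func() {
		select {
		case <-time.After(50 * time.Millisecond): // simulated event arrival
			errc <- nil
		case <-ctx.Done(): // attempt cancelled or superseded by a retry
			errc <- ctx.Err()
		}
	}()
	// The caller decides what happens with the error, so it can be fed
	// into the retry decision rather than logged inside the goroutine.
	select {
	case err := <-errc:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	fmt.Println(startEventWait(ctx)) // context deadline exceeded
}
```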
if err != nil {
	op.Errorf("Failed to create exec task for container(%s) due to error(%s)", id, err)
	return "", InternalServerError(err.Error())
// NOTE: we should investigate what we can add or manipulate in the handle in the portlayer to make this check even more granular. Maybe Add the actually State into the handle config or part of the container info could be "snapshotted"
See my comment about `GetStateFromHandle` returning state much the same as `GetContainerInfo` or the truncated form from `GetContainerList` - this could then be passed through the `convert.State()` function to get the granular state I think you're talking about here.
// NOTE: we should investigate what we can add or manipulate in the handle in the portlayer to make this check even more granular. Maybe Add the actually State into the handle config or part of the container info could be "snapshotted"
// This check is now done off of the handle.
if state != "RUNNING" {
	return ConflictError(fmt.Sprintf("Container (%s) is not powered on, please start the container before attempting to an exec an operation", name))
This error is actually too coarse - the container could still be `Starting` or `Stopping` or in a bunch of transient states. The transient states would ideally result in a retry, given they're inherently unstable. Starting is the state where this is most likely to occur for us, as it's the state that doesn't really exist in regular docker, where a process is either running or it's not.
The same comment applies about verbosity - and we already have this message in string form elsewhere, suggesting a common state-check-to-error-function mapping could be useful: `container is not running, unable to perform an exec`
+1 on the terse error message.
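A minimal sketch of what that shared mapping could look like, reusing the state constants that appear later in this review; the sentinel error and plain `fmt` errors are simplified stand-ins for the persona's ConflictError/InternalServerError helpers:

```go
package main

import (
	"errors"
	"fmt"
)

// State names as they appear in this PR's diff.
const (
	CreatedState   = "CREATED"
	StartingState  = "STARTING"
	RunningState   = "RUNNING"
	StoppedState   = "STOPPED"
	SuspendedState = "SUSPENDED"
)

// errTransient marks states that should trigger a retry.
var errTransient = errors.New("transient state")

// execStateError is the single place that maps a state check to an error,
// so every exec call site shares one terse message.
func execStateError(state, id string) error {
	switch state {
	case RunningState:
		return nil // exec can proceed
	case StartingState:
		// transient: wrap the sentinel so callers can detect and retry
		return fmt.Errorf("container (%s) is still starting: %w", id, errTransient)
	default:
		return fmt.Errorf("container (%s) is not running, unable to perform an exec", id)
	}
}

func main() {
	for _, s := range []string{RunningState, StartingState, StoppedState} {
		fmt.Printf("%-8s -> %v\n", s, execStateError(s, "abc123"))
	}
}
```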
return ConflictError(fmt.Sprintf("Container (%s) is not powered on, please start the container before attempting to an exec an operation", name))
}

op.Debugf("State checks succeeded for exec operation on container(%s)", id)
I think we're going to have a log message from the CreateExecTask entry so this debug message adds no value.
@@ -23,5 +23,41 @@ This test requires that a vSphere server is running and available
* Step 2-6 should echo the ID given
* Step 7 should return an error

## Exec
Looks like a stray heading.
# Possible Problems:
None

# Exec Power Off test for long running Process
I'm a fan of this test doc approach as it makes it easier to follow along and understand what the test is accomplishing, as opposed to simply listing commands in the markdown file. cc @mhagen-vmware
Overhauling all our test docs would be way too tedious at this point; I propose we adopt this approach for all new tests and fix up any old ones we can along the way.
+1 to this. @mhagen-vmware, thoughts?
return "", InternalServerError(err.Error()) | ||
// NOTE: we should investigate what we can add or manipulate in the handle in the portlayer to make this check even more granular. Maybe Add the actually State into the handle config or part of the container info could be "snapshotted" | ||
// This check is now done off of the handle. | ||
if state != "RUNNING" { |
Could we use a constant here? I see `state.Running` was used previously.
So `state.Running` was previously a bool from the docker `ContainerState` struct. These new returns are really string representations of the internal state kept in the portlayer, so the portlayer is relaying the snapshotted state. I will make some constants for them, but I am not planning to move back to using the `ContainerState` struct, as it does not easily map for us. We do have a `convert.State` function, but it actually takes the `VicContainer` struct from the persona cache as the direct input. We are using a handle in order to check the state at the snapshot in time where we began manipulating aspects of the container.
Force-pushed 678dde7 to 532ff7b
Force-pushed 532ff7b to d3d9f03
	return InternalServerError(fmt.Sprintf("Container (%s) is not running", name))
case "STARTING":
	// This is a transient state, returning conflict error to trigger a retry in the operation.
	return ConflictError(fmt.Sprintf("container (%s) is still starting", id))
👍
Now this is how it's supposed to be used!
@@ -289,7 +298,7 @@ func (c *Container) ContainerExecInspect(eid string) (*backend.ExecInspect, erro
// Look up the container name in the metadata cache to get long ID
vc := cache.ContainerCache().GetContainerFromExec(eid)
if vc == nil {
-	return nil, NotFoundError(eid)
+	return nil, InternalServerError(fmt.Sprintf("No container was found with exec id: %s", eid))
Why not use the `NotFoundError`? This does seem like a 404 - presumably an incorrect `eid` was supplied.
When we look into the `NotFoundError` function, it is actually prefacing the message with "Container Not Found". What I will do is create my own custom `TaskNotFoundError` function instead, and place it in the same place as the `NotFoundError` function.
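A minimal sketch of the shape such a helper could take; the `httpError` type is a simplified stand-in, and the actual persona error types and message wording will differ:

```go
package main

import (
	"fmt"
	"net/http"
)

// httpError is an illustrative stand-in for the persona's error types; the
// point is that both helpers carry 404 semantics but differ in wording.
type httpError struct {
	status int
	msg    string
}

func (e httpError) Error() string { return e.msg }

// NotFoundError prefaces the message with container wording, which is
// misleading when the missing entity is an exec task.
func NotFoundError(cid string) error {
	return httpError{status: http.StatusNotFound, msg: fmt.Sprintf("No such container: %s", cid)}
}

// TaskNotFoundError keeps the 404 semantics with task-appropriate wording.
func TaskNotFoundError(eid string) error {
	return httpError{status: http.StatusNotFound, msg: fmt.Sprintf("No such exec task: %s", eid)}
}

func main() {
	fmt.Println(NotFoundError("c1"))     // No such container: c1
	fmt.Println(TaskNotFoundError("e1")) // No such exec task: e1
}
```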
@@ -155,13 +155,13 @@ Exec During Poweroff Of A Container Performing A Short Running Task

${suffix}= Evaluate '%{DRONE_BUILD_NUMBER}-' + str(random.randint(1000,9999)) modules=random
Set Test Variable ${ExecPoweroffContainerShort} Exec-Poweroff-${suffix}
-${rc} ${id}= Run And Return Rc And Output docker %{VCH-PARAMS} run -itd --name ${ExecPoweroffContainerShort} ${busybox} sleep 5
+${rc} ${id}= Run And Return Rc And Output docker %{VCH-PARAMS} run -itd --name ${ExecPoweroffContainerShort} ${busybox} /bin/ls /; sleep 20;
The chained commands here make me uncomfortable, as this would not work as expected when run in a shell.
Does it still work if you use `"/bin/ls; sleep 20"`?
Also, why are we making this sleep longer? If exec is running fast then we could easily have completed all of the execs before the sleep completes. I thought the essence of this test was to issue an exec as the container is stopping?
Exec is running faster in some cases, but I am unsure if the execs will all work here. We can change it back to 5; I changed it to 20 since I saw some execs succeeding sometimes. I will try your additional suggestion above and lower it to 5 seconds, then run it and see if any of the execs worked. They were not working before, which was odd.
@@ -232,16 +241,16 @@ func (c *Container) ContainerExecCreate(name string, config *types.ExecConfig) (
}

switch state {
-case "STOPPED":
+case StoppedState:
you can use a comma-separated list in a case statement:
```go
switch state {
case CreatedState, StoppedState, SuspendedState:
	return InternalServerError(fmt.Sprintf("Container (%s) is not running", name))
case StartingState:
	return ConflictError(fmt.Sprintf("container (%s) is still starting", id))
case RunningState:
	// noop
default:
	return InternalServerError(fmt.Sprintf("Container (%s) is in an unknown state: %s", id, state))
}
```
Good point, will fix this. I knew I was doing something silly there when I had so many wasted lines.
Should Contain ${combinedErr} Container (${id}) is not running

# We should get atleast one successful exec...
Should Contain ${combinedOut} bin
pull this out into a separate keyword - it's significant duplication in just these two tests.
I think our logic here is as follows and we can be explicit about it:
- should contain all directory entries, or
- should contain expected error messages:
a. container is powered off
b. container shut down during exec
c. ...
For this I would expect at least one success, so the keyword itself should probably check for both 1 and 2 in combination. `Simple Exec` is not meant to fail. If we abstract it out to a keyword then I can add the OR logic there in an effort to reduce the success checking in general, but having an OR on both success and failure may mean that test never fails.
Agreed on the keyword breakout; agreed on the OR. By my reading of this test it is completely valid to get back 10 valid errors and never see the directory output, especially if the first exec takes more than 10s to happen.
${suffix}= Evaluate '%{DRONE_BUILD_NUMBER}-' + str(random.randint(1000,9999)) modules=random
Set Test Variable ${ExecSimpleContainer} Exec-simple-${suffix}
${rc} ${id}= Run And Return Rc And Output docker %{VCH-PARAMS} run -itd --name ${ExecSimpleContainer} ${busybox} sleep 30
I think you meant this to be indefinite, not `sleep 30`. It looks as though this is testing concurrent exec, without the risk of the container stopping underneath us.
If it is supposed to be the container stopping underneath the execs, then I don't know how it differs from the Exec During Poweroff Of A Container Performing A Long Running Task test.
In fact, it's not clear how that test is intended to differ from the Exec During Poweroff Of A Container Performing A Short Running Task test.
Perhaps one of them was intended to have long-running exec tasks, so we can see what happens when the container dies underneath them rather than dies while trying to issue new ones?
One was expecting full success and the other not. Once I had written them and reworked some of them, they ended up being similar. I believe the short running task test should be shifted to having a long running exec; I think that would be a valuable use of that test. The simple case is to say we have a container and we expect all execs to succeed, so I will change that to `/bin/top`, as it should definitely be indefinite (pun not intended).
lgtm, just one minor issue
1. Pull an image that contains `sleep`. Busybox suffices here.
2. Create a container running `sleep` to simulate a short running process.
3. Run the container and detach from it.
4. Start 20 simple exec operations in parallel against the detached container.
These are not the steps you have implemented below.
Ah yeah, I need to clean that up since I changed the test. :( Fixing that now!
Minor comments, done from earliest commit to latest, so some may be out of date.
None are blockers, so approving. Please ensure there's an issue open to follow up on the behaviour under a large number of concurrent execs.
3. Run the container and detach from it.
4. Start 20 simple exec operations in parallel against the detached container.
-5. Stop the container while the execs are still running in order to trigger the exec power off errors.
+5. Explicitly stop the container while the execs are still running in order to trigger the exec power off errors.
I think there is a conceptual race between (4) and (5). You are attempting to ensure that some of the execs have been dispatched before stop takes effect, and that some are issued during the stop process.
It's ok to note that this test is stochastic and not expected to behave in the same manner every time.
@@ -267,7 +267,12 @@ func (c *Container) ContainerExecCreate(name string, config *types.ExecConfig) (
	return nil
}

-if err := retry.Do(operation, IsConflictError); err != nil {
+// configure custom exec backoff config
+backoffConf := retry.NewBackoffConfig()
You have the same config duplicated twice - consider consolidating them if they are supposed to be the same.
Also note why those times have been chosen; if lacking concrete numbers it's okay to express what you're attempting to accomplish with these timings. At the moment it's left to the imagination of the reader.
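A sketch of that consolidation under assumed names and numbers; the field names and durations below are illustrative, not the actual `retry.BackoffConfig` values. The point is one shared config with the timing rationale recorded beside it:

```go
package main

import (
	"fmt"
	"time"
)

// backoffConfig mirrors common backoff knobs; it is a stand-in for the
// project's retry.BackoffConfig, whose exact fields may differ.
type backoffConfig struct {
	InitialInterval time.Duration
	MaxInterval     time.Duration
	MaxElapsedTime  time.Duration
}

// execBackoff is the single shared config both call sites would use.
// Rationale written down rather than left to the reader: start fast because
// most handle conflicts resolve quickly, but bound the total time so an exec
// against a container that really is powering off fails promptly instead of
// hanging. (Durations here are assumptions for illustration.)
var execBackoff = backoffConfig{
	InitialInterval: 500 * time.Millisecond,
	MaxInterval:     5 * time.Second,
	MaxElapsedTime:  20 * time.Second,
}

func main() {
	fmt.Printf("%+v\n", execBackoff)
}
```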
@@ -171,19 +171,41 @@ Exec During Poweroff Of A Container Performing A Long Running Task
${rc} ${id}= Run And Return Rc And Output docker %{VCH-PARAMS} run -itd --name ${ExecPoweroffContainerLong} ${busybox} /bin/top
You have `sleep` documented in the test markdown.
lib/portlayer/task/wait.go
Outdated
@@ -33,7 +33,7 @@ func Wait(op *trace.Operation, h interface{}, id string) error {
}

if handle.Runtime != nil && handle.Runtime.PowerState != types.VirtualMachinePowerStatePoweredOn {
-	err := fmt.Errorf("Unable to wait for task when container %s is not running", id)
+	err := fmt.Errorf("Unable to wait for task when container %s is not running", handle.ExecConfig.ID)
incorrect capitalization of error message
Verify Poweroff During Exec Error Message
	[Arguments] ${error} ${containerID} ${containerName}
	Set Test Variable ${msg1} Container (${containerName}) is not running
	Set Test Variable ${msg2} container (${containerID}) has been poweredoff
`poweredoff` -> `powered off`, or even `stopped`
Force-pushed a412fef to 2cd8d65
Quick read-through - personality changes LGTM
Personality changes: LGTM 👍
Force-pushed efbf8b7 to 96aa3de
* fix wording in test file
* fix wording in exec create persona path
* fix wording in the task inspect portlayer path
Also removes incorrect and unneeded comments
We were not reconfiguring sessions for Runblock if they were not configured that way initially, yet were reconfigured before activation.
Also augments the markdown file with more information under the test purpose section
Also force a build
Force-pushed fe4495e to 3d6ec6e
Codecov Report

@@            Coverage Diff            @@
##           master    #6787     +/-  ##
=========================================
- Coverage   31.53%   31.52%   -0.02%
=========================================
  Files         278      278
  Lines       41894    42004    +110
=========================================
+ Hits        13213    13243     +30
- Misses      27518    27601     +83
+ Partials     1163     1160      -3

Continue to review full report at Codecov.
[full ci]
This PR has a number of changes to the exec path, mostly adding more structured handling of scenarios. List of changes:
* `ExecStart` to fetch new handle and attempt the configuration from the beginning.
* `task.Inspect`, `task.Bind`, and `task.Unbind`
* `InspectTask` and `BindTask`. Looks like `UnbindTask` is not used anywhere until ref counting on the streams is implemented.

Fixes #6744
Fixes #6370

Also will affect tickets like #5819, as they encountered stream issues during an exec. This should be significantly reduced for one-off execs, though it is still easily forcible with enough concurrent execs against the same target. It can also be forced by catching the container as it is shutting down at just the right time.