Concurrent Exec tests intermittently fail with various errors #7410
Making p2 for now. If it appears more frequently we can raise it again.
We will not see this fail because the tests are currently disabled:
Now blocking a product go-live. My suggestion for addressing the collisions is to add operation batching. The first vSphere operation will be dispatched and, if slow, will cause the others to batch up behind it. The next operation issued will be the composite of the blocked operations.
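A minimal sketch of that batching idea in Go follows. The names (batcher, change, apply) are assumptions for illustration, not the actual VIC portlayer API; the point is only that changes arriving while a vSphere operation is in flight get merged and dispatched as a single composite operation.

```go
// Illustrative only: batcher, change and apply are hypothetical names, not
// VIC's real types. The first change dispatches immediately; changes that
// arrive while it is in flight are merged into the next composite operation.
package batch

import "sync"

type change struct {
	key   string
	value string
}

type batcher struct {
	mu       sync.Mutex
	inflight bool
	pending  []change
	apply    func([]change) error // e.g. a single vSphere reconfigure call
}

// Submit dispatches c, or queues it behind an in-flight operation. In this
// sketch queued callers return immediately; a real implementation would hand
// the composite operation's result back to every caller it covered.
func (b *batcher) Submit(c change) error {
	b.mu.Lock()
	if b.inflight {
		b.pending = append(b.pending, c)
		b.mu.Unlock()
		return nil
	}
	b.inflight = true
	b.mu.Unlock()

	batch := []change{c}
	for {
		err := b.apply(batch)

		b.mu.Lock()
		if err != nil || len(b.pending) == 0 {
			b.inflight = false
			b.mu.Unlock()
			return err
		}
		// Everything queued while we were busy becomes one composite operation.
		batch, b.pending = b.pending, nil
		b.mu.Unlock()
	}
}
```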
Batching mechanism is written, and now needs extracting from its in-place location into a distinct package, as this also applies to volume create, disk pull, and host affinity.
I have batching code that is somewhat effective. We still legitimately see concurrent modifications at the port layer level, but we should not be seeing them propagate all the way back to the user under comparatively light loads, yet we are. The following is needed to determine what's occurring:
Also noted from inspecting the logs: there are far too many task.Inspect operations happening. It may be that @matthewavery's work on #7969 will address this, but if not then I should track it down. Also need to add the group serialization logic, or we will rarely actually hit the batching path. Currently exercising it via a hardcoded delay, but that's not viable for anything except testing.
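A rough sketch of that group serialization, again with hypothetical names rather than VIC's real code: operations sharing a group key (for example, a single container) take a per-group lock, so later operations queue behind a slow in-flight one and actually reach the batching path instead of racing past it.

```go
// Illustrative only: the group key and serializer type are assumptions.
// Operations sharing a group run strictly one at a time.
package batch

import "sync"

type serializer struct {
	mu     sync.Mutex
	groups map[string]*sync.Mutex
}

func newSerializer() *serializer {
	return &serializer{groups: make(map[string]*sync.Mutex)}
}

// Do runs fn while holding the lock for group, creating the lock on first use.
func (s *serializer) Do(group string, fn func() error) error {
	s.mu.Lock()
	m, ok := s.groups[group]
	if !ok {
		m = &sync.Mutex{}
		s.groups[group] = m
	}
	s.mu.Unlock()

	m.Lock()
	defer m.Unlock()
	return fn()
}
```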
I've added #7370 as a dependency because there's something in that space needed to correctly render the opid parent/child relationship in addition to the parentID being passed via swagger. These two issues are added as dependencies as untangling the cross-component concurrent operations is too unwieldy for effective progress in its current form. I am going to:
Having paused most of this work to complete #7969, I've just picked up some testing again and observed a failure to detect exec of an exec with the palliative locking in place. It looks as though we're performing a reload at the same time as handleSessionExit. We have the following, which I presume is a mix of two threads (the initializing session blocking on the lock held by handleSessionExit):
Updating the child reaper to take
Moving this out of scope for 1.4.1 now, as the mitigation work has merged in the branches noted above (#8101). I've put it into 1.4.2 simply to force discussion about whether there is additional work that can/should be done in a patch release.
@lcastellano: Is this completely fixed, or just better mitigated? Is there additional work planned for this? (If not, perhaps we should lower the priority and move back to the backlog.)
https://ci-vic.vmware.com/vmware/vic/19877/7
https://ci-vic.vmware.com/vmware/vic/19947/7 (re-occurred in 19947).
@zjs it's just better mitigated. No additional work planned at this time for 1.4.3. Moving to backlog. @renmaosheng I'm going to defer any investigation for the prior two comments as we were seeing odd network-level issues over this timeframe. If it occurs again we will open a dedicated issue for the output, as it's not part of the same symptom set covered by this issue.
@hickeng thanks, will file a new issue once we see it happen in the future.
Hi George, we have observed several failures of the 1-38-Docker-Exec -> Concurrent Simple Exec test (https://ci-vic.vmware.com/vmware/vic/20001/7): some of the 'exec ls' calls do not return content, but the return value is zero. Is this the same issue you are investigating in the PR? The current test starts 50 processes and the failure is intermittent. Do you think it is OK to move the test to the scenario suite rather than ci-integration, so it does not block build generation?
Intermittently hitting the "'' does not contain 'bin'" failure in the 1-38-Docker-Exec -> Concurrent Simple Exec test. The latest result is in https://ci-vic.vmware.com/vmware/vic/20267/.
Triage of the prior comment, #7410 (comment)
Starting point:
Locating the opid:
As such we can do the following:
From tether.debug - found the exec ID.
From the portlayer, looking for the 0 byte return:
Checking tether.debug (which is awkward because it doesn't report opIDs for differentiating between threads) we see the following IDs for the stdout/err/in MultiReader/MultiWriter - we expect these to have streams from an ssh session added to them when the attach succeeds:
Searching for the MultiWriters we get the following:
whereas, in contrast, we have the following for a successful exec (stdout):
Checking the portlayer we see that we timed out waiting for the interactive connection to complete:
Checking in the tether to see whether we received the unblock request: we did, but again there's the unexpected EOF error:
Following this logic - we look for where we block waiting for the unblock request, and we find that we do not! There is no message about it for the 8adc6 session.
The runblocking is set by a call to portlayer:attach.Bind() during the same set of calls that activates the session (and we can see the session is set to active in the keys just above):
I attempted to confirm the value was set in vmware.log, but that file has been throttled. I suspect that this is the only exec that failed to return output - that's the behaviour expected based on what's observed in the logs, but the test needs updating to allow explicit confirmation. Tasks:
@hickeng
@luwang-vmware I think you can use
@hickeng
The issue was hit again in recent CI testing; detailed logs are in https://ci-vic.vmware.com/vmware/vic/20509/9.
Moving to 1.5.1 for continued investigation.
After PR #6787, exec has become more stable in the concurrent and container-shutdown paths. However, it still suffers from some race-condition-style failures. This has mainly been seen in the CI environment and has passed locally (consecutively) against a high-resource, complex VSAN deployment. Below is a short catalog of the intermittent failures that have been seen:
Simple Concurrent Exec:
Feb 26 2018 16:27:06.142Z ERROR op=301.310: CommitHandler error on handle(c13c27abff7908e6146498325fa944c3) for 46bec94ca67b332d0f924df4f4d7c84168693b876a8aa15658050b2e3b2bf46e: The operation is not allowed in the current state.
Exec During Poweroff Of A Container Performing A Long Running Task
Exec During Poweroff Of A Container Performing A Short Running Task
REFERENCE LOGS:
Exec-CI-Failure-1-16518.zip
Exec-CI-Failure-2-16515.zip
Exec-Failure-3-FROM-FULL-CI.html.zip
Exec-Failure-3-FROM-FULL-CI-16510.zip
Currently using the following for basic concurrency testing:
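The snippet referred to above was not captured here. Purely as an illustration of that kind of check, a minimal Go sketch might launch a number of concurrent `docker exec <id> ls` calls and flag any that exit zero but return no output; the container name and worker count below are placeholders.

```go
// Illustrative sketch only, not the script referenced above. Runs several
// concurrent "docker exec <container> ls" commands and reports any that exit
// successfully but produce no output (the failure mode described in this issue).
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"sync"
)

func main() {
	const (
		container = "test-container" // placeholder container ID/name
		workers   = 50
	)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			out, err := exec.Command("docker", "exec", container, "ls").CombinedOutput()
			if err != nil {
				fmt.Printf("exec %d failed: %v\n", n, err)
				return
			}
			// Exit code was zero; explicitly confirm we actually got output.
			if !strings.Contains(string(out), "bin") {
				fmt.Printf("exec %d returned no 'bin' in output: %q\n", n, out)
			}
		}(i)
	}
	wg.Wait()
}
```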
TODO: