Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concurrent Exec tests intermittently fail with various errors #7410

Open
1 task
matthewavery opened this issue Feb 27, 2018 · 22 comments
Open
1 task

Concurrent Exec tests intermittently fail with various errors #7410

matthewavery opened this issue Feb 27, 2018 · 22 comments
Labels
component/portlayer/execution component/portlayer/interaction kind/defect Behavior that is inconsistent with what's intended priority/p1 source/ci Found via a continuous integration failure source/customer Reported by a customer, directly or via an intermediary team/foundation

Comments

@matthewavery
Copy link
Contributor

matthewavery commented Feb 27, 2018

After PR #6787 exec has become more stable in the concurrent and container shutdown paths. However, it still suffers from some race condition style failures. This has mainly been seen in the CI environment and has passed locally(consecutively) against a high resource complex VSAN deployment. below is a short catalog of the failures that have been seen that will cause intermittent failures:

Simple Concurrent Exec:

  • Failed with a Feb 26 2018 16:27:06.142Z ERROR op=301.310: CommitHandler error on handle(c13c27abff7908e6146498325fa944c3) for 46bec94ca67b332d0f924df4f4d7c84168693b876a8aa15658050b2e3b2bf46e: The operation is not allowed in the current state.

Exec During Poweroff Of A Container Performing A Long Running Task:

  • Failed because all execs actually succeeded(might need to improve test expectations)
  • Container reported being in an invalid state.

Exec During Poweroff Of A Container Performing A Short Running Task:

  • Long running exec returned an rc of 1 even though the output looked present.(looks successful in the portlayer of Exec-Failure-1)

REFERENCE LOGS:

Exec-CI-Failure-1-16518.zip
Exec-CI-Failure-2-16515.zip
Exec-Failure-3-FROM-FULL-CI.html.zip
Exec-Failure-3-FROM-FULL-CI-16510.zip

Currently using the following for basic concurrency testing:

c=1;date;id=`docker ps -q | awk '{print$1}'`; for i in `seq 1 $c`; do docker ^Cec $id /bin/echo /tmp/$i &done;for i in `seq 1 $c`;do wait %$i 2>/dev/null;done;date

TODO:

  • investigate why vCenter reconfigure slows down over time with number of execs. Could be number of extraconfig keys in the VM config (we should not be sending the entire set to VC, but VC may be relaying the entire set to ESX - compare/contrast with ESX directy). Could also be that reconfigures result in accrual of state, irrespective of the keys.
@mdubya66
Copy link
Contributor

Making p2 for now. If it appears more frequently we can raise it again

@mhagen-vmware
Copy link
Contributor

We will not see this fail because the tests are currently disabled:
https://github.com/vmware/vic/blob/master/tests/test-cases/Group1-Docker-Commands/1-38-Docker-Exec.robot

@hickeng hickeng added priority/p0 source/customer Reported by a customer, directly or via an intermediary and removed priority/p2 labels Apr 26, 2018
@hickeng
Copy link
Member

hickeng commented Apr 26, 2018

Now blocking a product go-live.

My suggestion for addressing the collisions is to add operation batching. The first vsphere operation will be dispatched, and if slow will cause the others to batch up behind it. The next operation issued will be the composite of the blocked operations.

@hickeng
Copy link
Member

hickeng commented May 23, 2018

Batching mechanism is written, and now needs extracting from in-place into a distinct package as this also applies to volume create, disk pull, and host affinity.
The outstanding piece for functional demonstration is the merging of the specs. I tried a general approach to begin with but it got too large for this specific issue. Now need to change it to be exec specific.

@hickeng
Copy link
Member

hickeng commented May 24, 2018

I have batching code that is somewhat effective. We still legitimately see concurrent modifications at the port layer level, but we should not be seeing them propagate all the way back to the user in comparatively light loads, yet we are.

The following is needed to determine what's occurring:

  1. add logging of the handle to the port layer rest handlers so the full operation can be tracked (or provide cross component opid which would be better)
  2. determine whether the operations are taking so long that this is a legitimate issue, or
  3. figure out why a random backoff with increasing window is resulting in such frequent collisions - possible error in window implementation?

Also noted from inspecting the logs - there's way too many task.Inspect operations happening. It may be that @matthewavery work on #7969 will address this, but if not then I should track it down.

Also need to add the group serialization logic, or we will rarely actually hit the batching path. Currently exercising it via a hardcoded delay but that's not viable for anything except testing.

@hickeng
Copy link
Member

hickeng commented May 25, 2018

I've added #7370 as a dependency because there's something in that space needed to correctly render the opid parent/child relationship in addition to the parentID being passed via swagger.

These two issues are added as dependencies as untangling the cross component concurrent operations is too unwieldy for effective progress in it's current form.

I am going to:

  • add a palliative semaphore to the exec logic in the personality as a temporary measure, but that's not a fix for the problem.
  • try an absurdly long delay in the batching code to see if this works as expected when there is no contention
  • complete and open a PR for the batching logic without any of the exec targeted changes so it can be used for volume creation batching and portlayer: lazy disk detach operations #5677

@hickeng
Copy link
Member

hickeng commented May 29, 2018

Having paused most of this work to complete #7969 I've just picked up some testing again and observed a failure to detect exec of an exec with the palliative locking in place.

It looks as though we're performing a reload at the same time as handleSessionExit. We have the following which I presume is a mix of two threads (initializing session blocking on the lock held by handleSessionExit):

May 29 2018 20:35:28.484Z DEBUG [ END ]  [vic/lib/tether.establishNonPty:282] [22.034718ms] initializing nonpty handling for session e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d
May 29 2018 20:35:28.496Z DEBUG Launched command with pid 480
May 29 2018 20:35:28.497Z DEBUG [BEGIN]  [vic/pkg/serial.(*RawConn).Write:147]
May 29 2018 20:35:28.501Z DEBUG [ END ]  [vic/pkg/serial.(*RawConn).Write:147] [4.728902ms]
May 29 2018 20:35:28.518Z DEBUG Encoding result of launch for session e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d under key: guestinfo.vice./execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc5
May 29 2018 20:35:28.496Z DEBUG Writer goroutine for stderr returned: %!s(<nil>)
May 29 2018 20:35:28.542Z DEBUG Writer goroutine for stderr exiting
May 29 2018 20:35:28.496Z DEBUG Inspecting children with status change
May 29 2018 20:35:28.553Z DEBUG Reaped process 480, return code: 0
May 29 2018 20:35:28.558Z DEBUG Removed child pid: 480
May 29 2018 20:35:28.535Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.detail.createtime@non-persistent, value: "0"
May 29 2018 20:35:28.502Z DEBUG [0xc420269d70] write "1\n" to 1 writers (err: <nil>)
May 29 2018 20:35:28.583Z DEBUG Writer goroutine for stdout returned: %!s(<nil>)
May 29 2018 20:35:28.589Z DEBUG Writer goroutine for stdout exiting
May 29 2018 20:35:28.578Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.detail.starttime@non-persistent, value: "1527626128"
May 29 2018 20:35:28.611Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.detail.stoptime@non-persistent, value: "0"
May 29 2018 20:35:28.627Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.diagnostics.resurrections@non-persistent, value: "0"
May 29 2018 20:35:28.679Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.status@non-persistent, value: "0"
May 29 2018 20:35:28.698Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.started@non-persistent, value: "true"
May 29 2018 20:35:28.675Z DEBUG [ END ]  [vic/pkg/serial.(*RawConn).Read:100] [368.928273ms]
May 29 2018 20:35:28.762Z DEBUG [BEGIN]  [vic/pkg/serial.(*RawConn).Read:100]
May 29 2018 20:35:28.862Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.runblock@non-persistent, value: "false"
May 29 2018 20:35:28.875Z DEBUG [ END ]  [vic/lib/tether.(*tether).launch:748] [1.670847819s] launching session e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d
May 29 2018 20:35:28.913Z INFO  Loading main configuration
May 29 2018 20:35:28.913Z DEBUG [BEGIN]  [vic/lib/tether.(*tether).handleSessionExit:678] handling exit of session e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d
May 29 2018 20:35:28.913Z DEBUG Waiting on session.wait
May 29 2018 20:35:28.914Z DEBUG Wait on session.wait completed
May 29 2018 20:35:28.915Z DEBUG Calling wait on cmd
May 29 2018 20:35:28.915Z DEBUG Wait returned waitid: no child processes
May 29 2018 20:35:28.916Z DEBUG Calling close on writers
May 29 2018 20:35:28.918Z DEBUG [0xc420269d70] Close on writers
May 29 2018 20:35:28.923Z DEBUG [0xc420269d70] Closing writer &{chanType:attach extraData:[101 51 48 100 100 98 97 98 54 102 57 53 54 97 50 100 100 48 52 55 54 52 52 49 57 102 51 48 55 53 55 49 50 54 102 97 48 49
May 29 2018 20:35:28.949Z DEBUG [0xc420269d70] is a CloseWriter%!(EXTRA *ssh.channel=&{attach [101 51 48 100 100 98 97 98 54 102 57 53 54 97 50 100 100 48 52 55 54 52 52 49 57 102 51 48 55 53 55 49 50 54 102 97 4
May 29 2018 20:35:28.977Z DEBUG [BEGIN]  [vic/pkg/serial.(*RawConn).Write:147]
May 29 2018 20:35:28.941Z DEBUG GuestInfoSource: key: guestinfo.vice./common/name, value: "", error: key not found
May 29 2018 20:35:28.983Z DEBUG GuestInfoSource: key: guestinfo.vice./common/id, value: "70a90488d7e0e4413fc56dbb769e978cc47ac7132fdf50e009d4abeb4ba52fa2", error: %!s(<nil>)
May 29 2018 20:35:28.985Z DEBUG GuestInfoSource: key: guestinfo.vice./diagnostics/debug, value: "2", error: %!s(<nil>)
May 29 2018 20:35:28.987Z DEBUG GuestInfoSource: key: guestinfo.vice./sessions, value: "70a90488d7e0e4413fc56dbb769e978cc47ac7132fdf50e009d4abeb4ba52fa2", error: %!s(<nil>)
May 29 2018 20:35:28.988Z DEBUG [ END ]  [vic/pkg/serial.(*RawConn).Write:147] [10.961665ms]
May 29 2018 20:35:28.989Z DEBUG [0xc420269e00] Close on writers
May 29 2018 20:35:28.991Z DEBUG [0xc420269e00] Closing writer &{code:1 ch:0xc4200a5500}
May 29 2018 20:35:28.993Z DEBUG Calling close on reader
May 29 2018 20:35:28.994Z DEBUG [0xc4203d0880] Close on readers
May 29 2018 20:35:28.995Z DEBUG [0xc4203d0880] Closing reader &{chanType:attach extraData:[101 51 48 100 100 98 97 98 54 102 57 53 54 97 50 100 100 48 52 55 54 52 52 49 57 102 51 48 55 53 55 49 50 54 102 97 48 49
May 29 2018 20:35:29.000Z DEBUG [BEGIN]  [vic/pkg/serial.(*RawConn).Write:147]
May 29 2018 20:35:29.000Z DEBUG [ END ]  [vic/pkg/serial.(*RawConn).Write:147] [14.044µs]                                                                                                                         )
May 29 2018 20:35:29.000Z DEBUG Unblocking session: e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d
May 29 2018 20:35:29.002Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.detail.createtime@non-persistent, value: "0"
May 29 2018 20:35:28.992Z DEBUG GuestInfoSource: key: guestinfo.vice./sessions|70a90488d7e0e4413fc56dbb769e978cc47ac7132fdf50e009d4abeb4ba52fa2/common/ExecutionEnvironment, value: "", error: %!s(<nil>)
May 29 2018 20:35:29.006Z DEBUG GuestInfoSource: key: guestinfo.vice./sessions|70a90488d7e0e4413fc56dbb769e978cc47ac7132fdf50e009d4abeb4ba52fa2/common/id, value: "70a90488d7e0e4413fc56dbb769e978cc47ac7132fdf50e00
May 29 2018 20:35:29.009Z DEBUG GuestInfoSource: key: guestinfo.vice./sessions|70a90488d7e0e4413fc56dbb769e978cc47ac7132fdf50e009d4abeb4ba52fa2/common/name, value: "", error: %!s(<nil>)
May 29 2018 20:35:29.010Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.detail.starttime@non-persistent, value: "1527626128"
May 29 2018 20:35:29.011Z DEBUG GuestInfoSource: key: sessions|70a90488d7e0e4413fc56dbb769e978cc47ac7132fdf50e009d4abeb4ba52fa2/common/notes, value: "", error: key not found
<snip>
May 29 2018 20:35:34.592Z DEBUG Initializing session e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d
May 29 2018 20:35:34.624Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.detail.stoptime@non-persistent, value: "0"
May 29 2018 20:35:34.626Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.diagnostics.resurrections@non-persistent, value: "0"
May 29 2018 20:35:34.628Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.status@non-persistent, value: "0"
May 29 2018 20:35:34.630Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.started@non-persistent, value: "true"
May 29 2018 20:35:34.632Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.runblock@non-persistent, value: "false"
May 29 2018 20:35:34.634Z DEBUG Calling t.ops.HandleSessionExit
May 29 2018 20:35:34.635Z DEBUG [ END ]  [vic/lib/tether.(*tether).handleSessionExit:678] [5.722602311s] handling exit of session e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d

Updating the child reaper to take t.config.Lock() while writing the completion state avoids the StopTime being overwritten, but we still have the problem of the Reload overwriting live session state - the most problematic being the started field.

@hickeng
Copy link
Member

hickeng commented Jun 29, 2018

Moving this out of scope for 1.4.1 now as the mitigation work has merged in the branches noted above (#8101). I've put it into 1.4.2 simply to force discussion about whether there is additional work that can/should be done in a patch release.

@hickeng hickeng assigned lcastellano and unassigned hickeng Aug 2, 2018
@zjs
Copy link
Member

zjs commented Aug 23, 2018

@lcastellano: Is this completely fixed, or just better mitigated? Is there additional work planned for this? (If not, perhaps we should lower the priority and move back to the backlog.)

@zjs zjs removed this from the Sprint 33 Foundation milestone Aug 23, 2018
@renmaosheng
Copy link
Contributor

https://ci-vic.vmware.com/vmware/vic/19877/7
failed in 1-38-Docker-Exec -> Concurrent Simple Exec
it seems the exec succeeds, but the output is not copied to stout.

@renmaosheng
Copy link
Contributor

https://ci-vic.vmware.com/vmware/vic/19947/7 re-occurred in 19947.

@hickeng
Copy link
Member

hickeng commented Sep 5, 2018

@zjs it's just better mitigated. No additional work planned at this time for 1.4.3. Moving to backlog.

@renmaosheng I'm going to defer any investigation for the prior two comments as we were seeing odd network level issues over this timeframe. If it occurs again we will open a dedicated issue for the output as it's not part of the same symptom set covered by this issue.

@renmaosheng
Copy link
Contributor

@hickeng thanks, will file a new issue once we see it happens in future.

@renmaosheng
Copy link
Contributor

renmaosheng commented Sep 11, 2018

Hi George, we observed several times failure of the 1-38-Docker-Exec -> Concurrent Simple Exec test(https://ci-vic.vmware.com/vmware/vic/20001/7 ), some of the 'exec ls' not returning the content, but return value is Zero. Is this the same issue you are investigating in the PR? current test starting 50 processes, it will happen intermittently, do you think it is OK to move the test to scenario? but not in the ci-integration to prevent from build generation?

Concurrent Simple Exec                                                | FAIL |
'' does not contain 'bin'

@luwang-vmware
Copy link
Contributor

Intermittently hit the issue of ''' does not contain 'bin'' in 1-38-Docker-Exec -> Concurrent Simple Exec test. Latest result is in https://ci-vic.vmware.com/vmware/vic/20267/.

@hickeng hickeng assigned hickeng and unassigned lcastellano Nov 2, 2018
@hickeng hickeng added the source/ci Found via a continuous integration failure label Nov 2, 2018
@hickeng
Copy link
Member

hickeng commented Nov 3, 2018

Triage of the prior comment, #7410 (comment)

Starting point:
log.html shows that the target container is Exec-simple-20267-2329 and it's the 37th iteration of the exec that failed - we do not know if subsequent iterations failed from this log output.

Locating the opid:
The test does not do anything useful for differentiating between iterations such as packing the index into an environment variable or similar. We have the Start/End/Elapsed time: 20181101 22:03:25.277 / 20181101 22:03:25.323 / 00:00:00.046 - a quick check of the VCH logs shows this time is not useful to us (the minutes to no align with the operation in question - the hours are UTC-7 at a glance).

As such we can do the following:

  1. locate the 37th ExecCreate to be dispatched to the portlayer from personality.log
  2. locate either:
    1. the 37th exec to be created in the portlayer log
    2. the first exec that returned 0 bytes of output from the portlayer log
  3. try to locate the 37th iteration in tether to get the exec ID from tether.debug
  4. look for an exec that did not return data from the portlayer (as this has effective logging for the purpose)

From tether.debug - found exec ID a4b931bcace3bd6e9e463fd42727985d0ad4f65aef1e331e12fa521fb2824532 as the 37th via regexp Encoding result of launch for session .* under key: guestinfo.vice./execs (expression identified simply by looking at tether.debug for a string unique to launch of exec processes.

From portlayer looking for 0 byte return:

  1. find the time range we're interested in - search for cID 84ce478f049f and extract all operations for that container into a different file.
  2. we expected to see the following pairs of lines for stdout/err when a session is closed (as an example) - the ID here is the exec ID, not the container ID:
DEBUG Finished streaming stdout for f6f2ab82587f230f35deb1c2464c3818b249c175e982c74484930fec39dec143 (unwrapped bytes: 71)
DEBUG Finished streaming stderr for f6f2ab82587f230f35deb1c2464c3818b249c175e982c74484930fec39dec143 (unwrapped bytes: 0)
  • I only find 96 matches, but given we launched 49 sessions, each with a stdout/err pair I was expected to get 98. We need to confirm how many execs were run - using tether.debug as a sanity check for how many processes were launched I search for Launched command with pid and find 50 matches (one container process, 49 exec processes) as expected.
    Figuring out which exec session is missing - we run a comparison of the exec sessions in tether vs the portlayer log. We take a fragment of the portlayer log from the container creation, through to the last reported update for the container (extraconfig id: 84ce478f0) and the final inspect after that:
$ cat tether.debug | awk -F '|' '/Encoding result of launch for session .* under key: guestinfo.vice.\/execs/ {print$2}' | sed 's/@.*//' | sort -u > execs.list
$ cat pl-fragment.log | awk '/DEBUG Finished streaming stdout for/ {print$10}' | sort -u > finished.list
$ diff execs.list finished.list
27d26
< 8adc6e0173de182343cd6d991187173f04e30f67e82a7734f62dd863765d2b79

Checking tether.debug (which is awkward because it doesn't report opIDs for differentiating between threads) we see the following ids for the stdout/err/in MultiReader/MultiWriter - we expect these to have streams from an ssh session added to the when the attach succeeds:

Nov  2 2018 05:03:43.156Z DEBUG Initializing session 8adc6e0173de182343cd6d991187173f04e30f67e82a7734f62dd863765d2b79
Nov  2 2018 05:03:43.159Z DEBUG [BEGIN]  [main.(*operations).SessionLog:85] configure session log writer
Nov  2 2018 05:03:43.160Z WARN  unable to log more than one session concurrently to persistent logging
Nov  2 2018 05:03:43.162Z DEBUG [0xc4202b3260] created multiwriter
Nov  2 2018 05:03:43.164Z DEBUG [0xc4202b3320] created multiwriter
Nov  2 2018 05:03:43.165Z DEBUG [ END ]  [main.(*operations).SessionLog:85] [6.139973ms] configure session log writer
Nov  2 2018 05:03:43.166Z DEBUG [0xc420481740] created multireader

Searching for the MultiWriters we get the following:
stdout - if there are no writers present then pkg/dio/writer will skip writing entirely, leading to the log message of write "" to 0 writers - this is logging what was relayed to the added writers, not what was passed into the Write call:

	Line 5135: Nov  2 2018 05:03:43.162Z DEBUG [0xc4202b3260] created multiwriter
	Line 10755: Nov  2 2018 05:04:07.367Z DEBUG [0xc4202b3260] write "" to 0 writers (err: <nil>)
	Line 10779: Nov  2 2018 05:04:07.443Z DEBUG [0xc4202b3260] added writer - now 1 writers
	Line 10780: Nov  2 2018 05:04:07.444Z DEBUG [0xc4202b3260] Writer 0 [0xc42038a9c0]
	Line 10808: Nov  2 2018 05:04:07.500Z DEBUG [0xc4202b3260] Close on writers

where as, in contrast, we have the following for a successful exec (stdout):

	Line 834: Nov  2 2018 05:03:26.809Z DEBUG [0xc4202b3170] created multiwriter
	Line 891: Nov  2 2018 05:03:28.130Z DEBUG [0xc4202b3170] added writer - now 1 writers
	Line 892: Nov  2 2018 05:03:28.131Z DEBUG [0xc4202b3170] Writer 0 [0xc42038b980]
	Line 927: Nov  2 2018 05:03:28.195Z DEBUG [0xc4202b3170] write "bin\ndev\netc\nhome\nlib\nlost+found\nmnt\nproc\nroot\nrun\nsbin\nsys\ntmp\nusr\nvar\n" to 1 writers (err: <nil>)
	Line 951: Nov  2 2018 05:03:28.246Z DEBUG [0xc4202b3170] Close on writers

Checking the portlayer we see that we timed out waiting for the interactive connection to complete:

Nov  2 2018 05:04:49.905Z DEBUG [BEGIN]  [vic/lib/portlayer/attach/communication.(*interaction).Unblock:227]
Nov  2 2018 05:04:49.905Z ERROR failed to unblock the other side: EOF
...
Nov  2 2018 05:04:53.522Z ERROR attach connector: Connection not found error for id:8adc6e0173de182343cd6d991187173f04e30f67e82a7734f62dd863765d2b79: context deadline exceeded

The EOF error when trying to unblock the connection is unusual - normally this would be due to the SSH connection having dropped, but there's no evidence of that having happened, no renegotiation of the connection.

Checking in the tether to see whether we received the unblock request we did, but again there's the unexpected EOF error:

Nov  2 2018 05:04:07.610Z INFO  Received UnblockReq for 8adc6e0173de182343cd6d991187173f04e30f67e82a7734f62dd863765d2b79
Nov  2 2018 05:04:07.612Z DEBUG Sending channel request reply true back
Nov  2 2018 05:04:07.613Z WARN  Failed replying to a channel request: EOF

Following this logic - we look for where we block waiting for the unblock request, and we find that we do not! There is no message about it for the 8adc6 session.
We check whether the session was configured to block, and it's not - runblock is set to false

	Line 9403: Nov  2 2018 05:04:02.384Z DEBUG GuestInfoSource: key: guestinfo.vice..execs|8adc6e0173de182343cd6d991187173f04e30f67e82a7734f62dd863765d2b79.runblock@non-persistent, value: "false", error: %!s(<nil>)
	Line 9404: Nov  2 2018 05:04:02.387Z DEBUG GuestInfoSource: key: guestinfo.vice./execs|8adc6e0173de182343cd6d991187173f04e30f67e82a7734f62dd863765d2b79/active@non-persistent, value: "true", error: %!s(<nil>)

The runblocking is set by a call to portalyer:attach.Bind() during the same set of calls which activates the session (and we can see the session is set to active in the keys just above):

Nov  2 2018 05:04:02.435Z DEBUG [BEGIN]  [vic/lib/portlayer/attach.Bind:26]
Nov  2 2018 05:04:02.435Z DEBUG Found a device with desired backing: &types.VirtualSerialPort{VirtualDevice:types.VirtualDevice{DynamicData:types.DynamicData{}, Key:9000, DeviceInfo:(*types.Description)(0xc4231e3560), Backing:(*types.VirtualSerialPortURIBackingInfo)(0xc422731d70), Connectable:(*types.VirtualDeviceConnectInfo)(0xc4231e3a00), SlotInfo:types.BaseVirtualDeviceBusSlotInfo(nil), ControllerKey:400, UnitNumber:(*int32)(0xc4227b5cdc)}, YieldOnPoll:true}
Nov  2 2018 05:04:02.435Z DEBUG Already in the desired state, (connected: true, proxyURI: telnet://10.158.214.94:2377)
Nov  2 2018 05:04:02.435Z DEBUG [ END ]  [vic/lib/portlayer/attach.Bind:26] [113.678µs] 

I attempted to confirm the value was set in vmware.log, but that file has been throttled.

I suspect that this is the only exec that's failed to return output - that's the behaviour expected based on what's observed in the logs, but the test needs updating to all explicit confirmation.

Tasks:

  • discover why the runblock state was not set to true for exec session 8adc6e0173
    • we could use the same mechanism as in this PR to log the actually command sent from endpointVM to VC. We may also need verbose logging on VC and/or ESX to capture what got applied there.
    • I suspect this is an example of guestinfo updates being dropped by ESX when there's a high rate of them as observed when the guest sets the launch status of exec sessions.
  • update test to continue processing exec returns, even if one fails - this makes for easier diagnostics
  • update test to inject index in environment variable for easier correlation between client and server

@hickeng hickeng removed their assignment Nov 3, 2018
@luwang-vmware
Copy link
Contributor

@hickeng
for the following request, I would like to double confirm with you that where do you mean to inject index? Do you mean to inject index in the docker exec process? such as using "docker exec -v index=** <container_id> sh -c 'echo $index; time /bin/ls'" to instead the existing one ?

  • update test to inject index in environment variable for easier correlation between client and server

@hickeng
Copy link
Member

hickeng commented Nov 5, 2018

@luwang-vmware I think you can use docker exec -e index=${idx} <container_id> /bin/ls - all we need is a way of correlating the ContainerCreate call with the failing attempt in the docker personality log. An environment variable should be effective without actually changing the behaviour of the test.

@luwang-vmware
Copy link
Contributor

luwang-vmware commented Nov 8, 2018

@hickeng
Updated the test scripts with the following requests. And the issue hit again in CI testing. Detail log is in https://ci-vic.vmware.com/vmware/vic/20326/7.
We can see that only the 29th interation of exec hit issue.

  • update test to continue processing exec returns, even if one fails - this makes for easier diagnostics
  • update test to inject index in environment variable for easier correlation between client and server

@luwang-vmware
Copy link
Contributor

The issue is hit again in recent CI testing, detail logs is in https://ci-vic.vmware.com/vmware/vic/20509/9.
From the logs, it shows only the 36th iteration of exec has the issue.

@renmaosheng
Copy link
Contributor

moving to 1.5.1 for continuous investigation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/portlayer/execution component/portlayer/interaction kind/defect Behavior that is inconsistent with what's intended priority/p1 source/ci Found via a continuous integration failure source/customer Reported by a customer, directly or via an intermediary team/foundation
Projects
None yet
Development

No branches or pull requests

10 participants