Concurrent Exec tests intermittently fail with various errors #7410
Making p2 for now. If it appears more frequently we can raise it again.
We will not see this fail because the tests are currently disabled:
Now blocking a product go-live. My suggestion for addressing the collisions is to add operation batching. The first vSphere operation will be dispatched and, if slow, will cause the others to batch up behind it. The next operation issued will be the composite of the blocked operations.
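A minimal sketch of that batching idea in Go follows. The names (batcher, change, apply) are assumptions for illustration, not the actual VIC portlayer API; the point is only that changes arriving while a vSphere operation is in flight get merged and dispatched as a single composite operation.

```go
// Illustrative only: batcher, change and apply are hypothetical names, not
// VIC's real types. The first change dispatches immediately; changes that
// arrive while it is in flight are merged into the next composite operation.
package batch

import "sync"

type change struct {
	key   string
	value string
}

type batcher struct {
	mu       sync.Mutex
	inflight bool
	pending  []change
	apply    func([]change) error // e.g. a single vSphere reconfigure call
}

// Submit dispatches c, or queues it behind an in-flight operation. In this
// sketch queued callers return immediately; a real implementation would hand
// the composite operation's result back to every caller it covered.
func (b *batcher) Submit(c change) error {
	b.mu.Lock()
	if b.inflight {
		b.pending = append(b.pending, c)
		b.mu.Unlock()
		return nil
	}
	b.inflight = true
	b.mu.Unlock()

	batch := []change{c}
	for {
		err := b.apply(batch)

		b.mu.Lock()
		if err != nil || len(b.pending) == 0 {
			b.inflight = false
			b.mu.Unlock()
			return err
		}
		// Everything queued while we were busy becomes one composite operation.
		batch, b.pending = b.pending, nil
		b.mu.Unlock()
	}
}
```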
Batching mechanism is written, and now needs extracting from its in-place location into a distinct package, as this also applies to volume create, disk pull, and host affinity.
I have batching code that is somewhat effective. We still legitimately see concurrent modifications at the port layer level, but we should not be seeing them propagate all the way back to the user under comparatively light loads, yet we are. The following is needed to determine what's occurring:
Also noted from inspecting the logs: there are far too many task.Inspect operations happening. It may be that @matthewavery's work on #7969 will address this, but if not then I should track it down. Also need to add the group serialization logic, or we will rarely actually hit the batching path. Currently exercising it via a hardcoded delay, but that's not viable for anything except testing.
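A rough sketch of that group serialization, again with hypothetical names rather than VIC's real code: operations sharing a group key (for example, a single container) take a per-group lock, so later operations queue behind a slow in-flight one and actually reach the batching path instead of racing past it.

```go
// Illustrative only: the group key and serializer type are assumptions.
// Operations sharing a group run strictly one at a time.
package batch

import "sync"

type serializer struct {
	mu     sync.Mutex
	groups map[string]*sync.Mutex
}

func newSerializer() *serializer {
	return &serializer{groups: make(map[string]*sync.Mutex)}
}

// Do runs fn while holding the lock for group, creating the lock on first use.
func (s *serializer) Do(group string, fn func() error) error {
	s.mu.Lock()
	m, ok := s.groups[group]
	if !ok {
		m = &sync.Mutex{}
		s.groups[group] = m
	}
	s.mu.Unlock()

	m.Lock()
	defer m.Unlock()
	return fn()
}
```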
I've added #7370 as a dependency because there's something in that space needed to correctly render the opid parent/child relationship in addition to the parentID being passed via swagger. These two issues are added as dependencies as untangling the cross-component concurrent operations is too unwieldy for effective progress in its current form. I am going to:
Having paused most of this work to complete #7969, I've just picked up some testing again and observed a failure to detect exec of an exec with the palliative locking in place. It looks as though we're performing a reload at the same time as handleSessionExit. We have the following, which I presume is a mix of two threads (the initializing session blocking on the lock held by handleSessionExit):
Updating the child reaper to take
Moving this out of scope for 1.4.1 now, as the mitigation work has merged in the branches noted above (#8101). I've put it into 1.4.2 simply to force discussion about whether there is additional work that can/should be done in a patch release.
@lcastellano: Is this completely fixed, or just better mitigated? Is there additional work planned for this? (If not, perhaps we should lower the priority and move back to the backlog.)
https://ci-vic.vmware.com/vmware/vic/19877/7
https://ci-vic.vmware.com/vmware/vic/19947/7 (re-occurred in 19947).
@zjs it's just better mitigated. No additional work planned at this time for 1.4.3. Moving to backlog. @renmaosheng I'm going to defer any investigation for the prior two comments as we were seeing odd network-level issues over this timeframe. If it occurs again we will open a dedicated issue for the output, as it's not part of the same symptom set covered by this issue.
@hickeng thanks, will file a new issue once we see it happen in the future.
Hi George, we have observed several failures of the 1-38-Docker-Exec -> Concurrent Simple Exec test (https://ci-vic.vmware.com/vmware/vic/20001/7): some of the 'exec ls' calls do not return content, but the return value is zero. Is this the same issue you are investigating in the PR? The current test starts 50 processes and the failure is intermittent. Do you think it is OK to move the test to the scenario suite rather than ci-integration, so it does not block build generation?
Intermittently hitting the "'' does not contain 'bin'" failure in the 1-38-Docker-Exec -> Concurrent Simple Exec test. The latest result is in https://ci-vic.vmware.com/vmware/vic/20267/.
Triage of the prior comment, #7410 (comment)
Starting point:
Locating the opid:
As such we can do the following:
From tether.debug - found the exec ID.
From the portlayer, looking for the 0 byte return:
Checking tether.debug (which is awkward because it doesn't report opIDs for differentiating between threads) we see the following IDs for the stdout/err/in MultiReader/MultiWriter - we expect these to have streams from an ssh session added to them when the attach succeeds:
Searching for the MultiWriters we get the following:
whereas, in contrast, we have the following for a successful exec (stdout):
Checking the portlayer we see that we timed out waiting for the interactive connection to complete:
Checking in the tether to see whether we received the unblock request: we did, but again there's the unexpected EOF error:
Following this logic - we look for where we block waiting for the unblock request, and we find that we do not! There is no message about it for the 8adc6 session.
The runblocking is set by a call to portlayer:attach.Bind() during the same set of calls that activates the session (and we can see the session is set to active in the keys just above):
I attempted to confirm the value was set in vmware.log, but that file has been throttled. I suspect that this is the only exec that failed to return output - that's the behaviour expected based on what's observed in the logs, but the test needs updating to allow explicit confirmation. Tasks:
@hickeng
@luwang-vmware I think you can use
@hickeng
The issue was hit again in recent CI testing; detailed logs are in https://ci-vic.vmware.com/vmware/vic/20509/9.
Moving to 1.5.1 for continued investigation.
After PR #6787, exec has become more stable in the concurrent and container-shutdown paths. However, it still suffers from some race-condition-style failures. This has mainly been seen in the CI environment and has passed locally (consecutively) against a high-resource, complex VSAN deployment. Below is a short catalog of the intermittent failures that have been seen:
Simple Concurrent Exec:
Feb 26 2018 16:27:06.142Z ERROR op=301.310: CommitHandler error on handle(c13c27abff7908e6146498325fa944c3) for 46bec94ca67b332d0f924df4f4d7c84168693b876a8aa15658050b2e3b2bf46e: The operation is not allowed in the current state.
Exec During Poweroff Of A Container Performing A Long Running Task
Exec During Poweroff Of A Container Performing A Short Running Task
REFERENCE LOGS:
Exec-CI-Failure-1-16518.zip
Exec-CI-Failure-2-16515.zip
Exec-Failure-3-FROM-FULL-CI.html.zip
Exec-Failure-3-FROM-FULL-CI-16510.zip
Currently using the following for basic concurrency testing:
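The snippet referred to above was not captured here. Purely as an illustration of that kind of check, a minimal Go sketch might launch a number of concurrent `docker exec <id> ls` calls and flag any that exit zero but return no output; the container name and worker count below are placeholders.

```go
// Illustrative sketch only, not the script referenced above. Runs several
// concurrent "docker exec <container> ls" commands and reports any that exit
// successfully but produce no output (the failure mode described in this issue).
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"sync"
)

func main() {
	const (
		container = "test-container" // placeholder container ID/name
		workers   = 50
	)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			out, err := exec.Command("docker", "exec", container, "ls").CombinedOutput()
			if err != nil {
				fmt.Printf("exec %d failed: %v\n", n, err)
				return
			}
			// Exit code was zero; explicitly confirm we actually got output.
			if !strings.Contains(string(out), "bin") {
				fmt.Printf("exec %d returned no 'bin' in output: %q\n", n, out)
			}
		}(i)
	}
	wg.Wait()
}
```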
TODO: