Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

don't dump stack of parent when child process exits unexpectedly #4461

Closed
danmoseley opened this issue May 17, 2023 · 7 comments · Fixed by #4493
Closed

don't dump stack of parent when child process exits unexpectedly #4461

danmoseley opened this issue May 17, 2023 · 7 comments · Fixed by #4493
Assignees

Comments

@danmoseley
Copy link
Member

danmoseley commented May 17, 2023

I'm not sure of configuration, but I've seen this for several years: when a child test process terminates unexpectedly (perhaps because it hung and was killed) then the test infra dumps the stack trace of the parent process when the connection drops. The example below came from running some tests in Azure DevOps yesterday.

This stack trace ought to be suppressed. It is not useful to anyone except someone debugging the vstest parent process, as far as I know. Each time I see it, I start analyzing it until I remember that it is not relevant. In the case below, I guess Azure DevOps (?) printed "Please inspect the call stack above, if available, to get more information about where the exception originated from." which makes the confusion worse.

Suggestion: do not dump the callstack from the parent when there's an exception inside NotifyDataAvailable(). It causes more confusion than value. Instead, just print something like "The test process exited unexpectedly."

cc @RussKie

  Passed Microsoft.Extensions.Diagnostics.ResourceMonitoring.Windows.Test.WindowsUtilizationExtensionsTest.AddWindowsCounters_Adds_WindowsCounters_To_ServiceCollection [< 1 ms]
[xUnit.net 00:05:31.65] Microsoft.Extensions.Diagnostics.ResourceMonitoring.Tests: [Long Running Test] 'Microsoft.Extensions.Diagnostics.ResourceMonitoring.Test.ResourceUtilizationTrackerServiceTest.ResourceUtilizationTracker_WhenInitializedWithZeroSnapshots_ReportsHighCpuSpikesThenConvergeInFewCycles', Elapsed: 00:05:29
The active test run was aborted. Reason: Test host process crashed
Data collector 'Blame' message: The specified inactivity time of 6 minutes has elapsed. Collecting hang dumps from testhost and its child processes.
Data collector 'Blame' message: Dumping 5976 - testhost.

Test Run Aborted with error System.Exception: One or more errors occurred.
 ---> System.Exception: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host..
 ---> System.Exception: An existing connection was forcibly closed by the remote host.
   at System.Net.Sockets.NetworkStream.Read(Span`1 buffer)
   --- End of inner exception stack trace ---
   at System.Net.Sockets.NetworkStream.Read(Span`1 buffer)
   at System.Net.Sockets.NetworkStream.ReadByte()
   at System.IO.BinaryReader.Read7BitEncodedInt()
   at System.IO.BinaryReader.ReadString()
   at Microsoft.VisualStudio.TestPlatform.CommunicationUtilities.LengthPrefixCommunicationChannel.NotifyDataAvailable() in /_/src/Microsoft.TestPlatform.CommunicationUtilities/LengthPrefixCommunicationChannel.cs:line 90
   at Microsoft.VisualStudio.TestPlatform.CommunicationUtilities.TcpClientExtensions.MessageLoopAsync(TcpClient client, ICommunicationChannel channel, Action`1 errorHandler, CancellationToken cancellationToken) in /_/src/Microsoft.TestPlatform.CommunicationUtilities/TcpClientExtensions.cs:line 53
   --- End of inner exception stack trace ---.
Total tests: Unknown
     Passed: 115
    Skipped: 74
 Total time: 6.3094 Minutes
Results File: D:\a\_work\1\s\artifacts\TestResults\Release\Microsoft.Extensions.Diagnostics.ResourceMonitoring.Tests_net8.0_x64.trx
Html test results file : D:\a\_work\1\s\artifacts\TestResults\Release\Microsoft.Extensions.Diagnostics.ResourceMonitoring.Tests_net8.0_x64.html

The active Test Run was aborted because the host process exited unexpectedly. Please inspect the call stack above, if available, to get more information about where the exception originated from.
The test running when the crash occurred: 
Microsoft.Extensions.Diagnostics.ResourceMonitoring.Test.ResourceUtilizationTrackerServiceTest.ResourceUtilizationTracker_WhenInitializedWithZeroSnapshots_ReportsHighCpuSpikesThenConvergeInFewCycles
@microsoft-github-policy-service microsoft-github-policy-service bot added the needs-triage This item should be discussed in the next triage meeting. label May 17, 2023
@danmoseley
Copy link
Member Author

danmoseley commented May 17, 2023

Looks like this maybe means something liek when CommonResources.TestHostProcessCrashed

reason = CommonResources.TestHostProcessCrashed;

then don't include the exception in the TestRunCompleteEventArgs, or something.

var completeArgs = new TestRunCompleteEventArgs(null, false, true, exception, null, TimeSpan.Zero);

is the kind of thing I mean.

@MarcoRossignoli MarcoRossignoli added enhancement and removed needs-triage This item should be discussed in the next triage meeting. labels May 17, 2023
@MarcoRossignoli
Copy link
Contributor

MarcoRossignoli commented May 17, 2023

Hi @danmoseley I agree with you that error doesn't mean nothing pretty always only that the testhost process is not exited gracefully but the reason is usually a real crash or a forced crash for a requested dump.

@nohwnd what you think? I think it would be fine maybe log in the file but catch in some way and don't flow to the console.

cc: @jakubch1 @Evangelink

@RussKie
Copy link
Member

RussKie commented May 17, 2023

It could be great to get the source of the crash, if possible. E.g., if the test was killed by timeout.

@MarcoRossignoli
Copy link
Contributor

@RussKie in the output at the moment there's the reason...the issue is that is "mixed" with an useless and out of the context stack trace.

@danmoseley
Copy link
Member Author

Right, we're all in agreement that the output should include all possible info about the test process, including any callstack, exit code, whether it timed out, etc. We should suppress the internal details of the parent behavior, which is irrelevant and misleading, so that the customer relevant info is clearer.

@nohwnd
Copy link
Member

nohwnd commented May 24, 2023

@marco Yes.

@nohwnd nohwnd self-assigned this May 25, 2023
@nohwnd
Copy link
Member

nohwnd commented May 29, 2023

The error is raised here

Which calls StopOnError, which triggers .Disconnected event, that we subscribe to on ICommunicationEndPoint and trigger TestRequestSender.OnTestRunAbort, which forwards the error as part of HandleRunCompleted completeArgs.

We can comment the passing of error where it is collected and we won't get it to show up in the output. Not sure if that is the best place to do it, but ignoring the error later is also not great because different transports can raise different errors.

Here after the patch we get testhost error that was written to testhostprocess ErrorOutput, but not the transport error:
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants