wecsvc stops working after a while #35

Open
bluedefxx opened this issue Apr 2, 2019 · 26 comments

Comments

bluedefxx commented Apr 2, 2019

Hello,
We are encountering strange behavior on a Windows 2012 Event Collector.
The server has > 8 vCPUs and 20GB RAM; monitoring does not show any specific usage peaks.
NXLog is used to fetch the logs and send them to a SIEM.

We are using part of the wef-subscriptions (~40) on more than 4,000 workstations (customized to filter at the source).

For some (unknown) reason the Event Collector randomly stops "working" after 2 days, 3 days, 7 days... By "stops working" I mean the service is still running but no events are coming into ForwardedEvents...

Regarding the deployed subscriptions, the only modifications we performed were:

  • Everything in ForwardedEvents
  • ~40 subscriptions
  • MaxItems set to 25 or 50 (depending on the subscription)
    25000

We also tried to apply some of the customization recommended in "Windows™ Event Forwarding (WEF) into ArcSight at Scale".

We tried to correlate this issue with user activity, but there does not seem to be any link; the last time it stopped sending logs was at 6 in the morning...

The analysis we performed showed the following behavior:

  • The subscription URL suddenly becomes inaccessible (404) and generates error 2150859027 on the client side (see the sketch after this list).
  • BUT WinRM and WSMan are still accessible from the client to the collector (tested with winrm and Test-WSMan).
  • The wecutil commands (wecutil es/gs/gr) no longer complete (the process never ends).
  • A deeper analysis using Process Explorer showed that all wecsvc.exe threads are in a "Lock"/Waiting state with 0% CPU and 0% RAM used.
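
On the client side, these errors land in the Eventlog-ForwardingPlugin operational channel, so here is a minimal sketch of the check we run on an affected source computer (assuming that default channel, which is enabled out of the box):

# Pull the most recent forwarding warnings/errors (e.g. 2150859027) on a source computer
Get-WinEvent -LogName 'Microsoft-Windows-Eventlog-ForwardingPlugin/Operational' -MaxEvents 50 |
    Where-Object { $_.LevelDisplayName -in 'Error','Warning' } |
    Select-Object TimeCreated, Id, Message |
    Format-List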

Sometimes we are able to restart the service, sometimes we need to kill it before restarting.

I have multiple questions on this:

  • Your configs use MaxItems set to 1; do we agree that performance should be better with a little buffering (as we are doing by setting it to 25 or 50)?

  • One advantage of multiple subscriptions is the ability to manage ACLs separately, but it also creates 4000*40 registry keys to maintain state in the registry; could this have any impact?

  • Regarding the 40 dedicated subscriptions, couldn't this have an impact on the parallel network connections initiated by the sources? (I mean opening 40 parallel TCP sockets instead of only 1, times 4000?)

This is a very interesting topic, as large-environment deployment, tuning and troubleshooting are not well (not to say not at all) documented by MS...

Any feedback, ideas, and answers on this issue would be more than appreciated and would, I assume, help the community!!!

Kind regards,

novaksam commented Apr 2, 2019

I had/am having similar issues; about 2.5k stations and occasionally the collector just stops; the process is still running, but it's not doing anything. I've attempted to work around it with the registry changes in my PR (#34) and some additional tweaking after that but I still seem to be encountering the issue.

rewt1 commented Apr 2, 2019

Thank you for this input. Did you ever experiment with the GPO feature to control events per second for each source? If yes, does it have any impact on performance on the customer side?

Did you use MaxItems set to 1?
What Windows version is your WEC running?

novaksam commented Apr 2, 2019

@rewt1 I believe I set my max items per second to 10 and I don't recall seeing customer side issues. I haven't modified any of the delivery options, though I did add some suppressions to a couple of rules.

We're running 2016 as our collector (VMWare, 8 CPU, 16GB, SSD SAN); I actually have three collectors set up, one for servers (about 350 clients), one for DCs (9 clients) and one for workstations (+2.5K clients), and the workstation one is the only one exhibiting this issue.

Edit: Just a note: we are utilizing Winlogbeat to forward events from these subscriptions over to our ELK cluster.

jokezone commented Apr 2, 2019

Can confirm this same issue was happening on our WEC servers. It happened on WECs with thousands of clients and WECs with just a few hundred. All our event subscriptions are consolidated to a single one for workstations, member servers, and DCs. During troubleshooting, we found that running gpupdate /force from the affected WEC would jumpstart the process and it would begin logging forwarded events again without a reboot. Running gpupdate again would cause it to go back into that funky state (sometimes). In the end, we doubled down on memory/CPU and haven't seen the issue come back. It's definitely buggy. But hey, it's also free ;)

bluedefxx commented Apr 3, 2019

Thanks @jokezone for your feedback.
On my side, I am wondering about the following:

  • Considering that each subscription generates an internal ID which is later used in a POST request, and that each subscription has different settings for pushing events (MaxItems/MaxBatchItems), a separate connection has to be opened each time a delivery is due.
    This behavior is partially confirmed by the fact that if we look at parallel connections we can see > 150000 entries... (40x4000)

I am wondering if it would be possible to lower the number of parallel connections by "aggregating" the different subscriptions. I am going to run tests to see if it reduces the volume of parallel connections on the WEC servers.
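
Something like the following sketch should give the per-source connection count on the collector itself (assuming the default HTTP listener on port 5985):

# Established WinRM connections grouped by remote address, busiest sources first
Get-NetTCPConnection -LocalPort 5985 -State Established |
    Group-Object -Property RemoteAddress |
    Sort-Object -Property Count -Descending |
    Select-Object -First 20 -Property Name, Count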

Any other ideas are welcome !

novaksam commented Apr 3, 2019

@jokezone In my instance I don't think it's memory related (even though I've seen the process go as high as 6GB), but if I go higher in CPU I'm gonna start hitting NUMA boundaries on our hypervisors. I typically notice that during high intake the CPU is maxed, but once the subscriptions seem to catch up the CPU hovers around 60%, so I'm not 100% sure what to make of that.

Another thing I notice with high intake is that the WinRM queue gets full (C:\Windows\System32\LogFiles\HTTPERR), but the queue size for WinRM is hard-coded at 1000, or at least I haven't found a way to increase it.
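
A rough sketch of how the overflow can be quantified, assuming HTTP.sys writes its default reason phrases (e.g. QueueFull) to the usual HTTPERR location:

# Count request-queue overflows recorded by HTTP.sys
Select-String -Path 'C:\Windows\System32\LogFiles\HTTPERR\*.log' -Pattern 'QueueFull' |
    Measure-Object |
    Select-Object -ExpandProperty Count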

@bluedefxx I like the idea of combining some of the subscriptions; I actually utilize the additional channels (and wrote two of my own) so that I can tag different types of events within ELK for easier searching. But if we were to combine the subscriptions that would otherwise feed into the same channel (Authentication is one of them), then both those that utilize the custom channels and those that write to ForwardedEvents could reduce their subscription counts (and the subsequent load on the registry + WinRM) with no change in logging behavior.

Edit: I've also played around with the idea of setting up a fourth collector for specific subscriptions, figuring that if I specify multiple collectors via GPO they'll report the requested subscriptions to those collectors; a slight increase in load on the client, but a potential massive decrease in load on a single server. We're playing a balancing game with VM resources ATM, so I haven't pursued this idea yet, as I wanted to see what I could figure out with a single box.

SpencerLN commented Apr 3, 2019

I opened a case with Microsoft Premier Support due to similar issues to those others have reported here. We consistently see stability issues, with clients randomly failing to send events or wecsvc stopping receiving events, when using the default delivery settings from the subscriptions provided here. I have also tried tweaking the delivery settings with different options, and none of the changes improved performance significantly.

Microsoft's conclusion after analyzing multiple memory dumps, traces, event logs, etc. was that the issue is due to the number of subscriptions on each Event Collection server:

Each client will make a connection for each subscription at the MaxLatency interval. Each persists for about 3.5 minutes and will consume memory for this lifetime.
In the example above we have 50 subscriptions for 443 clients, you can imagine how many connections are being handled.
Our first recommendation is to merge some of those subscriptions by adding more <Select Path> items in the same <QueryList>
You should be able to at least cut the number of subscriptions to less than 10 to have good performances with 443 clients.
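
(For a rough sense of the scale they describe: 443 clients x 50 subscriptions is roughly 22,000 possible connections, each living ~3.5 minutes, so with a short MaxLatency most of them can be open at the same time.)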

I haven't had time to implement their recommendation yet, and am hesitant to do so because we are currently using the additional channels for tagging within ELK and for easier filtering within Winlogbeat as well.

novaksam commented Apr 3, 2019

@SpencerLN You spent money for this? Nice.

It's just odd that I only seem to have the issue on my large collector; I mean, if that's the 'truth' from MS, perhaps my thinking of having additional collectors to separate subscriptions would ease the burden. I think combining the subscriptions that would otherwise write to the same channel might be the most 'efficient' thing to do... Now if only it didn't take 25 minutes for the event subscription GUI to load :)

novaksam commented Apr 3, 2019

I'm going to try combining the authentication subscriptions into a single, combined one.

Here's the xpath I'm going to use:

<QueryList>
  <!-- Authentication -->
  <Query Id="0" Path="Security">
    <!-- 4624: An account was successfully logged on. -->
    <!-- 4625: An account failed to log on. -->
    <!-- 4626: User/Device claims information. -->
    <Select Path="Security">*[System[(EventID &gt;=4624 and EventID &lt;=4626)]]</Select>
    <!-- 4634: An account was successfully logged off. -->
    <!-- 4647: User initiated logoff. -->
    <!-- 4649: A replay attack was detected. -->
    <!-- 4672: Special privileges assigned to a new logon, administrative logins -sa, -ada, etc. -->
    <!-- 4675: SIDs were filtered. -->
    <Select Path="Security">*[System[(EventID=4634 or EventID=4647 or EventID=4649 or EventID=4672 or EventID=4675)]]</Select>
    <!-- 4774: An account was mapped for logon. -->
    <!-- 4775: An account could not be mapped for logon. -->
    <!-- 4776: The computer attempted to validate the credentials for an account. -->
    <!-- 4777: The domain controller failed to validate the credentials for an account. -->
    <!-- 4778: A session was reconnected to a Window Station. -->
    <!-- 4779: A session was disconnected from a Window Station. -->
    <Select Path="Security">*[System[(EventID &gt;=4774 and EventID &lt;=4779)]]</Select>
    <!-- 4800 The workstation was locked. -->
    <!-- 4801 The workstation was unlocked. -->
    <!-- 4802 The screen saver was invoked. -->
    <!-- 4803 The screen saver was dismissed. -->
    <Select Path="Security">*[System[(EventID &gt;=4800 and EventID &lt;=4803)]]</Select>
    <!-- 4964: Special groups have been assigned a new logon. -->
    <Select Path="Security">*[System[(EventID=4964)]]</Select>
    <!-- 5378 The requested credentials delegation was disallowed by policy. -->
    <Select Path="Security">*[System[(EventID=5378)]]</Select>
    <!-- Suppress SECURITY_LOCAL_SYSTEM_RID A special account used by the OS, noisy -->
    <Suppress Path="Security">*[EventData[Data[1]="S-1-5-18"]]</Suppress>
  </Query>
  <!-- Lockouts -->
  <Query Id="1" Path="Security">
    <!-- For Domain Accounts event is created on DC-->
    <!-- For Local Accounts event is created locally-->
    <!-- 4740: Account Lockouts -->
    <Select Path="Security">*[System[Provider[@Name='Microsoft-Windows-Security-Auditing'] and (Level=4 or Level=0) and EventID=4740]]</Select>
  </Query>
  <!-- NTLM auditing -->
  <Query Id="2" Path="Microsoft-Windows-Authentication/AuthenticationPolicyFailures-DomainController">
    <Select Path="Microsoft-Windows-Authentication/AuthenticationPolicyFailures-DomainController">*[System[Provider[@Name='Microsoft-Windows-NTLM']]]</Select>
    <Select Path="Microsoft-Windows-Authentication/ProtectedUserFailures-DomainController">*[System[Provider[@Name='Microsoft-Windows-NTLM']]]</Select>
    <Select Path="Microsoft-Windows-NTLM/Operational">*[System[Provider[@Name='Microsoft-Windows-NTLM']]]</Select>
  </Query>
  <!-- Explicit credential auditing -->
  <Query Id="3" Path="Security">
    <!-- Logging on with Explicit Credentials -->
    <!-- taskhost.exe (scheduled tasks) is filtered as this is normal behavior. -->
    <Select Path="Security">*[System[Provider[@Name='Microsoft-Windows-Security-Auditing'] and (Level=4 or Level=0) and EventID=4648]] and *[EventData[Data[@Name='ProcessName']!='C:\Windows\System32\taskhost.exe']]</Select>
  </Query>
  <!-- Kerberos auditing -->
  <Query Id="4" Path="Security">
    <!-- 4768 - A Kerberos authentication ticket (TGT) was requested -->
    <!-- 4769 - A Kerberos service ticket was requested -->
    <!-- 4770 - A Kerberos service ticket was renewed -->
    <!-- 4771 - A Kerberos pre-authentication failed. -->
    <!-- 4772 - A Kerberos authentication ticket request failed. -->
    <!-- 4773 - A Kerberos service ticket request failed. -->
    <Select Path="Security">*[System[(EventID=4768 or EventID=4769 or EventID=4770 or EventID=4771 or EventID=4772 or EventID=4773)]]</Select>
  </Query>
</QueryList>

SpencerLN commented Apr 4, 2019

@novaksam I will be very interested in how your test of combining subscriptions goes. We have a rather large WEC footprint (~20 servers) and it has been a major hassle to manage, so we are starting to look at other options given the apparent instability of WEF at a higher scale.

I am really hoping to be able to get WEF working, which resulted in opening the case with MSFT. I guess time will tell if their recommendation is able to make a big enough impact to make our client to WEC ratio work.

ghost commented Apr 4, 2019

Just wondering if you guys set the refresh interval for the subscription manager GPO?

Server=http://<FQDN>:5985/wsman/SubscriptionManager/WEC,Refresh=60

This sets the frequency in seconds for how often the source computers query the collector for subscription updates. My environment is not nearly as large, but I have this set and have not experienced a stoppage in event forwarding.
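
If it helps anyone confirm what their clients actually picked up, this sketch reads the policy back on a source computer (assuming the standard EventForwarding policy key):

# The SubscriptionManager strings (including any Refresh=... suffix) delivered by the GPO
Get-ItemProperty -Path 'HKLM:\SOFTWARE\Policies\Microsoft\Windows\EventLog\EventForwarding\SubscriptionManager'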

novaksam commented Apr 4, 2019

@SpencerLN Well it appears to have survived the night, and our scheduled 7AM wakeup, so I'd call that progress. Unfortunately, the only really 'combinable' subscriptions are for authentication (5: account lockouts, authentication, explicit-credentials, kerberos and NTLM), Windows diags (2: Event-log-diagnostics, windows diagnostics) and exploit guard (4), so this strategy can only get you so far (though it will decrease the number of active subscriptions by 8 (4 + 1 + 3) overall).

I also have some suppressions I added to object manipulation because they were just too noisy.

    <Suppress Path="Security">*[EventData[Data[@Name='ProcessName'] and (Data ='C:\Windows\System32\svchost.exe')]]</Suppress>
    <Suppress Path="Security">*[EventData[Data[@Name='ProcessName'] and (Data ='C:\Windows\System32\Services.exe')]]</Suppress>
    <Suppress Path="Security">*[EventData[Data[@Name='ProcessName'] and (Data ='C:\Windows\System32\SearchIndexer.exe')]]</Suppress>
    <Suppress Path="Security">*[EventData[Data[@Name='ProcessName'] and (Data ='C:\Program Files (x86)\Google\Chrome\Application\chrome.exe')]]</Suppress>
    <Suppress Path="Security">*[EventData[Data[@Name='ProcessName'] and (Data ='C:\Program Files (x86)\Adobe\Acrobat DC\Acrobat\AcroCEF\AcroCEF.exe')]]</Suppress>
    <Suppress Path="Security">*[EventData[Data[@Name='ProcessName'] and (Data ='C:\Program Files (x86)\Adobe\Acrobat Reader DC\Reader\AcroCEF\RdrCEF.exe')]]</Suppress>

@dstidham617 I think I have mine set to 180.

@SpencerLN

@dstidham617 I do not actually have the Refresh interval configured. Does anyone know what the default Refresh value is?

ghost commented Apr 6, 2019

@SpencerLN I'm not certain, but the example I gave is straight from the GPO definition so I'd presume 60 seconds.

novaksam commented Apr 9, 2019

@SpencerLN Just reporting back; simply combining the authentication subscriptions, while looking positive at first, did not resolve my problems. It seems like heavy intake periods are still killing it, even after I increased the batch items on object-manipulation to 50.

I can still try to revise the exploit-guard subscriptions, since those are some heavy hitters too, but it's looking more and more like I'll have to move some subscriptions to another server.

novaksam commented May 6, 2019

Just letting everyone know, I undid my combined subscription because I believe it was preventing events from being collected; I was seeing a lot of Kerberos and not much else, which suggested to me that only the last 'Query' in the XPath is actually executed. I could be wrong about this and just be misattributing one thing to another, but I wanted to share (a way to sanity-check this locally is sketched below).
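
For what it's worth, here is a sketch of how a combined QueryList could be sanity-checked locally before pushing it to a subscription (the file name is hypothetical and holds the XML posted above; run it on a machine that has all the referenced channels):

# Run the combined query against the local logs and see which event IDs actually come back
[xml]$query = Get-Content -Raw .\combined-auth-subscription.xml
Get-WinEvent -FilterXml $query -MaxEvents 500 |
    Group-Object -Property Id |
    Sort-Object -Property Count -Descending |
    Select-Object -Property Name, Count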

@novaksam

I'm going on vacation in a week, so since I'm still dealing with this problem, I wrote a PowerShell script and set it to run every 5 minutes with Task Scheduler:

# Every once in a while, the process hangs. This script aims to remedy that by restarting the process if it finds that it's not processing.
# Get the service processID
$ServicePID = (get-wmiobject win32_service | where { $_.name -eq 'Wecsvc'}).processID
$ProcessInfo1 = (get-wmiobject Win32_PerfFormattedData_PerfProc_Process| Select-Object -Property Name, PercentProcessorTime, IDProcess | Where-Object IDProcess -eq $ServicePID).PercentProcessorTime
Write-Output "First poll: $ProcessInfo1"
sleep 1
$ProcessInfo2 = (get-wmiobject Win32_PerfFormattedData_PerfProc_Process| Select-Object -Property Name, PercentProcessorTime, IDProcess | Where-Object IDProcess -eq $ServicePID).PercentProcessorTime
Write-Output "Second poll: $ProcessInfo2"
sleep 1
$ProcessInfo3 = (get-wmiobject Win32_PerfFormattedData_PerfProc_Process| Select-Object -Property Name, PercentProcessorTime, IDProcess | Where-Object IDProcess -eq $ServicePID).PercentProcessorTime
Write-Output "Third poll: $ProcessInfo3"
sleep 1
$ProcessInfo4 = (get-wmiobject Win32_PerfFormattedData_PerfProc_Process| Select-Object -Property Name, PercentProcessorTime, IDProcess | Where-Object IDProcess -eq $ServicePID).PercentProcessorTime
Write-Output "Fourth poll: $ProcessInfo4"
sleep 1
$ProcessInfo5 = (get-wmiobject Win32_PerfFormattedData_PerfProc_Process| Select-Object -Property Name, PercentProcessorTime, IDProcess | Where-Object IDProcess -eq $ServicePID).PercentProcessorTime
Write-Output "Fifth poll: $ProcessInfo5"

$ProcessInfo = ($ProcessInfo1,$ProcessInfo2,$ProcessInfo3,$ProcessInfo4,$ProcessInfo5 | Measure-Object -Average).Average

Write-Output "Average: $ProcessInfo"

if ($ServicePID -ne 0 -and $ProcessInfo -eq 0) {
    Write-Output "Killing process"
    Stop-Process -Id $ServicePID -Force 
}

While ($ServicePID -ne 0 -and $ProcessInfo -eq 0) {
    sleep 5
    $ServicePID = (get-wmiobject win32_service | where { $_.name -eq 'Wecsvc'}).processID
}

if ($ServicePID -eq 0) {
    Write-Output "Starting service"
    Start-Service -Name Wecsvc
}

JPvRiel commented May 24, 2019

Thanks to all who've shared their experiences with this. I'll try to add a bit of my own.

A deeper analysis using ProcessExplorer shown that all wecsvc.exe process threads are in a "Lock"/Waiting status with 0% CPU and 0% RAM used.

Yes, also observed this, especially for collectors handling lots of workstations / member servers.

One thing I've not seen mentioned yet (unless I missed it) is the option to tune WinRM, which wecsvc depends on. Take a look at WinRM Service Default Configuration Settings.

MaxConnections defaults to 300 (or just 25). My guess is that with a lot of WEF clients, this limit might get in the way.

In my experience, our WEF server with lots of clients behaved poorly with this default, and even remote PowerShell (which also uses WinRM) was unresponsive. We never ran out of CPU or RAM, but did see the locked-thread issue. Increasing the MaxConnections limit allowed us to move on to the next problem (the system ran out of RAM).
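
For anyone wanting to check before changing anything, the current limit can be read off the collector like this (a sketch; either form should show the effective MaxConnections):

winrm get winrm/config/service
# or through the WSMan drive:
Get-ChildItem WSMan:\localhost\Service | Where-Object Name -eq 'MaxConnections'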

I'm not sure if each subscription requires a connection, or if clients are smart enough to use one connection for events from multiple subscriptions.

@novaksam & @SpencerLN, also using winlogbeat here, and we ran into a limitation there as well. We had a single Forwarded Events default channel collecting everything and found that, because of the way the winlogbeat agent uses the Windows event API, it can't scale beyond a single thread. On our VMs, it couldn't process beyond ~2000 events per second despite adding more vCPUs. In the winlogbeat config, we did a nasty hack of splitting it into multiple inputs by event IDs to force extra threads (one per input). Indeed, Palantir's approach of using multiple channels instead of a single forwarded events channel might well be a good way to get more input threads in winlogbeat.

@SpencerLN

@JPvRiel Thank you for the updated info. Can I ask what MaxConnections values you had success with? Did you hit some sort of upper limit on how high you could set it before it began having a negative impact?

JPvRiel commented Jun 3, 2019

@SpencerLN, TL;DR: I think it eventually ran out of memory... With some very rough math/assumptions at play, it seems to be ~2MB of overhead per log source computer for WinRM + Wecsvc. But there might also be some threading/lock contention issue at play (TBD).

I bumped WinRM up to 10000 connections max on servers with 2x subscriptions defined on each.

winrm set winrm/config/service '@{MaxConnections="10000"}'

Some success on one server, but not the other.

  • On a large site and VM (8 vCPU, 8GB memory) it works well.
    • About ~480 source computers (~960 subscriptions).
    • Peaks at ~1GB memory use for svchost.exe (~2MB per source computer).
    • Manages over 1000 EPS / 4MB/s data volume.
    • This VM used to end up unresponsive, but since increasing MaxConnections, it seems to hold up.
  • On an even bigger site and VM (8 vCPU, 16GB memory) it didn't scale well. It might have run out of memory...
    • Had about ~5500 source computers.
    • Did notice some clients complaining about connection issues to WinRM when fetching the subscriptions.
    • Peaks at ~11.5GB memory use for svchost.exe (the system seems to always have 2 to 3GB RAM available, so I thought there was still a bit of headroom, but apparently not, as I can see the page file usage creep up over time).
    • The system always struggled with bad event lag (events from source computers were sometimes over a day old).
    • Seemed not to collect more than 1000 EPS (which was suspiciously low compared to the other server).
      • I would have estimated it should be near 10000 EPS.
      • While it was stuck at only 1000 EPS most of the time, when a single log source had a big change I did see it running at 33000 EPS, so I doubt the issue is the number of events.
    • The server eventually becomes unresponsive.

So the limitation looks to be the number of subscribed computers and memory usage, not necessarily how many events come in.

I'm also guessing there might be some lock contention when handling too many connections. I've noticed this before the VM becomes unresponsive (can't remote PowerShell or RDP in):

(screenshot omitted)

From metricbeat, I can see that C:\Windows\system32\svchost.exe -k NetworkService (which I assume hosts wecsvc and WinRM) eventually eats up the server's RAM (RSS active), and the swap file steadily climbs to 16GB over time until the server grinds to a halt. So I'm waiting on getting more memory... But I can't tell whether it's WinRM or wecsvc using all the RAM, since there are no sub-process/thread-specific metrics (at least not with metricbeat).
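
One way to at least attribute the footprint to the right process is to resolve the svchost instance hosting Wecsvc and look at it directly (a sketch; with shared hosting the figure still includes every other service in that group):

# Find the svchost.exe that hosts Wecsvc and report its memory use
$svcPid = (Get-CimInstance Win32_Service -Filter "Name='Wecsvc'").ProcessId
Get-Process -Id $svcPid | Select-Object Id, WorkingSet64, PrivateMemorySize64
tasklist /svc /fi "PID eq $svcPid"   # shows which other services share that process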

Given core windows services like WinRM and Wecsvc run as shared process services in svchost.exe, it's hard to isolate their resource usage. I did attempt to test and isolate with sc.exe config Wecsvc type= own, but then wecutil reports this error:

> wecutil.exe es
wecutil.exe : Failed to open subscription enumeration. Error = 0x6b5.
    + CategoryInfo          : NotSpecified: (Failed to open ... Error = 0x6b5.:String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

The interface is unknown.

JPvRiel commented Jun 3, 2019

On the fact to have 40 dedicated subscriptions, can't this have any impact on parallel network connections from source initated ?

@bluedefxx, I checked, and as far as I can see, Windows eventing does re-use the same WinRM session for multiple subscriptions (despite having 2 subscriptions set up, I see only one connection for a given source computer), so splitting the config into multiple subscriptions doesn't seem to add to my connection limit issue.

rewt1 commented Jun 4, 2019

@JPvRiel thanks for this. I am answering on behalf of @bluedefxx; thank you so much for this great analysis.
We also performed some tests and concluded that the WinRM client does not reuse the same network connection for multiple subscriptions...
Indeed, we performed the following test:

  • 40 subscriptions on a source computer => netstat showed about 35 ESTABLISHED connections at one time...
  • when we limit it to 2 subscriptions => netstat showed 1 or 2 ESTABLISHED connections at a time...

The amazing thing with the MaxConnections setting is that it works without changing it, but stops after a while (sometimes 2 days, sometimes 1 week...) even though the limit of 300 (or even 10000) is reached multiple times. This assumption is based on the fact that we have seen more than 80,000 ESTABLISHED connections (counted with netstat on the WEC server itself) for more than 4,000 workstations... So count on me to give your setting a try, but I don't understand how it could have an impact. (Your new out-of-RAM error is, by the way, a good indicator of "something" changing.)

I don't understand why Microsoft does not improve the documentation on this WinRM parameter, as I am sure many people are impacted... and it clearly looks like a bug.

JPvRiel commented Jun 5, 2019

@rewt1, thanks for validating/correcting some of my assumptions (perhaps my test with just 2 subscriptions for 1 source was too limited, e.g. it might have only had events for one of the subscriptions and wasn't actively sending on the other...).

WinRM client do not reuse the same network connection

Given the above, the Palantir approach of splitting the collection into many subscriptions might be inefficient, from the perspective of possibly overwhelming WinRM with too many connections?

It'd be painful to refactor the project to combine subscriptions/channels (I was about to split my own and now won't).

we have seen +80 000 ESTABLISHED (counting with netstat and ESTABLISHED on the WecServer itself) for +4000 workstations

I've used this to inspect WinRM connections:

$WinRMPortConnectionStates = Get-NetTCPConnection -LocalPort 5985
$WinRMPortConnectionStates | Group-Object -Property State | Select-Object -Property Name, Count | Format-Table

The amazing thing with the MaxConnection settings is that it is working without that

If so, that implies MaxConnections isn't being applied as a limit for WinRM? Microsoft documented the setting as:

Specifies the maximum number of active requests that the service can process simultaneously. The default is 300.

So this is likely a bug in their WinRM service not applying the limit as documented.

peacand commented Aug 5, 2019

Hello guys,

As far as I am concerned, @novaksam's script solved my problem. Thank you 👍

BUT: if you use this script, make sure to isolate the Wecsvc service. By default it is shared with others such as "DNS Client" or "Network Location Awareness", so if you restart all these services at once you break the machine and crash the WinRM service. To isolate the wecsvc service:

sc config wecsvc type= own

Then restart the Windows Event Collector service.

As @JPvRiel said, if you do that, the WEC service stops working. This is due to HTTP.SYS URL ACLs, which break when you isolate the wecsvc service from svchost. It is well explained in this KB:

https://support.microsoft.com/en-us/help/4494462/events-not-forwarded-if-the-collector-runs-windows-server-2019-or-2016

As the KB explains, after isolating the wecsvc service from svchost you need to fix the ACLs with the following commands:

netsh http delete urlacl url=http://+:5985/wsman/ 
netsh http add urlacl url=http://+:5985/wsman/ sddl=D:(A;;GX;;;S-1-5-80-569256582-2953403351-2909559716-1301513147-412116970)(A;;GX;;;S-1-5-80-4059739203-877974739-1245631912-527174227-2996563517)
netsh http delete urlacl url=https://+:5986/wsman/
netsh http add urlacl url=https://+:5986/wsman/ sddl=D:(A;;GX;;;S-1-5-80-569256582-2953403351-2909559716-1301513147-412116970)(A;;GX;;;S-1-5-80-4059739203-877974739-1245631912-527174227-2996563517)
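
Before restarting anything, the rewritten ACLs can be double-checked with (a sketch, using the standard netsh query):

netsh http show urlacl url=http://+:5985/wsman/
netsh http show urlacl url=https://+:5986/wsman/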

After that, you need to restart both WinRM and Wecsvc and you should have a working and stable WEC service 😄

jokezone commented Aug 6, 2019

I just stumbled across this best practices for EventLog forwarding article from Microsoft. It touches on a lot of the topics discussed in this thread.

It recommends no more than 4k clients per collector server, whereas other guidance has suggested it could support 10k (with the right amount of resources).

novaksam commented Feb 9, 2022

Hey folks,

So I upgraded to Server 2022, and now the WEC service gobbles up memory after running for an extended period. I've updated my monitor script to account for this:

# Every once in a while, the process hangs. This script aims to remedy that by restarting the process if it finds that it's not processing.
# Get the service processID
$ServicePID = (get-wmiobject win32_service | where { $_.name -eq 'Wecsvc'}).processID
$ProcessInfo1 = (get-wmiobject Win32_PerfFormattedData_PerfProc_Process| Select-Object -Property Name, PercentProcessorTime, IDProcess | Where-Object IDProcess -eq $ServicePID).PercentProcessorTime
Write-Output "First poll: $ProcessInfo1"
sleep 1
$ProcessInfo2 = (get-wmiobject Win32_PerfFormattedData_PerfProc_Process| Select-Object -Property Name, PercentProcessorTime, IDProcess | Where-Object IDProcess -eq $ServicePID).PercentProcessorTime
Write-Output "Second poll: $ProcessInfo2"
sleep 1
$ProcessInfo3 = (get-wmiobject Win32_PerfFormattedData_PerfProc_Process| Select-Object -Property Name, PercentProcessorTime, IDProcess | Where-Object IDProcess -eq $ServicePID).PercentProcessorTime
Write-Output "Third poll: $ProcessInfo3"
sleep 1
$ProcessInfo4 = (get-wmiobject Win32_PerfFormattedData_PerfProc_Process| Select-Object -Property Name, PercentProcessorTime, IDProcess | Where-Object IDProcess -eq $ServicePID).PercentProcessorTime
Write-Output "Fourth poll: $ProcessInfo4"
sleep 1
$ProcessInfo5 = (get-wmiobject Win32_PerfFormattedData_PerfProc_Process| Select-Object -Property Name, PercentProcessorTime, IDProcess | Where-Object IDProcess -eq $ServicePID).PercentProcessorTime
Write-Output "Fifth poll: $ProcessInfo5"

$ProcessInfo = ($ProcessInfo1,$ProcessInfo2,$ProcessInfo3,$ProcessInfo4,$ProcessInfo5 | Measure-Object -Average).Average

$ProcessMemInfo1 = (get-wmiobject Win32_PerfFormattedData_PerfProc_Process| Select-Object -Property Name, WorkingSet, IDProcess | Where-Object IDProcess -eq $ServicePID).WorkingSet
# Set our upper cap at 10GB (WorkingSet is reported in bytes)
$ProcessMem = $ProcessMemInfo1 -ge 10000000000

Write-Output "Average: $ProcessInfo"

# Low CPU check
if ($ServicePID -ne 0 -and $ProcessInfo -eq 0) {
    Write-Output "Killing process"
    Stop-Process -Id $ServicePID -Force 
}

# High mem check - 10GB
if ($ServicePID -ne 0 -and $ProcessMemInfo1 -ge  10000000000 ) {
    Write-Output "Killing process"
    Stop-Process -Id $ServicePID -Force 
}

# Low CPU service up check - Need to wait for it to terminate
While ($ServicePID -ne 0 -and $ProcessInfo -eq 0) {
    sleep 5
    $ServicePID = (get-wmiobject win32_service | where { $_.name -eq 'Wecsvc'}).processID
}

# High mem service up check - Need to wait for it to terminate
While ($ServicePID -ne 0 -and $ProcessMemInfo1 -ge  10000000000) {
    sleep 5
    $ServicePID = (get-wmiobject win32_service | where { $_.name -eq 'Wecsvc'}).processID
}

# Now that the service is down, start it back up
if ($ServicePID -eq 0) {
    Write-Output "Starting service"
    Start-Service -Name Wecsvc
}
