Dual Edge TPU Adapter Causing PCI issues and BSOD Server 2022 #50

Open
Bachperson opened this issue Dec 27, 2023 · 18 comments

Comments

@Bachperson

Hello, I recently got a Dual TPU adapter. I plugged it in, installed the drivers, and it worked great in my Lenovo SR250 blade server. I am using it for AI detection in Blue Iris. Both TPUs were detected, and I checked both TPU temperatures and everything seemed good (in the 30-40 °C range). The server ran for about 5 minutes and then had a BSOD. I checked the seating of both the TPU and the PCIe slot and blew everything out with air to make sure there was no dust or contamination. I booted the server back up, but after some time it did the same thing. Here is a log dump of what my server reports as erroring out. Any help would be appreciated, because when it works, it works amazingly.

0 Power FQXSPPW0008I Host Power has been turned off. December 27, 2023 2:18:13 PM
1 System FQXSPPW0009I Host Power has been Power Cycled. December 27, 2023 2:17:38 PM
2 System FQXSPIO0011N An Uncorrectable Error has occurred on CPUs. December 27, 2023 2:17:05 PM
3 Memory FQXSFMA0006I Unqualified DIMM 1 has been detected, the DIMM serial number is 1AA6ECDC-V20. December 27, 2023 2:04:08 PM
4 Memory FQXSFMA0006I Unqualified DIMM 2 has been detected, the DIMM serial number is 1AADD8FF-V20. December 27, 2023 2:04:08 PM
5 Memory FQXSFMA0006I Unqualified DIMM 3 has been detected, the DIMM serial number is 1AAD66FB-V20. December 27, 2023 2:03:58 PM
6 Memory FQXSFMA0006I Unqualified DIMM 4 has been detected, the DIMM serial number is 1AA6EBA8-V20. December 27, 2023 2:03:58 PM
11 Disks FQXSPSD0000I The M2 Drive has been added. December 27, 2023 2:02:48 PM
12 Power FQXSPPW0008I Host Power has been turned off. December 27, 2023 1:26:26 PM
13 System FQXSPPW0009I Host Power has been Power Cycled. December 27, 2023 2:11:11 AM
14 System FQXSPPW0009I Host Power has been Power Cycled. December 27, 2023 2:05:32 AM
15 System FQXSPIO2006I System ThinkSystem SR250 has recovered from an NMI. December 27, 2023 2:01:17 AM
16 System FQXSPPW0009I Host Power has been Power Cycled. December 27, 2023 2:00:51 AM
17 System FQXSPPW0009I Host Power has been Power Cycled. December 27, 2023 2:00:36 AM
18 System FQXSPIO0006N A software NMI has occurred on system ThinkSystem SR250. December 27, 2023 1:58:59 AM
19 System FQXSPIO0015M Fault in slot 2 on system ThinkSystem SR250. December 27, 2023 1:58:56 AM
20 System FQXSFIO0010M An Uncorrectable PCIe Error has Occurred at Bus 0000 Device 01 Function 00. The Vendor ID for the device is 8086 and the Device ID is 1901. The Physical slot number is 2. December 27, 2023 1:58:53 AM
21 System FQXSFIO0010M An Uncorrectable PCIe Error has Occurred at Bus 0000 Device 01 Function 00. The Vendor ID for the device is 8086 and the Device ID is 1901. The Physical slot number is 2. December 27, 2023 1:58:52 AM
22 System FQXSPIO2015I Fault condition removed on slot 2 on system ThinkSystem SR250. December 27, 2023 1:58:51 AM
23 System FQXSPPW0009I Host Power has been Power Cycled. December 27, 2023 12:39:19 AM
24 System FQXSPIO2006I System ThinkSystem SR250 has recovered from an NMI. December 27, 2023 12:19:37 AM
25 System FQXSPPW0009I Host Power has been Power Cycled. December 27, 2023 12:19:09 AM
26 System FQXSPPW0009I Host Power has been Power Cycled. December 27, 2023 12:18:53 AM
27 System FQXSPIO0006N A software NMI has occurred on system ThinkSystem SR250. December 27, 2023 12:17:17 AM
28 System FQXSPIO0015M Fault in slot 2 on system ThinkSystem SR250. December 27, 2023 12:17:17 AM
29 System FQXSFIO0010M An Uncorrectable PCIe Error has Occurred at Bus 0000 Device 01 Function 00. The Vendor ID for the device is 8086 and the Device ID is 1901. The Physical slot number is 2. December 27, 2023 12:17:14 AM
30 System FQXSPCA2015I Sensor CPU Overtemp has deasserted the transition from normal to non-critical state. December 27, 2023 12:07:48 AM
31 System FQXSPCA0015J Sensor CPU Overtemp has transitioned from normal to non-critical state. December 27, 2023 12:07:45 AM
32 System FQXSPPW0009I Host Power has been Power Cycled. December 27, 2023 12:06:07 AM
33 System FQXSPIO0011N An Uncorrectable Error has occurred on CPUs. December 27, 2023 12:05
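
For anyone triaging a similar event log, here is a minimal Python sketch that counts the "Uncorrectable PCIe Error" entries per device and slot. It assumes the log has been exported as plain text in the same format as the dump above; the function name `summarize_pcie_errors` and the `sample` string are illustrative only and not part of any Lenovo tool. Note that vendor ID 8086 is Intel, which may suggest the error is being reported at a PCIe port on the host side rather than by the TPU itself, but that is an interpretation and not something confirmed in this thread.

```python
import re
from collections import Counter

# Matches the "Uncorrectable PCIe Error" lines as they appear in the dump above.
PCIE_ERROR = re.compile(
    r"Uncorrectable PCIe Error has Occurred at Bus (?P<bus>[0-9A-Fa-f]+) "
    r"Device (?P<device>[0-9A-Fa-f]+) Function (?P<function>[0-9A-Fa-f]+)\. "
    r"The Vendor ID for the device is (?P<vendor>[0-9A-Fa-f]+) "
    r"and the Device ID is (?P<dev_id>[0-9A-Fa-f]+)\. "
    r"The Physical slot number is (?P<slot>\d+)\."
)


def summarize_pcie_errors(log_text):
    """Count uncorrectable PCIe errors keyed by (vendor ID, device ID, slot)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PCIE_ERROR.search(line)
        if match:
            key = (match["vendor"], match["dev_id"], int(match["slot"]))
            counts[key] += 1
    return counts


# Four entries copied from the dump above (three PCIe errors, one unrelated line).
sample = """\
19 System FQXSPIO0015M Fault in slot 2 on system ThinkSystem SR250. December 27, 2023 1:58:56 AM
20 System FQXSFIO0010M An Uncorrectable PCIe Error has Occurred at Bus 0000 Device 01 Function 00. The Vendor ID for the device is 8086 and the Device ID is 1901. The Physical slot number is 2. December 27, 2023 1:58:53 AM
21 System FQXSFIO0010M An Uncorrectable PCIe Error has Occurred at Bus 0000 Device 01 Function 00. The Vendor ID for the device is 8086 and the Device ID is 1901. The Physical slot number is 2. December 27, 2023 1:58:52 AM
29 System FQXSFIO0010M An Uncorrectable PCIe Error has Occurred at Bus 0000 Device 01 Function 00. The Vendor ID for the device is 8086 and the Device ID is 1901. The Physical slot number is 2. December 27, 2023 12:17:14 AM
"""

for (vendor, dev_id, slot), count in summarize_pcie_errors(sample).items():
    print(f"vendor {vendor} device {dev_id} slot {slot}: {count} error(s)")
```

Grouping by slot makes it easy to see whether the errors follow the card when it is moved to a different slot.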

@Bachperson
Author

Update: it seems that with no load on the TPU the system does not crash; the BSOD only happens when the TPU is under load.

@Bachperson
Author

Bachperson commented Dec 30, 2023

Update: I tried everything in a different computer and it runs stable with all the same software. I tested various settings in the blade server to no avail. It is extremely frustrating, as it works great for about 30 minutes but will inevitably BSOD. I have found that I can trigger the BSOD by going to Device Manager and attempting to disable one of the TPUs; this does not happen on my other desktop. I have tried both PCIe slots in my Lenovo SR250, forcing Gen 2 on the PCIe bus, enabling PCIe error recovery, etc. Any help or ideas would be greatly appreciated.

@markmghali

I had the same issues on an Unraid server. Any load would cause my whole server to stop working. It was very frustrating. I ended up returning it. He said he was going to make another version, but I haven't heard about it yet.

@Bachperson
Author

I pretty much ended up doing the exact same thing. I got the Mini PCIe TPU with this PCIe to Mini PCIe adapter from Amazon (https://www.amazon.com/dp/B07JBCL1CJ?ref=ppx_pop_mob_ap_share) and haven't had any problems. It works totally fine, so who knows.

@markmghali

> I pretty much ended up doing the exact same thing. I got the Mini PCIe TPU with this PCIe to Mini PCIe adapter from Amazon (https://www.amazon.com/dp/B07JBCL1CJ?ref=ppx_pop_mob_ap_share) and haven't had any problems. It works totally fine, so who knows.

Yeah! That's the same adapter I am using. You only get one TPU, correct?

@mateuszdrab

Same issue here. I just put a dual TPU adapter in the M.2 adapter card, which sits in an M.2 NVMe PCIe adapter from starcom, housed inside a DL380 Gen9.

Running Server 2022, I get a BSOD when I try to pass the device through to a VM (starting the VM, mounting the device back to the host) and also when installing chipset drivers or rescanning devices.

Is the only solution to send the card back and instead get a single TPU mini PCIe adapter?

@Bachperson
Author

Yep, unfortunately I ended up getting the single TPU and PCIe adapter (https://www.amazon.com/dp/B07JBCL1CJ?ref=ppx_pop_mob_ap_share), and it's still going with no problems. I saw there was an update to CodeProject.AI that added dual TPU support, but I doubt that will help with this issue at all, as it seems to be more hardware related than software, but who knows 🤷.

@mateuszdrab

mateuszdrab commented May 1, 2024

> Yep, unfortunately I ended up getting the single TPU and PCIe adapter (https://www.amazon.com/dp/B07JBCL1CJ?ref=ppx_pop_mob_ap_share), and it's still going with no problems. I saw there was an update to CodeProject.AI that added dual TPU support, but I doubt that will help with this issue at all, as it seems to be more hardware related than software, but who knows 🤷.

Well, I'm trying to understand if the issue is the dual TPU accelerator or the adapter card.

I can return the accelerator easily, not sure about the adapter card.

If it's the adapter card, I can try to put the accelerator into something else and lose the second TPU. The key difference is the M key on the PCIe adapter versus the E key on the accelerator; I couldn't find an E to M key PCIe card.

I'll give this a shot with the accelerator and just live with one TPU for now, since the other accelerator versions are out of stock and have long lead times.

@markmghali

markmghali commented May 1, 2024 via email

@Bachperson
Author

Unfortunately, I'm pretty sure it's the adapter card, or more likely an incompatibility between the adapter card and the PC, because I tried it in another computer and it worked fine with no problems.

@mateuszdrab

> Unfortunately, I'm pretty sure it's the adapter card, or more likely an incompatibility between the adapter card and the PC, because I tried it in another computer and it worked fine with no problems.

That's interesting, so you've managed to get both TPUs to work in another PC through the adapter?

What were the specs of that PC? I wonder if it's something chipset/PCIe related, since the card splits one PCIe lane.

@magic-blue-smoke
Owner

@mateuszdrab unfortunately, in rare cases, there are compatibility issues with some motherboards.
To see if this is a compatibility issue or a DOA adapter, could you please test it with another PC, preferably a desktop?

@mateuszdrab

mateuszdrab commented May 3, 2024

> @mateuszdrab unfortunately, in rare cases, there are compatibility issues with some motherboards.
> To see if this is a compatibility issue or a DOA adapter, could you please test it with another PC, preferably a desktop?

I've tested it on two HP Gen9 servers.
I'll try to test it on one of my desktops, but it's even older. My new desktop only has one slot, which is taken by the GPU.

@Bachperson
Author

Sorry for the late reply. The motherboard it did work on is an MSI MPG Z690 CARBON Wifi (MS-7D30).

@mateuszdrab

> Sorry for the late reply. The motherboard it did work on is an MSI MPG Z690 CARBON Wifi (MS-7D30).

Thanks

Are you using them on the host OS or passing them through to a VM? I wonder if you could do a Hyper-V passthrough test (if you're running Windows).

@magic-blue-smoke
Owner

@mateuszdrab HP servers have shown more incompatibility cases, with only a few series having no issues. For this reason, it's better to test with a desktop PC.

I have been collecting incompatible hardware cases; I need to sum them up and publish.

@mateuszdrab

> @mateuszdrab HP servers have shown more incompatibility cases, with only a few series having no issues. For this reason, it's better to test with a desktop PC.
>
> I have been collecting incompatible hardware cases; I need to sum them up and publish.

I've been trying to get it working on my desktop PC.
I haven't had a BSOD yet, but I have not tested Hyper-V passthrough, as some UEFI setting is preventing the device from being assignable to a VM.

@mateuszdrab

mateuszdrab commented May 6, 2024

After further testing, running some inferencing and upgrading Windows on the PC, I've had no BSODs so far.

It looks like the setting needed to enable DDA in Hyper-V is missing from my UEFI firmware, it being consumer grade, so I was not able to try it.

Coral PCIe Accelerator
BIOS kept control of PCI Express for this device.  Not assignable.

To use SR-IOV on this system, the system BIOS must be updated to allow Windows to control PCI Express. Contact your system manufacturer for an update.
SR-IOV cannot be used on this system as the PCI Express hardware does not support Access Control Services (ACS) at any root port. Contact your system vendor for further information.

Perhaps the same feature is what causes the BSOD on server boards.
