kernel BUG at net/core/skbuff.c:4044! #8771

Open
dracoding opened this issue Apr 28, 2024 · 7 comments
@dracoding

dracoding commented Apr 28, 2024

Calico is running in BGP networking mode. With GRO and GSO enabled, the kernel crashes at random.

Expected Behavior

The kernel should not crash when GRO/GSO are enabled.

Current Behavior

The stack trace is as follows.

[16194369.907056] kernel BUG at net/core/skbuff.c:4044!
[16194369.907097] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[16194369.907116] CPU: 13 PID: 0 Comm: swapper/13 Kdump: loaded Tainted: G E 5.16.16-1.el7.elrepo.x86_64 #1
[16194369.907151] Hardware name: New H3C Technologies Co., Ltd. H3C UniServer R4900 G5/RS35M2C16SE, BIOS 5.66 08/02/2023
[16194369.907181] RIP: 0010:skb_segment+0xbc8/0xe00
[16194369.907203] Code: 01 e9 41 89 8e b8 00 00 00 e9 e7 fe ff ff 44 89 c0 39 54 24 7c 0f 86 21 ff ff ff 31 c9 8b 74 24 7c 29 d6 09 f1 e9 07 ff ff ff <0f> 0b a8 01 75 0d 48 81 38 70 b0 7d b9 0f 84 91 fa ff ff 48 8b 7c
[16194369.907256] RSP: 0018:ffffa3f2cce08728 EFLAGS: 00010293
[16194369.907274] RAX: 000000000000007d RBX: 00000000fffff7b3 RCX: 0000000000000011
[16194369.907296] RDX: 0000000000000000 RSI: ffff895ea32c76c0 RDI: 00000000000008c1
[16194369.907317] RBP: ffffa3f2cce087f8 R08: 000000000000088f R09: 0000000000000011
[16194369.907338] R10: 000000000000090c R11: ffff895e47e68000 R12: ffff895eb2022f00
[16194369.907360] R13: 000000000000004b R14: ffff895ecdaf2000 R15: ffff895eb2023f00
[16194369.907381] FS: 0000000000000000(0000) GS:ffff899cbfb40000(0000) knlGS:0000000000000000
[16194369.907405] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16194369.907423] CR2: 00007f6c6b9d6a38 CR3: 0000000128c34002 CR4: 0000000000770ee0
[16194369.907445] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[16194369.907466] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[16194369.907488] PKRU: 55555554
[16194369.907497] Call Trace:
[16194369.907507]
[16194369.907521] tcp_gso_segment+0xf0/0x520
[16194369.907538] tcp4_gso_segment+0x53/0xd0
[16194369.907552] inet_gso_segment+0x150/0x3c0
[16194369.907568] skb_mac_gso_segment+0xa1/0x120
[16194369.907585] skb_udp_tunnel_segment+0x259/0x5c0
[16194369.907601] udp4_ufo_fragment+0x131/0x190
[16194369.907616] inet_gso_segment+0x150/0x3c0
[16194369.907632] ? bpf_prog_df66ced5f148853b_calico_tc_skb_accepted_entrypoint+0x1a6c/0x2c6c
[16194369.907658] skb_mac_gso_segment+0xa1/0x120
[16194369.907673] __skb_gso_segment+0xce/0x190
[16194369.907687] ? netif_skb_features+0xc6/0x2c0
[16194369.907702] validate_xmit_skb+0x15e/0x2b0
[16194369.907716] __dev_queue_xmit+0x234/0xc40
[16194369.907732] ? vlan_dev_hard_start_xmit+0x99/0xf0 [8021q]
[16194369.907751] dev_queue_xmit+0x10/0x20
[16194369.907764] __bpf_redirect+0x1a8/0x320
[16194369.907778] skb_do_redirect+0xed/0x100
[16194369.907793] __netif_receive_skb_core+0xe25/0xf70
[16194369.908489] ? dev_queue_xmit+0x10/0x20
[16194369.909143] ? __bpf_redirect+0x1a8/0x320
[16194369.909761] __netif_receive_skb_list_core+0x12a/0x2b0
[16194369.910362] netif_receive_skb_list_internal+0x1da/0x300
[16194369.910955] ? dev_gro_receive+0x1b3/0x3a0
[16194369.911565] gro_normal_list.part.0+0x1e/0x40
[16194369.912164] gro_normal_one+0x7c/0x90
[16194369.912754] napi_gro_complete+0x7c/0xe0
[16194369.913329] napi_gro_flush+0xb1/0x100
[16194369.913868] napi_complete_done+0xfe/0x190
[16194369.914401] ice_napi_poll+0x146/0x2a0 [ice]
[16194369.914980] __napi_poll+0x2e/0x150
[16194369.915477] net_rx_action+0x221/0x2d0
[16194369.915939] __do_softirq+0xdd/0x2c0
[16194369.916372] irq_exit_rcu+0xa4/0xc0
[16194369.916834] common_interrupt+0x8a/0xa0
[16194369.917254]
[16194369.917663]
[16194369.918062] asm_common_interrupt+0x1e/0x40
[16194369.918464] RIP: 0010:cpu_idle_poll+0x36/0x100

Possible Solution

Disabling GRO and GSO works around the crash:

ethtool --offload eth0 gro off
ethtool --offload eth0 gso off
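
To confirm that the workaround actually took effect on a node, the feature list printed by `ethtool -k` can be checked. The following is a minimal sketch, not part of the original report: the helper names `parse_offloads` and `offloads_disabled` are hypothetical, the interface name `eth0` is taken from the commands above, and the feature labels `generic-receive-offload` and `generic-segmentation-offload` are the ones `ethtool -k` normally prints, so check them against your ethtool version.

```python
import subprocess


def parse_offloads(ethtool_output):
    """Map feature name -> True (on) / False (off) from `ethtool -k <dev>` output."""
    features = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue
        words = value.split()
        # Lines look like "generic-receive-offload: off" or "tx-checksumming: on [fixed]".
        if words and words[0] in ("on", "off"):
            features[name.strip()] = words[0] == "on"
    return features


def offloads_disabled(dev="eth0"):
    """Return True only if both GRO and GSO are reported as off for `dev`."""
    out = subprocess.run(
        ["ethtool", "-k", dev], capture_output=True, text=True, check=True
    ).stdout
    feats = parse_offloads(out)
    # A feature missing from the output is treated as "not confirmed off".
    return (
        feats.get("generic-receive-offload") is False
        and feats.get("generic-segmentation-offload") is False
    )


if __name__ == "__main__":
    print("GRO and GSO disabled:", offloads_disabled("eth0"))
```

Note that settings made with `ethtool --offload` are not persistent across reboots, so a check like this is only meaningful if it runs after the workaround has been reapplied.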

Context

The patch mentioned in #6865 doesn't work for me.

Analyzing the vmcore shows that it crashed at BUG_ON(skb_headlen(list_skb) > len).
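
For readers unfamiliar with that check, here is a rough model of the condition. It is a simplified sketch based on how upstream skb_segment() computes `len` for each output segment (the remaining bytes, capped at the MSS, which is `gso_size`), and it is not the kernel code itself. The function names are hypothetical, and the values 1448 and 3000 in the example are made up for illustration: the report does not state the linear length of the frag_list skb or the total length. Only `mss = 75` comes from the report.

```python
def segment_len(head_len, offset, mss):
    """Length of the next segment: the bytes still to be emitted, capped at mss."""
    return min(head_len - offset, mss)


def would_hit_bug_on(list_skb_headlen, head_len, offset, mss):
    """Model of BUG_ON(skb_headlen(list_skb) > len).

    The check fires when the linear (head) part of a frag_list skb is
    larger than the segment that is about to be built.
    """
    return list_skb_headlen > segment_len(head_len, offset, mss)


# gso_size = 75 is from the report; the other numbers are hypothetical.
print(would_hit_bug_on(list_skb_headlen=1448, head_len=3000, offset=0, mss=75))
print(would_hit_bug_on(list_skb_headlen=75, head_len=3000, offset=0, mss=75))
```

Under this model, any frag_list member whose linear data is longer than the current `gso_size` trips the check, which is why a `gso_size` that ends up smaller than the size the packet was originally aggregated with is a plausible suspect.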

The gso_size is 75, and the frag_list has one element, whose head_frag is 1. The skb_shared_info struct is as follows.

struct skb_shared_info {
  nr_frags = 17 '\021',
  gso_size = 75,
  gso_segs = 0,
  frag_list = 0xffff895eb2022f00,
  gso_type = 1035,
  destructor_arg = 0x2d656c6261747372,
  frags = {{
      bv_page = 0xfffff80e86d4d180,
      bv_len = 125,
      bv_offset = 2306
    },
    ....
  }
}
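
The `gso_type = 1035` value in the dump above is a bitmask, and decoding it helps when reading the call trace. The sketch below assumes the SKB_GSO_* bit order from include/linux/skbuff.h as I recall it for 5.x kernels. The table is reproduced from memory and should be verified against the exact kernel tree in use. The helper name `decode_gso_type` is hypothetical.

```python
# Assumed bit order of the SKB_GSO_* enum (bit 0 first); verify against your kernel source.
SKB_GSO_FLAGS = [
    "SKB_GSO_TCPV4",
    "SKB_GSO_DODGY",
    "SKB_GSO_TCP_ECN",
    "SKB_GSO_TCP_FIXEDID",
    "SKB_GSO_TCPV6",
    "SKB_GSO_FCOE",
    "SKB_GSO_GRE",
    "SKB_GSO_GRE_CSUM",
    "SKB_GSO_IPXIP4",
    "SKB_GSO_IPXIP6",
    "SKB_GSO_UDP_TUNNEL",
    "SKB_GSO_UDP_TUNNEL_CSUM",
    "SKB_GSO_PARTIAL",
    "SKB_GSO_TUNNEL_REMCSUM",
    "SKB_GSO_SCTP",
    "SKB_GSO_ESP",
    "SKB_GSO_UDP",
    "SKB_GSO_UDP_L4",
    "SKB_GSO_FRAGLIST",
]


def decode_gso_type(value):
    """Return the names of all flag bits set in a gso_type value."""
    return [name for bit, name in enumerate(SKB_GSO_FLAGS) if value & (1 << bit)]


print(decode_gso_type(1035))
# ['SKB_GSO_TCPV4', 'SKB_GSO_DODGY', 'SKB_GSO_TCP_FIXEDID', 'SKB_GSO_UDP_TUNNEL']
```

If the table is right, the TCPv4 and UDP tunnel bits match the tcp4_gso_segment and skb_udp_tunnel_segment frames in the trace. The DODGY bit together with `gso_segs = 0` would be consistent with a helper such as bpf_skb_adjust_room() having modified a GSO packet, although the dump alone does not prove that.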

In BGP mode, does the eBPF program call bpf_skb_adjust_room() and adjust the gso_size?

Your Environment

Calico version: v3.24.5

@matthewdupre
Member

@tomastigera @sridhartigera any thoughts?

@tomastigera
Contributor

tomastigera commented Apr 29, 2024

@dracoding what kernel do you use?

ebpf will call the bpf_skb_adjust_room() to adjust the gso_size?

yes, that should happen within the kernel automatically, outside of calico's code (so I assume you are using ebpf)

@dracoding
Author

dracoding commented Apr 30, 2024

@dracoding what kernel do you use?

my kernel version is 5.16.20.

ebpf will call the bpf_skb_adjust_room() to adjust the gso_size?

yes, that should happen within the kernel automatically, outside of calico's code (so I assume you are using ebpf)

Yes, I'm using Calico with eBPF mode enabled. There are no eBPF programs on the node other than Calico's.

@tomastigera
Contributor

my kernel version is 5.16.20

what distro is it?

@fasaxc
Member

fasaxc commented Apr 30, 2024

FWIW, a kernel BUG panic means there's a bug in the kernel, not in Calico. We'll do what we can, but please can you report it to your distro vendor. To have a chance of figuring it out, we'll need to know the exact details of the kernel, distro, and hardware that you're using, along with details of the workload that is causing the problem.

Please can you also try a more recent kernel? There have been bugs like this in the past, and it's quite possible this one is already fixed upstream.

@dracoding
Author

what distro is it?

CentOS Linux release 7.8.2003 (Core)

@dracoding
Author

dracoding commented May 6, 2024

FWIW, a kernel BUG panic means there's a bug in the kernel, not in Calico. We'll do what we can, but please can you report it to your distro vendor. To have a chance of figuring it out, we'll need to know the exact details of the kernel, distro, and hardware that you're using, along with details of the workload that is causing the problem.

Please can you also try a more recent kernel? There have been bugs like this in the past, and it's quite possible this one is already fixed upstream.

It only happens in the cluster with Calico eBPF mode enabled, so maybe that is what triggers the kernel bug. It doesn't happen frequently, maybe once every few months. I'm not sure which workload causes the problem.

I will try a more recent kernel, but it may take a long time to test.

Distro: CentOS Linux release 7.8.2003 (Core).
Kernel: 5.16.20, built from upstream.
NIC: Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02).

I can provide any other hardware information you need.
