Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[defect]: Rare core dump from eduroam network. #5290

Open
jayDema opened this issue Feb 2, 2024 · 1 comment
Open

[defect]: Rare core dump from eduroam network. #5290

jayDema opened this issue Feb 2, 2024 · 1 comment
Labels
defect category: a defect or misbehaviour

Comments

@jayDema
Copy link

jayDema commented Feb 2, 2024

What type of defect/bug is this?

Crash or memory corruption (segv, abort, etc...)

How can the issue be reproduced?

There is a rare crash that occurs on the eduroam freeradius containers. I would estimate about once every 3-6 months.
eduroam is running from realease_3.2.3 tag. The low frequency makes this lower priority.

Note there was a small change to attempt to fix a state mismatch issue in auth.c from Alan. I could look at a previous core from before the change to verify whether this matters. I expect the earlier crash to be the same location though.

Log output from the FreeRADIUS daemon

There were some simultaneous errors occurring at the time of the crash:

proxy: Failed allocating Id for proxied request
Error: Failed adding event handler for socket proxy address * port 56072: Too many readers

Solving these is a separate issue, but may be related to why the crash is happening/possible.

Relevant log output from client utilities

No response

Backtrace from LLDB or GDB

Loading up the core dump, the exception occurred on line 2834 of main/process.c.  The value for proxy_listener was NULL (request given below).
Given line 2822 checks for null, this would mean the value of request->proxy_listener changed between line 2822 and 2834.

2822            if (!request->proxy_listener || !request->proxy_listener->data) {
2823                    proxy_reply_too_late(request);
2824                    return 0;
(gdb)
2825            }
2826
2827            gettimeofday(&now, NULL);
2828
2829            /*
2830             *      Status-Server packets don't count as real packets.
2831             */
2832            if (request->proxy->code != PW_CODE_STATUS_SERVER) {
2833    #ifdef WITH_TCP
2834                    listen_socket_t *sock = request->proxy_listener->data;

The rarity of this issue would be explained by the fact that the chance of a change between the two lines would be quite small  (I am also inclined to think it is related to exceeding the 512 readers limit).

Backtrace:
(gdb) bt
#0  0x000055dd9b4d54f6 in request_proxy_reply (packet=0x7f55ac28d030) at src/main/process.c:2834
#1  0x000055dd9b4b77dd in proxy_socket_recv (listener=0x55dda3f80b10) at src/main/listen.c:2423
#2  0x00007f55db2c3ec4 in fr_event_loop (el=0x55dda3e27fb0) at src/lib/event.c:710
#3  0x000055dd9b4ab4dc in main (argc=<optimized out>, argv=<optimized out>) at src/main/radiusd.c:634


The value for request was:

(gdb) p *request
$2 = {number = 4182020, timestamp = 1706535578, data = 0x0, listener = 0x55dda3f7e780, client = 0x55dda3185f50,
  packet = 0x55dda4a2fea0, username = 0x55dda4a30420, password = 0x0, reply = 0x55dda4a30320, config = 0x55dda4a31ca0,
  state_ctx = 0x7f55b4099e00, state = 0x0, proxy_listener = 0x0, proxy = 0x7f55a8121c00, proxy_reply = 0x7f55d0257bd0,
  home_server = 0x55dda13d6b30, home_pool = 0x55dda2efd400, process = 0x55dd9b4cda20 <request_cleanup_delay>,
  response_delay = {tv_sec = 0, tv_usec = 0}, timer_action = FR_ACTION_TIMER, ev = 0x7f558c0c1ff0,
  handle = 0x55dd9b4aefd0 <rad_authenticate>, rcode = RLM_MODULE_NOOP, module = 0x55dd9b4f4b9f "",
  component = 0x55dd9b4f4dfe "<REQUEST_CLEANUP_DELAY>", delay = 5, master_state = REQUEST_ACTIVE,
  child_state = REQUEST_CLEANUP_DELAY, child_pid = 140006714034112, root = 0x55dd9b5174e0 <main_config>, simul_max = 0,
  simul_count = 0, simul_mpp = 0, priority = RAD_LISTEN_PROXY, max_time = false, in_request_hash = true,
  in_proxy_hash = false, num_proxied_requests = 1, num_proxied_responses = 1, server = 0x55dd9f70f180 "default",
  parent = 0x0, log = {func = 0x7f55db2ef460 <vradlog_request>, lvl = L_DBG_LVL_OFF, indent = 0 '\000'}, options = 2,
  coa = 0x0, num_coa_requests = 0}
@jayDema jayDema added the defect category: a defect or misbehaviour label Feb 2, 2024
@alandekok
Copy link
Member

I think this is just a race condition. :( There's not a lot to be done about it other than add mutexes everywhere, which will slow things down enormously.

I'll take a look to see if there's anything I can do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect category: a defect or misbehaviour
Projects
None yet
Development

No branches or pull requests

2 participants