ganesha crash @lock_entry_dec_ref() #1124

Open
skmprabhu252 opened this issue May 4, 2024 · 1 comment
Labels
bug Need Info Need more information from the reporter

Comments

@skmprabhu252

(gdb) bt
#0  0x00007fa67d71ab8f in raise () from /lib64/libpthread.so.0
#1  0x00007fa67f620e3d in crash_handler (signo=6, info=0x7fa6477ec3f0, ctx=0x7fa6477ec2c0) at /usr/src/debug/gpfs.nfs-ganesha-5.7-ibm018.00.el8.x86_64/MainNFSD/nfs_init.c:256
#2  <signal handler called>
#3  0x00007fa67cf6facf in raise () from /lib64/libc.so.6
#4  0x00007fa67cf42ea5 in abort () from /lib64/libc.so.6
#5  0x00007fa67cf42d79 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#6  0x00007fa67cf68426 in __assert_fail () from /lib64/libc.so.6
#7  0x00007fa67f65d834 in lock_entry_dec_ref (lock_entry=0x7fa66800d950) at /usr/src/debug/gpfs.nfs-ganesha-5.7-ibm018.00.el8.x86_64/SAL/state_lock.c:650
#8  0x00007fa67f660bd3 in process_blocked_lock_upcall (block_data=0x7fa66800ab20) at /usr/src/debug/gpfs.nfs-ganesha-5.7-ibm018.00.el8.x86_64/SAL/state_lock.c:1822
#9  0x00007fa67f65b010 in state_blocked_lock_caller (ctx=0x7fa658000e30) at /usr/src/debug/gpfs.nfs-ganesha-5.7-ibm018.00.el8.x86_64/SAL/state_async.c:82
#10 0x00007fa67f6a824c in fridgethr_start_routine (arg=0x7fa658000e30) at /usr/src/debug/gpfs.nfs-ganesha-5.7-ibm018.00.el8.x86_64/support/fridgethr.c:486
#11 0x00007fa67d7101ca in start_thread () from /lib64/libpthread.so.0
#12 0x00007fa67cf5ae73 in clone () from /lib64/libc.so.6
(gdb) p *lock_entry
$1 = {sle_list = {next = 0x7fa66800d, prev = 0xc2897b211783a425}, sle_owner_locks = {next = 0x0, prev = 0x0}, sle_client_locks = {next = 0x0, prev = 0x0}, sle_state_locks = {next = 0x0, prev = 0x0},
  sle_export_locks = {next = 0x0, prev = 0x0}, sle_export = 0x1923e80, sle_obj = 0x7fa640003298, sle_block_data = 0x7fa66800ab20, sle_owner = 0x0, sle_state = 0x7fa66800d790, sle_blocked = STATE_CANCELED,
  sle_ref_count = -1, sle_lock = {lock_sle_type = FSAL_POSIX_LOCK, lock_type = FSAL_LOCK_W, lock_start = 0, lock_length = 0, lock_reclaim = false}, sle_mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
      __nusers = 0, __kind = -1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 16 times>, "\377\377\377\377", '\000' <repeats 19 times>, __align = 0}}
(gdb)

The problem is that the refcount is negative (sle_ref_count = -1), which trips the assertion in lock_entry_dec_ref() (frame #7, state_lock.c:650).
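To make the failure mode concrete, here is a minimal toy model of an unbalanced decrement. It is not ganesha code: `toy_lock_entry`, `toy_inc_ref`, `toy_dec_ref` and `toy_underflowed` are hypothetical names, and the real check at state_lock.c:650 may be written differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the lock entry; only the refcount is modeled. */
struct toy_lock_entry {
	int32_t ref_count;
};

/* Take one reference. */
static void toy_inc_ref(struct toy_lock_entry *e)
{
	assert(e != NULL);
	e->ref_count++;
}

/* Drop one reference and return the new count. */
static int32_t toy_dec_ref(struct toy_lock_entry *e)
{
	assert(e != NULL);
	return --e->ref_count;
}

/* True once more references were dropped than were ever taken. */
static bool toy_underflowed(const struct toy_lock_entry *e)
{
	return e->ref_count < 0;
}
```

With one `toy_inc_ref()` followed by two `toy_dec_ref()` calls the count ends at -1, the same value seen in the core dump. That is consistent with one extra release somewhere on the blocked-lock path, though the dump alone does not say where.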

I suspect one of the scenarios below is causing this issue:

  1. In state_release_grant(), we call free_cookie with unblock=true even when do_lock_op() does not return success.

  2. In process_blocked_lock_upcall(), we decrement the refcount even when try_to_grant_lock() does not succeed. Specifically, when call_back() inside try_to_grant_lock() fails to grant the lock and returns early, process_blocked_lock_upcall() still decrements the refcount.

             status = call_back(lock_entry->sle_obj,
                                lock_entry);
    
             if (status == STATE_LOCK_BLOCKED) {
                     /* The lock is still blocked, restore its type and
                      * leave it in the list.
                      */
                     lock_entry->sle_blocked = blocked;
                     lock_entry->sle_block_data->sbd_grant_type =
                                                     STATE_GRANT_NONE;
                     LogEntry("Granting callback left lock still blocked",
                              lock_entry);
                     return;
             }
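Scenario 2 can be sketched as a toy model in which the caller drops its reference only when the grant attempt did not leave the lock blocked. This is not a proposed patch. All `toy_*` names are hypothetical, and whether the real process_blocked_lock_upcall() should skip the decrement depends on which reference that decrement is meant to pair with, which the snippet above does not show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status codes and entry; not the ganesha types. */
enum toy_status { TOY_SUCCESS, TOY_LOCK_BLOCKED };

struct toy_entry {
	int ref_count;
	bool blocked;
};

typedef enum toy_status (*toy_grant_cb)(struct toy_entry *);

/* Sample callbacks: one leaves the lock blocked, one grants it. */
static enum toy_status toy_cb_still_blocked(struct toy_entry *e)
{
	(void)e;
	return TOY_LOCK_BLOCKED;
}

static enum toy_status toy_cb_granted(struct toy_entry *e)
{
	(void)e;
	return TOY_SUCCESS;
}

/* Mirrors the early return in the quoted snippet: a still-blocked
 * entry stays on the list and keeps its blocked state. */
static enum toy_status toy_try_to_grant(struct toy_entry *e, toy_grant_cb cb)
{
	enum toy_status status = cb(e);

	e->blocked = (status == TOY_LOCK_BLOCKED);
	return status;
}

/* Assumption under test: release the reference only if the entry
 * is no longer blocked after the grant attempt. */
static void toy_process_upcall(struct toy_entry *e, toy_grant_cb cb)
{
	assert(e != NULL && cb != NULL);
	if (toy_try_to_grant(e, cb) == TOY_LOCK_BLOCKED)
		return; /* entry is still queued, keep the reference */
	e->ref_count--;
}
```

In this model a still-blocked upcall leaves `ref_count` untouched, so a later successful grant is the only path that releases the reference.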
    

The crash is intermittent. I encountered it while running the following test case:

Mount an NFS share using NFSv3 twice on the client machine, and then run the processes below on both mount points.

  1. Process-1: create and delete a file in a loop.
  2. Process-2: try to acquire a blocking write lock (running with 5 threads).
  3. Process-3: try to acquire a blocking read lock (running with 5 threads).
  4. Process-4: try to acquire overlapping byte-range write locks (running with 5 threads).
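For anyone trying to reproduce this, the locker processes above could look roughly like the sketch below. It is an assumption about the test program, not the one actually used: `range_lock` and `lock_cycles` are hypothetical helpers, threading is left out, and the file path is whatever lives on the NFSv3 mount. It uses blocking POSIX locks via `fcntl(F_SETLKW)`, where a length of 0 means "to end of file" (matching lock_start = 0, lock_length = 0 in the dump).

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Apply a blocking POSIX lock operation (F_RDLCK, F_WRLCK or F_UNLCK)
 * to the byte range [start, start + len); len == 0 means to EOF. */
static int range_lock(int fd, short type, off_t start, off_t len)
{
	struct flock fl;

	memset(&fl, 0, sizeof(fl));
	fl.l_type = type;
	fl.l_whence = SEEK_SET;
	fl.l_start = start;
	fl.l_len = len;
	return fcntl(fd, F_SETLKW, &fl);
}

/* Open the file, take the lock, release it, close; repeat.
 * Returns how many iterations acquired the lock. The file is reopened
 * each time because another process may be deleting it concurrently. */
static int lock_cycles(const char *path, short type, off_t start,
		       off_t len, int iterations)
{
	int acquired = 0;

	assert(path != NULL);
	for (int i = 0; i < iterations; i++) {
		int fd = open(path, O_RDWR | O_CREAT, 0644);

		if (fd < 0)
			continue;
		if (range_lock(fd, type, start, len) == 0) {
			acquired++;
			range_lock(fd, F_UNLCK, start, len);
		}
		close(fd);
	}
	return acquired;
}
```

Running `lock_cycles()` with `F_WRLCK`, with `F_RDLCK`, and with overlapping `start`/`len` values from separate processes on the two mount points would approximate Process-2 through Process-4. Note that POSIX locks are owned by the process, so threads inside one process do not block each other.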
@ffilz
Member

ffilz commented May 6, 2024

Do you want to try a fix for those issues and see if it makes any difference?

@ffilz ffilz added bug Need Info Need more information from the reporter labels May 6, 2024