Add single-threaded mode #214

Open · N7Alpha opened this issue Jul 3, 2023 · 11 comments
Labels: enhancement (New feature or request)

N7Alpha commented Jul 3, 2023

This would entail adding an additional mode to juice_concurrency_t ... something like JUICE_CONCURRENCY_MODE_POLL_AGENT_DIRECTLY

Basically this would mean the user of libjuice would be in charge of polling the agent to get received packets

Here are some of the potential issues I'm aware of

  • If the user doesn't poll frequently enough, the connection could die since keep-alive packets won't be sent.
  • Buffering UDP packets is now at the mercy of the OS/OS socket library. If you receive too much data, packets might be dropped... since protocols built on UDP have to tolerate lost packets anyway, this probably isn't an issue, and if you know the type and amount of data you're sending, as in my use case, this isn't a problem.

Here are some benefits

  • More real-time-friendly code. You can get rid of all mutex contention. In theory you could get rid of mutexes altogether (for this mode), although refactoring that would probably be annoying, and eliminating contention is practically just as good. My use case involves running emulated games on a background thread. You really don't want to get preempted if it can be avoided.
  • There is a non-negligible amount of overhead caused by running an extra OS-level thread. In a broadcast network this problem is exacerbated. I think JUICE_CONCURRENCY_MODE_POLL is intended to solve this problem as well. After looking through the implementation, though, it seems like this behavior isn't implemented yet, so the default mode always behaves like JUICE_CONCURRENCY_MODE_THREAD (I might be wrong about this).

I might be missing a few things here since my knowledge of networking is incomplete. If the only thing preventing this is something like the few potential issues I listed above, then I think punting them to the user of libjuice is reasonable.

paullouisageneau commented Jul 6, 2023

I think this would be a good addition for single-threaded games where the thread spins continuously.

> If you receive too much data, packets might be dropped... since protocols built on UDP have to tolerate lost packets anyway, this probably isn't an issue, and if you know the type and amount of data you're sending, as in my use case, this isn't a problem.

This is clearly the main drawback of this approach. The introduced delay will depend on the polling period achieved by the user. This may induce losses depending on the size of the UDP buffer (which is constrained by OS settings). As you said, the sending pattern will mostly determine whether this is an issue or not. If it features big bursts, you may run into issues; otherwise you should be fine.

> There is a non-negligible amount of overhead caused by running an extra OS-level thread. In a broadcast network this problem is exacerbated. I think JUICE_CONCURRENCY_MODE_POLL is intended to solve this problem as well. After looking through the implementation, though, it seems like this behavior isn't implemented yet, so the default mode always behaves like JUICE_CONCURRENCY_MODE_THREAD (I might be wrong about this).

JUICE_CONCURRENCY_MODE_POLL is actually implemented and makes agents share a single polling thread, which enables scaling in situations where you run multiple agents in parallel. The cost is of course that blocking the message callback of an agent blocks all agents.
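For illustration, here is a minimal sketch of selecting the shared-polling-thread mode when creating an agent. The juice_config_t field names and the callback signature are assumed from juice.h, and the STUN server is a placeholder; treat this as illustrative rather than a verified snippet.

```c
#include <string.h>
#include "juice/juice.h"

/* Minimal sketch: selecting the shared-polling-thread mode discussed above.
 * Field and callback names are assumed from juice.h; the STUN server is a placeholder. */
static void on_recv(juice_agent_t *agent, const char *data, size_t size, void *user_ptr) {
	/* Runs on the single shared polling thread: blocking here stalls every agent. */
	(void)agent; (void)data; (void)size; (void)user_ptr;
}

juice_agent_t *create_poll_mode_agent(void) {
	juice_config_t config;
	memset(&config, 0, sizeof(config));
	config.concurrency_mode = JUICE_CONCURRENCY_MODE_POLL; /* agents share one polling thread */
	config.stun_server_host = "stun.example.org";          /* placeholder server */
	config.stun_server_port = 3478;
	config.cb_recv = on_recv;
	return juice_agent_create(&config);
}
```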

paullouisageneau added the enhancement label on Jul 6, 2023

N7Alpha commented Jul 11, 2023

> This is clearly the main drawback of this approach. The introduced delay will depend on the polling period achieved by the user. This may induce losses depending on the size of the UDP buffer (which is constrained by OS settings). As you said, the sending pattern will mostly determine whether this is an issue or not. If it features big bursts, you may run into issues; otherwise you should be fine.

Okay, cool. I think the solution would be to just document the downsides appropriately. The OS might also have some facility to tell you if you're dropping packets en masse, which would be a nice warning, but I'd have to check.
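As an illustration of the kind of OS facility meant here (POSIX sockets, not a libjuice API — libjuice owns its sockets, so something like this would have to live inside the library or behind a config option): the receive buffer can be enlarged with SO_RCVBUF, and on Linux per-socket drops are visible in /proc/net/udp.

```c
#include <stdio.h>
#include <sys/socket.h>

/* Illustration only: enlarging a UDP socket's receive buffer makes infrequent
 * polling more forgiving. */
int enlarge_recv_buffer(int fd, int bytes) {
	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
		return -1;

	int actual = 0;
	socklen_t len = sizeof(actual);
	getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len); /* kernel may clamp or double the request */
	printf("receive buffer is now %d bytes\n", actual);

	/* On Linux, per-socket drop counts also show up in the "drops" column of
	 * /proc/net/udp, which could back the warning mentioned above. */
	return 0;
}
```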

> JUICE_CONCURRENCY_MODE_POLL is actually implemented and makes agents share a single polling thread, which enables scaling in situations where you run multiple agents in parallel. The cost is of course that blocking the message callback of an agent blocks all agents.

This is how I assumed it was working. I got confused when I noticed there was no branching on concurrency_mode for the first and last modes, but I now realize the interfaces are loaded from an array using the mode as the index.


I looked into implementing this and I think the simplest way forward would be to augment the implementation of JUICE_CONCURRENCY_MODE_POLL (or maybe JUICE_CONCURRENCY_MODE_THREAD?) to support manual polling. So far I think you would have to add a procedure somewhere that will poll an agent (array of agents?). This procedure would not block at all. This would benefit real-time/performance-focused use cases. From what I've seen of the implementation, I think the calls to recvfrom are non-blocking. You could still use juice_config::cb_recv to handle new packets. This would keep the interface more homogeneous. I'm trying to think of a way to implement this that would reduce code duplication, but this might be kind of hacky.

  1. Do you think it would be easier augmenting the implementation of JUICE_CONCURRENCY_MODE_POLL or JUICE_CONCURRENCY_MODE_THREAD to support single-threaded polling? Or should it just be entirely separate despite whatever code duplication that entails?
  2. Is there an advantage to polling an array of agents or is it reasonable to poll them just one at a time in a non-blocking way?
  3. Do you have any tips or know of any potential hazards that would make implementing this difficult?

paullouisageneau (Owner) commented:

You'd indeed need to add a new function to the API, for instance juice_user_poll(), which would only have an effect for connections with JUICE_CONCURRENCY_MODE_USER.
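As a rough sketch of what this could look like from the application side — the function name and zero-argument signature are just the proposal above, and the game-side helpers are hypothetical:

```c
#include "juice/juice.h"

/* Hypothetical stand-ins for the application's per-frame work. */
static void update_simulation(void) {}
static void wait_for_next_tick(void) {} /* e.g. sleep until the next 60 Hz frame */

/* Proposed function; not (yet) part of juice.h. */
int juice_user_poll(void);

void game_thread(juice_agent_t *agent) {
	for (;;) {
		/* Would be non-blocking: drain whatever is queued on the sockets and
		 * invoke cb_recv (and the other callbacks) on this thread, then return. */
		juice_user_poll();

		update_simulation();
		juice_send(agent, "tick", 4); /* existing juice_send(), callable once connected */
		wait_for_next_tick();
	}
}
```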

> 1. Do you think it would be easier augmenting the implementation of `JUICE_CONCURRENCY_MODE_POLL` or `JUICE_CONCURRENCY_MODE_THREAD` to support single-threaded polling? Or should it just be entirely separate despite whatever code duplication that entails?

I think it should be separate (in a dedicated conn_user.c file), both to keep the architecture clear and to allow for specific optimizations.

> 2. Is there an advantage to polling an array of agents or is it reasonable to poll them just one at a time in a non-blocking way?

Yes, recvfrom() typically performs a syscall under the hood, so it would be costly to call it for each socket on each user poll. The implementation should be close to JUICE_CONCURRENCY_MODE_POLL so as to only receive on sockets with available data. The big differences are of course that polling will be non-blocking and called by the user. This means the timeout must always be zero and the next timestamp must be ignored.

> 3. Do you have any tips or know of any potential hazards that would make implementing this difficult?

In the absence of a thread context, you'd need to keep the pollfd array in the registry (contrary to JUICE_CONCURRENCY_MODE_POLL), since it would be very inefficient to reallocate it on every poll. Apart from that, I can't think of a specific point to watch out for.
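A minimal sketch of such a non-blocking user poll, assuming a registry that keeps its pollfd array allocated across calls; the types and function names here are illustrative, not libjuice's actual internals.

```c
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Illustrative registry shape; libjuice's real conn_registry_t differs. The
 * pollfd array stays allocated between calls, as suggested above. */
typedef struct user_registry {
	struct pollfd *pfds; /* one entry per agent socket, kept across calls */
	void **agents;       /* agent associated with each pollfd entry */
	int count;
} user_registry_t;

/* Called by the user: must never block, so the poll timeout is always zero
 * and any "next timestamp" scheduling is simply ignored. */
int user_registry_poll(user_registry_t *reg) {
	int ready = poll(reg->pfds, (nfds_t)reg->count, 0 /* timeout 0: return immediately */);
	if (ready <= 0)
		return ready; /* 0: nothing pending, <0: error */

	for (int i = 0; i < reg->count; ++i) {
		if (!(reg->pfds[i].revents & POLLIN))
			continue;
		char buf[1500];
		ssize_t len;
		/* Drain the socket without blocking. */
		while ((len = recv(reg->pfds[i].fd, buf, sizeof(buf), MSG_DONTWAIT)) > 0) {
			/* hand buf/len to the agent's input path here,
			 * e.g. agent_input(reg->agents[i], buf, (size_t)len) */
		}
	}
	return ready;
}
```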

N7Alpha commented Jul 17, 2023

Thanks for the info, it really helped me get a starting point to work from.


I'm partly through the implementation. I was curious about one thing. It seems like locks wouldn't be needed for this new mode since everything is done on one thread (send/recv). However, looking closer at the codebase, hostname resolution was moved to a background thread since it could block. I think an atomic bool could be used there instead of a lock, and then I could make conn_mode_entry_t::lock_func and conn_mode_entry_t::unlock_func no-ops for JUICE_CONCURRENCY_MODE_USER. Unless locking is needed in more places, candidate gathering? This is a slight problem as agent_send still grabs the lock, so we're going to be wasting some time in WaitForSingleObject(). It might not really matter for performance since agent_send won't contend with the receive code in juice_user_poll(). I could just add if statements to not lock for JUICE_CONCURRENCY_MODE_USER, but that's kind of annoying from a code-maintenance/mental-burden perspective.

1. Am I clear to modify the hostname-resolving code to be lock-free, and then make conn_mode_entry_t::lock_func and conn_mode_entry_t::unlock_func no-ops for JUICE_CONCURRENCY_MODE_USER? I'm not sure what you have in mind for the locking interface and what it is supposed to guarantee. That's the only reason I'm asking.


Well, after thinking about this for a while, making the code lock-free there would be doable, just a bit annoying. I think I could be lazy and basically add an int flags;...

typedef struct conn_mode_entry {
	int (*registry_init_func)(conn_registry_t *registry, udp_socket_config_t *config);
	// ...

	int flags;
	mutex_t mutex;
	conn_registry_t *registry;
} conn_mode_entry_t;

to each concurrency mode, then have either CONN_MODE_NO_LOCK_SEND_RECV_FLAG or CONN_MODE_RESOLVE_NONNUMERIC_HOSTNAMES_SYNCHRONOUSLY_FLAG. The latter, I think, would probably be the nicest, since for my use case I don't care that it blocks (I'm hosting my netcode on what is already a background thread) and the implementation would be localized to one particular spot in libjuice, whereas the former would require adding if statements in quite a few places. Also, you could just branch on the concurrency mode of the agent instead of adding a whole new flags field for the former solution.
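A hedged sketch of how the synchronous-resolution flag might look at the call site; every name below (the reduced struct, the flag value, the two resolution helpers) is a placeholder rather than libjuice's actual code.

```c
/* Placeholder sketch, not libjuice code: a reduced mode entry with the
 * proposed flags field, a hypothetical flag value, and two stand-in
 * resolution paths. */
typedef struct juice_agent juice_agent_t; /* opaque handle, as in juice.h */
typedef struct conn_mode_entry_sketch { int flags; } conn_mode_entry_sketch_t;

#define CONN_MODE_RESOLVE_NONNUMERIC_HOSTNAMES_SYNCHRONOUSLY_FLAG 0x1

int resolve_servers_blocking(juice_agent_t *agent);  /* placeholder: resolve inline */
int resolve_servers_on_thread(juice_agent_t *agent); /* placeholder: current behavior */

int begin_server_resolution(juice_agent_t *agent, const conn_mode_entry_sketch_t *entry) {
	if (entry->flags & CONN_MODE_RESOLVE_NONNUMERIC_HOSTNAMES_SYNCHRONOUSLY_FLAG) {
		/* Blocks the caller, which is fine when the caller is already a
		 * background thread; no resolver thread and no extra locking. */
		return resolve_servers_blocking(agent);
	}
	/* Default: resolve on a dedicated thread, as on master. */
	return resolve_servers_on_thread(agent);
}
```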

1. So does the CONN_MODE_RESOLVE_NONNUMERIC_HOSTNAMES_SYNCHRONOUSLY_FLAG approach sound reasonable to you? If you have some other way you'd prefer to implement this just let me know.

... actually, I think I've got it figured out, I should have a PR soon.

paullouisageneau (Owner) commented:

There is indeed an issue with server name resolution, as it may be run in a dedicated thread. The issue is not only with synchronization inside libjuice, but also that the gathering done callback may be triggered from that thread. Making the resolution always synchronous is indeed a quick fix, but the behavior is not optimal. Another simple approach could be to never trigger callbacks like gathering done in the resolver thread, but to do it in update instead (which is called from the user thread for user poll). In that case you'd need to keep the locking though.

I think it might not be worth removing the locking since the mutex overhead should be negligible if there is no contention. I guess in this use case the lock frequency will typically be around 100 Hz, maybe 1000 Hz at the maximum, while it would need to be orders of magnitude higher for the performance impact to become significant on a typical OS. The syscalls for sending and receiving should be a performance bottleneck way before the mutexes.

> ... actually, I think I've got it figured out, I should have a PR soon.

Nice job, I'll gladly review it.

N7Alpha commented Jul 21, 2023

Thanks for getting back to me. I agree making locking calls for this mode ineffective would be confusing, and could lead to bugs if more concurrent paths are added internally to libjuice. I'm not sure I understand the point you're making about the gathering done callback. Are you referring to the case where I attempt a lock-free but still concurrent refactor? In any case it doesn't really matter. I will keep the locking in since, like you say, it's a small overhead for user-mode locks with little contention.

On a related note, is there any reason why you're using the Win32 *Mutex() functions over the *CriticalSection() functions in thread.h? The latter is more performant because it doesn't require a system call internally; it's all user mode. This article is what I am referencing; it has a nice graph showing the speed difference for acquiring and releasing an uncontended lock. The article is old, though, so the APIs could have improved since then, but if you don't depend on the functionality of the Windows Mutex it seems like it would be free performance.

N7Alpha commented Jul 21, 2023

...Actually, it's getting a little tricky to add locks back in. I'm very tempted to go back to the synchronous solution. The problem is the lock can't be per agent (too much overhead; more potential for lock acquisition-order deadlocks) nor in the registry (you end up with nearly the same thing as JUICE_CONCURRENCY_MODE_POLL, a single synchronization point between all threads). Basically I'm thinking each lock needs to be thread_local and a reference to that thread_local lock must be stored in each agent's conn_impl. My reasoning is that juice_user_poll calls will run on mutually exclusive sets of agents, so only one lock is needed per set. In my case I will have potentially many threads all using JUICE_CONCURRENCY_MODE_USER with distinct sets of agents. thread_local variables are wasteful and non-portable, so some kind of plumbing will need to be done between the user and libjuice to get a mutex per thread. Also, even if you do this, you now have to write extra logic to lock and update the reference to the thread_local mutex in conn_impl in case the user starts moving agents between threads. With the nonconcurrent solution none of this is needed; the only thing the user needs to ensure is that agents are accessed non-concurrently in their code. Implementing all of this is doable, but it still doesn't feel like a great solution. I think punting the potential blocking issue to the user and at least documenting the behavior is acceptable. As I see it now, I don't think I'm going to implement it with locking unless you or I can come up with a simpler solution or workaround.

paullouisageneau (Owner) commented:

> I'm not sure I understand the point you're making about the gathering done callback. Are you referring to the case where I attempt a lock-free but still concurrent refactor?

My point is that if you keep the locks and the asynchronous name resolution as they are, the gathering done callback may be called in the resolver thread instead of the user poll thread, which the user doesn't expect. Since there won't be synchronization on the user side, it may cause a race condition in user code.

> On a related note, is there any reason why you're using the Win32 *Mutex() functions over the *CriticalSection() functions in thread.h? The latter is more performant because it doesn't require a system call internally; it's all user mode. This article is what I am referencing; it has a nice graph showing the speed difference for acquiring and releasing an uncontended lock. The article is old, though, so the APIs could have improved since then, but if you don't depend on the functionality of the Windows Mutex it seems like it would be free performance.

The reason is that on Windows critical sections are not recursive whereas mutexes are, if I'm not mistaken (the API design looks quite inconsistent to me compared to pthreads). libjuice uses a recursive mutex to make the API reentrant and allow calling functions from callbacks. This is not the most efficient design but it keeps things simpler. In practice, a possible enhancement could be using critical sections for normal mutexes and Windows mutexes for recursive ones.

> Actually, it's getting a little tricky to add locks back in. I'm very tempted to go back to the synchronous solution. The problem is the lock can't be per agent (too much overhead; more potential for lock acquisition-order deadlocks) nor in the registry (you end up with nearly the same thing as JUICE_CONCURRENCY_MODE_POLL, a single synchronization point between all threads). Basically I'm thinking each lock needs to be thread_local and a reference to that thread_local lock must be stored in each agent's conn_impl. My reasoning is that juice_user_poll calls will run on mutually exclusive sets of agents, so only one lock is needed per set. In my case I will have potentially many threads all using JUICE_CONCURRENCY_MODE_USER with distinct sets of agents.

My understanding was that all agents in JUICE_CONCURRENCY_MODE_USER would share the same registry and would all be polled together by juice_user_poll, as that's how concurrency modes are designed. The current design does not allow you to define such sets, so how do you plan to achieve this?

N7Alpha commented Jul 21, 2023

Thanks for getting back to me so quickly.


> My point is that if you keep the locks and the asynchronous name resolution as they are, the gathering done callback may be called in the resolver thread instead of the user poll thread, which the user doesn't expect. Since there won't be synchronization on the user side, it may cause a race condition in user code.

I'm still not following, unfortunately. If the locks and asynchronous name resolution [are left] as they are, I figure this would be the exact same implementation as on master. Unless this is an existing issue that you just discovered? In that case, is the issue that resolver_thread_entry calls agent_update_gathering_done, which calls agent_resolve_servers, which can call agent->config.cb_gathering_done, and you just realized that the user doesn't expect [this]? In the case that it's not an existing issue, which novel thing am I doing that now makes this an issue?

When you say "Since there won't be synchronization on the user side, it may cause a race condition in user code", it makes me think I've definitely misunderstood something you've said, because agent_update_gathering_done is called while the lock is held from the resolver thread, so it should be locked in the user code while in the callback.


> The reason is that on Windows critical sections are not recursive whereas mutexes are, if I'm not mistaken (the API design looks quite inconsistent to me compared to pthreads). libjuice uses a recursive mutex to make the API reentrant and allow calling functions from callbacks. This is not the most efficient design but it keeps things simpler. In practice, a possible enhancement could be using critical sections for normal mutexes and Windows mutexes for recursive ones.

According to this section from MSDN...

> After a thread has ownership of a critical section, it can make additional calls to EnterCriticalSection or TryEnterCriticalSection without blocking its execution. This prevents a thread from deadlocking itself while waiting for a critical section that it already owns. The thread enters the critical section each time EnterCriticalSection and TryEnterCriticalSection succeed. A thread must call LeaveCriticalSection once for each time that it entered the critical section.

it sounds like CRITICAL_SECTION has the recursive locking ability you describe. The API looked fairly similar at first glance to Microsoft's Mutex API, however I can't comment on pthreads'. By reentrant you just mean with respect to the callback functions, I assume? I didn't even think about this, so I can see why the locks need to be recursive now.
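For illustration, a minimal recursive lock built on CRITICAL_SECTION along the lines of the MSDN excerpt above; this is not libjuice's thread.h, just a sketch of the idea.

```c
#include <windows.h>

/* Illustration (not libjuice's thread.h): CRITICAL_SECTION already lets the
 * owning thread re-enter, so it can back a recursive mutex without the
 * kernel-object cost of CreateMutex/WaitForSingleObject. */
typedef struct { CRITICAL_SECTION cs; } my_mutex_t;

static void my_mutex_init(my_mutex_t *m)    { InitializeCriticalSection(&m->cs); }
static void my_mutex_lock(my_mutex_t *m)    { EnterCriticalSection(&m->cs); }  /* re-entrant for the owner */
static void my_mutex_unlock(my_mutex_t *m)  { LeaveCriticalSection(&m->cs); }  /* one Leave per Enter */
static void my_mutex_destroy(my_mutex_t *m) { DeleteCriticalSection(&m->cs); }

static void reentrant_example(my_mutex_t *m) {
	my_mutex_lock(m);
	my_mutex_lock(m); /* same thread: does not deadlock */
	my_mutex_unlock(m);
	my_mutex_unlock(m);
}
```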


> My understanding was that all agents in JUICE_CONCURRENCY_MODE_USER would share the same registry and would all be polled together by juice_user_poll, as that's how concurrency modes are designed. The current design does not allow you to define such sets, so how do you plan to achieve this?

For me, my whole motivation was removing that single synchronization point; otherwise I'd have just used JUICE_CONCURRENCY_MODE_POLL. The reasoning is that it can be a source of sporadic preemption (not a large one). I'm not an expert on OS scheduling, but it irks me to knowingly insert something that could potentially cause lag spikes in a game.

For the locking implementation, my current thought is:

  1. A thread (potentially one-of-many) is created that will use libjuice
  2. Said thread creates tl_mutex for use between it and libjuice exclusively
  3. When creating an agent a new field is added for the mutex to juice_config_t (it is only used for this concurrency mode)
  4. Its pointer is passed to conn_impl_t via downcasting for JUICE_CONCURRENCY_MODE_USER
  5. juice_user_poll takes the tl_mutex as an argument
    • for all agents if agent->conn_impl->mutex != tl_mutex lock the agent's mutex then update it to tl_mutex and unlock the agent's mutex (this might be a data race on agent->conn_impl->mutex? in that case maybe some fiddly atomic access could fix this?)
    • poll()
    • lock(tl_mutex) and update agents and call callbacks accordingly
    • unlock(tl_mutex), return from juice_user_poll
  6. Threads can share agents as long as they ensure the agents aren't accessed concurrently; goto 5

I don't need the agent-sharing part. I figured I'd describe it since I already commented about it. If you remove that and use _Thread_local then I think the implementation becomes more straightforward. The cost of using _Thread_local is portability. It seems people are using libjuice for IoT/embedded, so this would be unacceptable, I think.


If we go with the synchronous non-locking option, I could make a PR on a soonish timeframe. If the locking and async name resolution are essential to the implementation, then I could just maintain this feature as a separate shallow fork for the time being, and if you or I come up with a good solution then I'll make a PR. It ultimately comes down to what you think is an acceptable implementation, since you will mostly be the one maintaining it.

paullouisageneau (Owner) commented:

> I'm still not following, unfortunately.

Sorry if I wasn't clear. My point was simply that if JUICE_CONCURRENCY_MODE_USER is expected to be single-threaded, callbacks must never be called from threads other than the user thread, regardless of locking in libjuice.

So I was mentioning that while calling the gathering done callback from the resolver thread is currently fine, it would need to be changed in single-threaded mode. It's no big deal, as making the resolution synchronous is a way to solve the issue.

> it sounds like CRITICAL_SECTION has the recursive locking ability you describe. The API looked fairly similar at first glance to Microsoft's Mutex API, however I can't comment on pthreads'. By reentrant you just mean with respect to the callback functions, I assume? I didn't even think about this, so I can see why the locks need to be recursive now.

Oh, I missed that. Thanks for the details, I'll update the implementation to use critical sections on Windows then.

> For me, my whole motivation was removing that single synchronization point; otherwise I'd have just used JUICE_CONCURRENCY_MODE_POLL. The reasoning is that it can be a source of sporadic preemption (not a large one). I'm not an expert on OS scheduling, but it irks me to knowingly insert something that could potentially cause lag spikes in a game.
>
> For the locking implementation, my current thought is:
>
> 1. A thread (potentially one-of-many) is created that will use libjuice
>
> 2. Said thread creates `tl_mutex` for use between it and libjuice _exclusively_
>
> 3. When creating an agent a new field is added for the mutex to `juice_config_t` (it is only used for this concurrency mode)
>
> 4. Its pointer is passed to `conn_impl_t` via downcasting for `JUICE_CONCURRENCY_MODE_USER`
>
> 5. `juice_user_poll` takes the `tl_mutex` as an argument
>    * for all agents `if agent->conn_impl->mutex != tl_mutex` lock the agent's mutex then update it to `tl_mutex` and unlock the agent's mutex (this might be a data race on `agent->conn_impl->mutex`? in that case maybe some fiddly atomic access could fix this?)
>    * `poll()`
>    * `lock(tl_mutex)` and update agents and call callbacks accordingly
>    * `unlock(tl_mutex)`, return from `juice_user_poll`
>
> 6. Threads can share agents as long as they ensure the agents aren't accessed concurrently; goto 5
>
> I don't need the agent-sharing part. I figured I'd describe it since I already commented about it. If you remove that and use _Thread_local then I think the implementation becomes more straightforward. The cost of using _Thread_local is portability. It seems people are using libjuice for IoT/embedded, so this would be unacceptable, I think.

This sounds extremely convoluted to me. Also, I wouldn't like exposing platform-dependent internal types like mutexes in the API.

> If we go with the synchronous non-locking option, I could make a PR on a soonish timeframe. If the locking and async name resolution are essential to the implementation, then I could just maintain this feature as a separate shallow fork for the time being, and if you or I come up with a good solution then I'll make a PR. It ultimately comes down to what you think is an acceptable implementation, since you will mostly be the one maintaining it.

I think the synchronous non-locking option is fine. However it is still unclear to me how you plan to implement the sets of agents.

Why not simply make agents independent and poll them separately with a juice_agent_user_poll(juice_agent_t *agent) function? It is simple to implement, it removes any single synchronization point, and the overhead compared to polling multiple agents at once will be negligible if there are only a few agents per thread.
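A sketch of what a per-thread loop over its own set of agents could look like under this proposal; juice_agent_user_poll() is only the signature suggested above, not an existing libjuice function.

```c
#include <stddef.h>
#include "juice/juice.h"

/* Proposed per-agent poll; only the suggestion above, not (yet) part of juice.h. */
int juice_agent_user_poll(juice_agent_t *agent);

/* Hypothetical per-thread tick: each thread only ever touches its own agents,
 * so there is no cross-thread synchronization point. */
void net_thread_tick(juice_agent_t **agents, size_t count) {
	for (size_t i = 0; i < count; ++i) {
		/* Non-blocking: drains this agent's sockets and fires its callbacks
		 * on the calling thread. */
		juice_agent_user_poll(agents[i]);
	}
	/* ...then run this tick's game/netcode work and sleep until the next one. */
}
```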

N7Alpha commented Aug 13, 2023

Okay, I see now. Yeah, the callback definitely shouldn't be called from another thread, since that would partially defeat the purpose of manually polling. Thanks for explaining it clearly to me.

> Oh, I missed that. Thanks for the details, I'll update the implementation to use critical sections on Windows then.

Cool, it definitely makes me less concerned about performance issues, since it basically turns an uncontended lock into a counter operation instead of a system call.

> This sounds extremely convoluted to me. Also, I wouldn't like exposing platform-dependent internal types like mutexes in the API.

Yeah I am in agreement on this.

> I think the synchronous non-locking option is fine. However it is still unclear to me how you plan to implement the sets of agents.

At least in the non-locking case this is easy to solve, because you just have to trust that the user doesn't access an agent concurrently or do some operation that would cause concurrent access. If you're talking about the locking case, I don't really know if you can in a nice way, but it doesn't really matter since using a single agent in juice_agent_user_poll(juice_agent_t *agent) would make this a non-issue.

> Why not simply make agents independent and poll them separately with a juice_agent_user_poll(juice_agent_t *agent) function? It is simple to implement, it removes any single synchronization point, and the overhead compared to polling multiple agents at once will be negligible if there are only a few agents per thread.

Yeah, I think this would work well for my use case. I could add back locking, since a userland lock is virtually no cost and wouldn't be a global lock / single synchronization point.


Sorry it's taken a while to get back; I've been distracted by the other things I've been working on.

I have a seemingly working implementation (at least for one agent). I'm not ready to merge anything yet, though, for a few reasons. I think once I finish the netcode for the project I'm working on, I'll be ready to submit a PR. Thanks for the help, though; I wouldn't have known how to approach implementing this without your guidance.
