Threadpool does not really fit into the same category as poll and epoll, so I will assume you are referring to threadpool as in "threadpool to handle many connections with one thread per connection".
Pros and cons

threadpool
- Reasonably efficient for small and medium concurrency, can even outperform other techniques.
- Makes use of multiple cores.
- Does not scale well beyond "several hundreds" even though some systems (e.g. Linux) can in principle schedule 100,000s of threads just fine.
- Naive implementation exhibits "thundering herd" problem.
- Apart from context switching and thundering herd, one must consider memory. Each thread has a stack (typically at least a megabyte). A thousand threads therefore take a gigabyte of RAM just for stacks. Even if that memory is not committed, it still takes away considerable address space under a 32-bit OS (not really an issue under 64-bit).
- Threads can actually use epoll, though the obvious way (all threads block on epoll_wait) is of no use, because epoll will wake up every thread waiting on it, so it will still have the same issues.
- Optimal solution: single thread listens on epoll, does the input multiplexing, and hands complete requests to a threadpool. futex is your friend here, in combination with e.g. a fast-forward queue per thread. Although badly documented and unwieldy, futex offers exactly what's needed. epoll may return several events at a time, and futex lets you efficiently and in a precisely controlled manner wake N blocked threads at a time (N being min(num_cpu, num_events) ideally), and in the best case it does not involve an extra syscall/context switch at all.
- Not trivial to implement, takes some care.
fork (a.k.a. old-fashioned threadpool)
- Reasonably efficient for small and medium concurrency.
- Does not scale well beyond "few hundreds".
- Context switches are much more expensive (different address spaces!).
- Scales significantly worse on older systems where fork is much more expensive (deep copy of all pages). Even on modern systems fork is not "free", although the overhead is mostly coalesced by the copy-on-write mechanism. On large datasets which are also modified, a considerable number of page faults following fork may negatively impact performance.
- However, proven to work reliably for over 30 years.
- Ridiculously easy to implement and rock solid: If any of the processes crash, the world does not end. There is (almost) nothing you can do wrong.
- Very prone to "thundering herd".
poll/select
- Two flavours (BSD vs. System V) of more or less the same thing.
- Somewhat old and slow, somewhat awkward usage, but there is virtually no platform that does not support them.
- Waits until "something happens" on a set of descriptors.
  - Allows one thread/process to handle many requests at a time.
  - No multi-core usage.
- Needs to copy list of descriptors from user to kernel space every time you wait. Needs to perform a linear search over descriptors. This limits its effectiveness.
- Does not scale well to "thousands" (in fact, hard limit around 1024 on most systems, or as low as 64 on some).
- Use it because it's portable if you only deal with a dozen descriptors anyway (no performance issues there), or if you must support platforms that don't have anything better. Don't use it otherwise.
- Conceptually, a server becomes a little more complicated than a forked one, since you now need to maintain many connections and a state machine for each connection, and you must multiplex between requests as they come in, assemble partial requests, etc. A simple forked server just knows about a single socket (well, two, counting the listening socket), reads until it has what it wants or until the connection is half-closed, and then writes whatever it wants. It doesn't worry about blocking or readiness or starvation, nor about some unrelated data coming in, that's some other process's problem.
epoll
- Linux only.
- Concept of expensive modifications vs. efficient waits:
  - Copies information about descriptors to kernel space when descriptors are added (epoll_ctl).
    - This is usually something that happens rarely.
  - Does not need to copy data to kernel space when waiting for events (epoll_wait).
    - This is usually something that happens very often.
  - Adds the waiter (or rather its epoll structure) to the descriptors' wait queues.
    - The descriptor therefore knows who is listening and directly signals waiters when appropriate, rather than waiters searching a list of descriptors.
    - The opposite of how poll works.
    - O(1) with small k (very fast) in respect of the number of descriptors, instead of O(n).
- Works very well with timerfd and eventfd (stunning timer resolution and accuracy, too).
- Works nicely with signalfd, eliminating the awkward handling of signals and making them part of the normal control flow in a very elegant manner.
- An epoll instance can host other epoll instances recursively
- Assumptions made by this programming model:
- Most descriptors are idle most of the time, few things (e.g. "data received", "connection closed") actually happen on few descriptors.
- Most of the time, you don't want to add/remove descriptors from the set.
- Most of the time, you're waiting on something to happen.
- Some minor pitfalls:
- A level-triggered epoll wakes all threads waiting on it (this is "works as intended"), therefore the naive way of using epoll with a threadpool is useless. At least for a TCP server, it is no big issue since partial requests would have to be assembled first anyway, so a naive multithreaded implementation won't do either way.
- Does not work as one would expect with file read/writes ("always ready").
- Could not be used with AIO until recently, now possible via eventfd, but requires a (to date) undocumented function.
- If the above assumptions are not true, epoll can be inefficient, and poll may perform equally well or better.
- epoll cannot do "magic", i.e. it is still necessarily O(N) in respect to the number of events that occur.
- epoll plays well with the new recvmmsg syscall, since it returns several readiness notifications at a time (as many as are available, up to whatever you specify as maxevents). This makes it possible to receive e.g. 15 EPOLLIN notifications with one syscall on a busy server, and read the corresponding 15 messages with a second syscall (a 93% reduction in syscalls!). Unluckily, all operations in one recvmmsg invocation refer to the same socket, so it is mostly useful for UDP-based services (for TCP, there would have to be a kind of recvmmsmsg syscall which also takes a socket descriptor per item!).
- Descriptors should always be set to nonblocking, and one should check for EAGAIN even when using epoll, because there are exceptional situations where epoll reports readiness and a subsequent read (or write) will still block.