folly::ThreadCachedInt uses folly::ThreadLocal to store thread-specific objects, each of which has a local counter. When incrementing, the local instance of the thread is incremented. If the local counter exceeds the cache size, the value is flushed to the global counter with an atomic increment. It is this global counter that is read by readFast using a simple load, but does not count any updates that have not been flushed.
In order to read the exact value, ThreadCachedInt uses the folly::ThreadLocal API's readAllThreads() extension to iterate over all references to all related instances of thread local objects. This currently requires getting a global mutex and iterating over the references, collecting the counters along with the global counter. This also means that the first use of the object from a new thread acquires the mutex to insert the thread-local reference into the list. By default, ThreadCachedInt uses one global mutex per integer type. If you plan to use a lot of ThreadCachedInts in your application, consider splitting the global mutex by introducing additional tag template parameters.
set simply sets the value of a global counter and marks all thread-local instances as requiring a reset. When iterating with readFull, thread-local counters that have been marked as reset are skipped. When incrementing, thread local counters marked for reset are set to zero and not marked for reset.
Upon destruction, thread-local counters are flushed to the parent, so the counts are not lost after increments in temporary threads. This requires grabbing the global mutex to make sure the parent itself hasn't already been destroyed in another thread.
There are of course many ways to skin the cat, and you may note that there is a partial alternative implementation in folly/test/ThreadCachedIntTest.cpp that provides similar performance. ShardedAtomicInt simply uses std::atomic<int64_t> arrays and hashes threads between them to do low-content atomic increments, and readFull just sums all the ints.
That sounds great, but to keep contention low enough to achieve similar performance to ThreadCachedInt with 24 threads, ShardedAtomicInt needs about 2000 ints to hash. This uses about 20x more memory, and readFull without a lock has to sum all 2048 ints, which ends up being about 50x slower than ThreadCachedInt in low contention situations, which is hopefully the common case since it's designed for high write, low read access patterns. The performance of readFull is about as fast as ThreadCachedInt in high contention environments.
Depending on the operational conditions, it may make more sense to use one implementation over the other. For example, an environment with lower contention will likely be able to use ShardedAtomicInt with a much smaller array without losing performance, while improving memory consumption and readFull performance.