This is part 3 of the pokeqt series, where I do poker-related things because I liked a girl who did.
- How [not] to run a Monte Carlo simulation
- Control Variates and the Satisfying Telescoping Sum
- Stacking chips and CUDA cores
We've finally arrived at the implementation section, woo. The source code for all of these experiments can be found here. I even included my debugging code that prints things out, as a Christmas gift.
Branchless programming
The first thing we need to do is write an analytical solver for the heads-up setting. Now, I could use one of a bazillion existing solutions on the internet, but that's no fun. Also, I'm writing it all in branchless logic so that I can later run it on CUDA, because branching ruins a warp's parallelism. So here are a couple of interesting functions that I gathered from around the internet — note that all of these operate on unsigned integers.
With all that being said, you can safely skip this section. I used a bunch of bit tricks because hardware-dependent intrinsics aren't very portable; but apparently CUDA couldn't care less, as it has its own set of intrinsics mirroring the C++ ones. Regardless, I'm still listing them here because 1. it's cool, and 2. you gotta go through the same pain I did.
Get the maximum of two numbers
C++
We will use this to compare the computed evaluations of hands.
uint32_t get_max(uint32_t a, uint32_t b) {
return a ^ ((a ^ b) & -(a < b));
}
If a < b, -(a < b) evaluates to -1, which as an unsigned value is all 1s, so we get a ^ (a ^ b) = b. Otherwise, -(a < b) evaluates to 0, which leaves us with only a.
CUDA
I profiled the code and apparently the ternary operator doesn’t change performance. So:
__device__ uint32_t get_max(uint32_t a, uint32_t b) {
return (a > b) ? a : b;
}
Keep only the most-significant non-zero bit
C++
This is used to keep only the largest hand (e.g. top flush).
uint16_t keep_top_bit(uint16_t num) {
num |= num >> 1;
num |= num >> 2;
num |= num >> 4;
num |= num >> 8;
return num - (num >> 1);
}
Basically, the chain of ors sets every bit after the most-significant non-zero bit to 1 (the trick is called bit-smearing). Then we shift the smear right by 1, so it has one less set bit, and subtract it from the full smear. Voila, only the MSB is left. You can add another num |= num >> 16; if you want this to work with 32-bit integers.
I might be the only one who finds the name bit-smearing kinda funny.
CUDA
CUDA has its own equivalent, __clz (count leading zeros), which returns the number of zero bits above the most-significant (leftmost) set bit — so the index of that bit is just 31 minus the result.
__device__ uint32_t keep_top_bit(uint32_t num) {
return 1ULL << (31 - __clz(num));
}
A couple of interesting tidbits:
- The C++ version __builtin_clz evaluated at 0 has undefined behavior! Use std::countl_zero instead. In contrast, the CUDA version correctly returns 32.
- __clz works on the bit representation, so it will work with signed negative numbers, but one has to be careful of the sign bit.
- Use __clzll if you're working with 64-bit integers.
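Since std::countl_zero came up, here's a minimal C++20 sketch of the same helper built on it; the name keep_top_bit_clz is just my illustration:

```cpp
#include <bit>
#include <cstdint>

// Same idea as the CUDA version, but well-defined for 0:
// std::countl_zero(uint32_t{0}) == 32, and the select maps that case to 0.
uint32_t keep_top_bit_clz(uint32_t num) {
    int lz = std::countl_zero(num);             // number of leading zero bits
    return (num == 0) ? 0 : (1u << (31 - lz));  // the MSB sits at position 31 - lz
}
```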
Get the flush hand
C++
The bit-manipulation version is just a tedious for loop with a lot of boolean casting; the interesting part comes when you try to do this in CUDA.
uint16_t get_flush_helper(uint16_t hand) {
uint16_t flush = 0;
int count = 0;
for (int i = 12; i >= 0; i--) {
uint16_t bit = (hand >> i) & 1;
count += bit;
flush |= (bit * (count <= 5)) << i;
}
// only keep the flush if there are at least 5 cards of the same suit (the loop already kept just the top 5)
return flush * (count >= 5);
}
Basically, this code returns a number keeping only the 5 most-significant set bits if there are at least 5 set bits, and 0 otherwise.
CUDA
You can use the intrinsic __fns to help simplify things immensely:
__device__ uint32_t get_flush_helper(uint32_t hand) {
uint32_t idx5 = __fns(hand, 31, -5);
return (idx5 > 31) ? 0 : (hand >> idx5) << idx5;
}
I don’t think there is a C++ equivalent, so that’s cool.
Bonus: Check if a number is 0
C++
This can be used if you don’t trust the compiler to not branch, or to produce a 0/1 when doing a boolean comparison. But I’m learning to have more faith…
uint32_t is_all_zeros(uint16_t x) {
uint32_t y = -x;
y = (~y) >> 31;
return y;
}
Note that the type of y has to be larger than the type of the input parameter x.
CUDA
And if, like me, you don’t have that much faith, just profile the code and/or check the compiled instructions. Apparently it doesn’t branch, so a plain return (x == 0); will do.
Putting it all together
Alright, time to write the actual code.
Game representations
As part of the bit arithmetic, everything in the code is also represented as bits (so that I can abuse the cmp-to-int trick).
Hand cards
Each card is represented as a 64-bit unsigned integer, where 52 of the bits correspond to the 52 cards and the rest are unused. I excuse this by claiming that you need to align your data to power-of-2-sized blocks for CUDA to run fast.
1 2 3 4 5 6
idx 0123456789012345678901234567890123456789012345678901234567890123
suit ---[ spade ]---[ club ]---[ diamond ]---[ heart ]
rank ---AKQJT98765432---AKQJT98765432---AKQJT98765432---AKQJT98765432
A few pros of this design choice:
- Cards can just be or'd together to form a lossless representation of a hand.
- I can merge a player's hand and the community cards for easy evaluation.
- I can process a straight by or-ing the 4 16-bit chunks to get a suitless hand.
- I can process a flush by working in 16-bit chunks, and a straight flush by doing the same on the merged suitless 16-bit chunk.
- Because the hand is lossless, I can convert it back to a string for debugging.
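To make the layout concrete, here's a minimal sketch of how the encoding can be built and used. The helper names (make_card, make_hand, suitless) and the exact suit/rank ordering are my own illustration, not necessarily the repo's API:

```cpp
#include <cstdint>

// Suits in diagram order (spade, club, diamond, heart) pick one of the four 16-bit chunks;
// ranks occupy 13 bits inside a chunk, leaving 3 padding bits per chunk.
uint64_t make_card(int suit /*0..3*/, int rank /*0..12*/) {
    return uint64_t{1} << (16 * suit + rank);  // one set bit per card
}

// A hand is just the or of its cards; merging with the community cards is another or.
uint64_t make_hand(const uint64_t* cards, int n) {
    uint64_t hand = 0;
    for (int i = 0; i < n; ++i) hand |= cards[i];
    return hand;
}

// Suitless 16-bit view for straight detection: or the four 16-bit chunks together.
uint16_t suitless(uint64_t hand) {
    return static_cast<uint16_t>(hand | (hand >> 16) | (hand >> 32) | (hand >> 48));
}
```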
And the rest will be clear in the next section:
Hand scoring
Each score is represented as a 32-bit unsigned integer: the first 3 bits classify the hand type, the next 13 mark which ranks belong to said hand, then come 3 unused bits, and the last 13 are the kickers. So, a lot of moving bits around, but you get a score that's easily comparable without any ifs.
1 2 3
idx 01234567890123456789012345678901
hand prf[ hand ]---[ kicker ]
rank AKQJT98765432---AKQJT98765432
Here's the full table:
| Hand type | Prefix | First 13 bits | Second 13 bits |
|---|---|---|---|
| Straight Flush | 0b111 (0xE000) | Top SF rank | All zeros |
| Four of a Kind | 0b110 (0xC000) | Quads rank | Top 1 kicker |
| Full House | 0b101 (0xA000) | Top trips rank | Top pair rank |
| Flush | 0b100 (0x8000) | Top 5 flush ranks | All zeros |
| Straight | 0b011 (0x6000) | Top straight rank | All zeros |
| Three of a Kind | 0b010 (0x4000) | Trips rank | Top 2 kickers |
| Two Pair | 0b001 (0x2000) | Top 2 pairs ranks | Top 1 kicker |
| One Pair | 0b000 (0x0000) | Top pair rank | Top 3 kickers |
| High Card | 0b000 (0x0000) | All zeros | Top 5 kickers |
This scoring scheme accurately ranks hand strengths. For each hand (which has 7 cards), the code evaluates a candidate score for each possible hand type; if that isn't the hand's real type, the returned value is either 0 or smaller than the real score (e.g. a Full House can be read as a Two Pair, but the prefix makes sure the Two Pair score is lower). Then we can just take the maximum of all these candidates as the real hand score.
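As a sanity check of the layout, here's a small self-contained sketch; make_score and the example hands are illustrative, not the repo's actual functions (I'm assuming bit 12 of a 13-bit rank mask is the ace, per the diagram):

```cpp
#include <cassert>
#include <cstdint>

// Pack a candidate score: [ 3-bit type prefix | 13-bit hand ranks | 3 unused | 13-bit kickers ].
uint32_t make_score(uint32_t prefix, uint32_t hand_ranks, uint32_t kicker_ranks) {
    return (prefix << 29) | (hand_ranks << 16) | kicker_ranks;
}

int main() {
    // Aces full of kings vs. aces and kings with a queen kicker: the full house wins
    // purely because of the 3-bit prefix, with no if/else on the hand type.
    uint32_t full_house = make_score(0b101, /*trips: aces*/ 1u << 12, /*pair: kings*/ 1u << 11);
    uint32_t two_pair   = make_score(0b001, /*aces + kings*/ (1u << 12) | (1u << 11), /*queen*/ 1u << 10);
    assert(full_house > two_pair);
    return 0;
}
```

In the real evaluator, each per-type scorer produces such a word (or 0 / an underestimate), and get_max from earlier picks the winner.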
Roll the dice, baby
With all the above helpers, we can now start writing the Monte Carlo code to estimate these probabilities. The algorithm is described in detail in the previous two posts, so I won't rehash it, but here's a brief summary:
- For each simulation, sample 9 hands and the community cards.
- Evaluate hand scores for each player.
- For each player, determine their outcome for an $n$-player game for each $n=2..9$, with the $n-1$ players to their right as opponents (i.e., keep incrementing the player index and wrap back to 0 via modulo when running past the last dealt player).
- Update the win/loss/$k$-way tie count for the canonical hand of each player, and their covariances with their corresponding 2-player outcomes.
Implementation-wise, I think it's actually pretty straightforward. Just keep a vector of unsigned long longs (because it takes less than an hour on my CPU to overshoot 2 billion and overflow the default signed int) to record outcomes for each hand, and increment as needed. The only semi-interesting thing algorithm-wise is a dynamic programming-esque table of size $n\times n$, where the cell at index $(i,j)$ holds the maximum score and its multiplicity (i.e. which $k$-way tie) among the $j$ consecutive players starting from index $i$. That should be descriptive enough…
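Here's a rough reconstruction of that table for the 9 dealt hands (struct and function names are mine; the real code tracks the same two quantities per cell):

```cpp
#include <cstdint>

constexpr int kPlayers = 9;

struct Best {
    uint32_t score;  // maximum score within the window of players
    uint32_t count;  // how many players in the window share that maximum (k-way tie)
};

// best[i][j]: max score and its multiplicity among the j players i, i+1, ..., i+j-1 (mod 9).
// Each cell is built from the cell one player over and one player shorter, so a simulation
// only does O(n^2) comparisons instead of recomputing every window from scratch.
void fill_best(const uint32_t score[kPlayers], Best best[kPlayers][kPlayers + 1]) {
    for (int i = 0; i < kPlayers; ++i) best[i][1] = {score[i], 1};
    for (int j = 2; j <= kPlayers; ++j) {
        for (int i = 0; i < kPlayers; ++i) {
            Best rest = best[(i + 1) % kPlayers][j - 1];
            if (score[i] > rest.score)       best[i][j] = {score[i], 1};
            else if (score[i] == rest.score) best[i][j] = {rest.score, rest.count + 1};
            else                             best[i][j] = rest;
        }
    }
}
```

Player $i$'s outcome in an $n$-player game then falls out of comparing score[i] against best[(i+1) % 9][n-1]: greater is a win, equal is a (count + 1)-way tie, smaller is a loss.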
Parallelization
I thought of parallelizing each operation, but after some consideration I realized that it's actually much easier to just keep each game simulation as an atomic unit and parallelize over those. So I did. There are just a couple of small points to be careful about:
- Set the threads-per-block to 256 per convention, and set the blocks-per-grid so that the code hits 100% GPU utilization.
- Set the number of simulations per thread high enough that the on-device work dominates the overhead costs, including kernel launches and data movement between device and host.
Personally, I opted to launch one kernel at a time, then synchronize to measure the wall-clock runtime and write the estimator's checkpoint to disk. The number of simulations is set so that each kernel runs for around 1-2 s, which is the desired checkpoint interval. I know it's nothing fancy like the blocked matrix multiplications split across threads in those CUDA tutorials, but I'm maxing out the GPU trivially, and shouldn't that be the ultimate goal?
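For concreteness, here's a minimal sketch of that launch-synchronize-checkpoint loop. The kernel name, the checkpoint writer, and the launch constants are placeholders, not the repo's actual names or tuning:

```cpp
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

// Placeholders for the real kernel and checkpoint writer.
__global__ void simulate_kernel(unsigned long long* trackers, int sims_per_thread, uint64_t seed);
void write_checkpoint(const unsigned long long* d_trackers, size_t n_trackers, int step);

constexpr int kThreadsPerBlock = 256;   // the usual convention
constexpr int kBlocks          = 1024;  // tuned so the GPU sits at ~100% utilization
constexpr int kSimsPerThread   = 4096;  // enough work per launch to dwarf launch/copy overhead

void run(unsigned long long* d_trackers, size_t n_trackers, uint64_t seed) {
    for (int step = 0; ; ++step) {          // loop until interrupted
        simulate_kernel<<<kBlocks, kThreadsPerBlock>>>(d_trackers, kSimsPerThread, seed + step);
        cudaDeviceSynchronize();                         // each launch is sized to take ~1-2 s
        write_checkpoint(d_trackers, n_trackers, step);  // copy device->host, dump the estimates
    }
}
```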
But there's one important lesson I learned during this project, and that is memory allocation. I needed to (at the very least) store the trackers for all $n$-player games, which corresponds to $169$ (hands) $\times (3 + \dots + 10)=8788$ variables. Each tracker is a 64-bit unsigned int, because the GPU code is so fast that I exceed the 32-bit limit in less than 10 minutes. That amounts to ~69 KB of per-thread memory, niftily small even if it doesn't fit the L1 cache. So I first decided to give every thread its own copy and accumulate the results only at the very end of each kernel run, since global reads/writes are expensive. However, this backfired on me on many levels:
- memset-ing to zero out the variables at the beginning of the run is costly.
- The total memory for 256 (threads) x 256 (blocks) goes up to ~4 GB, which spills to local memory, which is slooooow.
It's not because of fragmentation or striding; I tried coalescing the accesses, no luck. The alternative is to keep everything in one place and use atomic operations, but that would cause a chokepoint at the end of each kernel run, so I didn't like it at all. I had to try it anyway, though, and it actually made the code 9x faster (!). My only explanation is that the single copy fits cleanly in the L2 cache, so all reads/writes are served from L2, which is blazing fast, offsetting the contention of the atomic ops.
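A minimal sketch of the winning layout: one shared tracker array in global memory, updated with 64-bit atomics (the simulation body and the tracker indexing are stubbed out here):

```cpp
#include <cuda_runtime.h>

// One shared array of 64-bit counters (all 8788 trackers) lives in global memory.
// It is small enough to stay resident in L2, so the atomics end up cheap in practice.
__global__ void accumulate_demo(unsigned long long* trackers, int n_trackers, int sims_per_thread) {
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int s = 0; s < sims_per_thread; ++s) {
        // ... simulate one 9-player game here ...
        unsigned int tracker_idx = (tid + s) % n_trackers;  // stand-in for (hand, n, outcome) index
        atomicAdd(&trackers[tracker_idx], 1ULL);            // 64-bit atomic add on global memory
    }
}
```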
Now, in hindsight, if I had gone hardcore and parallelized within the individual hand-evaluation subfunctions, I might have been able to side-step all these issues. But then again, little effort for 100% utilization; I'd still chalk this up as an absolute win.
Bait-and-Switch 2: Electric Boogaloo
After the whole dissertation on control variates last post, I tried implementing it, and it slowed my code down by 3x. Basically, we ran into the same problem as above: allocating a very large per-thread array spills it to slow local memory. So I ran the code regardless for a short while and measured the (co)variances to see how much CV would benefit us. Specifically, I estimated the ratio of the controlled variance to the original variance, which, if you remember, is $\mathrm{Var}[\hat{\mu}_{\mathrm{cv}}]/\mathrm{Var}[\hat{\mu}] = 1 - \rho^2$, with $\rho$ the correlation between the outcome and its 2-player control variate.
The reciprocal of this ratio gives us the theoretical effective-sample-size multiplier when we apply CV, assuming independent samples (the estimator's variance scales inversely with the sample count). And well, for the 3-player win probability we get 2.01x effective samples, and for the 9-player win probability we get 1.14x. Compared to the 3x increase in computation time… I'd rather save myself the pain and cut my losses. So there isn't any control variate applied in the code. Oops.
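For reference, here's a tiny sketch of that effective-sample-size calculation; the numbers are made up for illustration, not the measured ones:

```cpp
#include <cstdio>

// Effective-sample-size multiplier 1 / (1 - rho^2), where Y is the n-player outcome
// and X is its 2-player control variate, estimated from the tracked (co)variances.
double cv_effective_samples(double var_y, double var_x, double cov_xy) {
    double rho2 = (cov_xy * cov_xy) / (var_x * var_y);
    return 1.0 / (1.0 - rho2);
}

int main() {
    // Illustrative numbers only: a correlation of ~0.71 gives roughly 2x effective samples.
    printf("%.2fx effective samples\n", cv_effective_samples(0.25, 0.25, 0.177));
    return 0;
}
```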
Future works
I think I got a pretty decent codebase now. There are a couple of things that I didn't implement, because they're not essential to the goal (which is to demonstrate to recruiters that I know how to write CUDA), namely:
- Extend the code to work in real-time for a partial board (e.g. post-flop).
  - All that's missing is a data structure to store the results of each state in the MC tree exploration, which should be pretty simple to add.
  - Out of scope as I'm not making full-fledged software here… unless?
  - WASM would work pretty nicely, and apparently WebGPU is a thing now…
- Experiment to find the best (fixed) set of CVs for the ultra-low-frequency $k$-way ties.
  - This is just a typical ablation study and hyperparameter selection; I've done this more often than I'd like in my PhD journey already.
  - This would also be just for kicks, because we established that CV is not worth it…
Bonus: Dress the part
Well, it can't be a complete project without a GUI showing live updates! — said my perfectionism. So I wrote a GUI that fetches result checkpoints from the server and displays live estimate updates. After the simulation is terminated, the result is stored in a JSON file that is served statically. The source code is also in the repository above, and you should check the website out.
Thank you for reading this far! Have fun, and gamble responsibly.