Data Analysis #2: What are the real win rates by BR, and how good is Thunderskill really?

The question for today: what does a recent massive data scrape of all Ground RB data tell us about previous attempts to assess win rates in that mode? How close were Thunderskill and the War Thunder Data Project to the actual reality? And where they missed, what are the reasons?

A little background. Thunderskill.com is a Russian-hosted website that collects opt-in player data from War Thunder service records. It's been around a long time, but hasn't been updated as a website in quite a while. More recently, another website, War Thunder Data Project (link), has been using Thunderskill API access to retrieve Thunderskill data every few days and represent it as a heat map of win rates by battle rating and nation. While many people have raised questions about the representativeness of a sample of data gathered this way, there was never much in the way of data to compare theirs to, as Gaijin guards their aggregated data quite well.

That guard slipped a little in January 2024, when an anonymous downloader associated with Russian War Thunder content creator K2 Kit Krabiwe used the publicly available server replay feature to pull approximately 72 hours of data, from January 4 to 7, comprising every game played in that period in the Ground RB mode. Krabiwe did a couple of episodes on it, and the data, comprising the game files of over 100,000 separate games, was made widely available and seems to check out. The actual win/loss table for those three days looks like this.

Spoiler

K2 scrape data on national winrates by BR

This cross-sectional sample allowed a unique opportunity (Gaijin has taken steps to prevent it happening again) to figure out how accurate Thunderskill and WTDP have been all along. And the answer is… yeah, not very.

The comparable table in WTDP for Jan 7 looks like this.

Spoiler

On the surface there seem to be some obvious similarities. US high-tier performance is low in both, for instance. But a closer look sees a lot of variation. Given that in the K2 scrape we now have the actual data (all of it!) that WTDP was guessing at, what's going on here? And since we aren't gonna get another K2 scrape, how much can this comparison tell us about whether we can trust WTDP going forward as a substitute?

Part of the issue lies in presentation. One graphic has the high BRs at the bottom, the other at the top, the country order is different, and the color scheme is quite different. The midpoint for the heatmap that WTDP uses is also a little hard to parse, with 55% being the neutral color as opposed to 50%. The first step here would seem to be to look at the two data sets in a comparable presentation.

When you break out the WTDP data and the K2 scrape data using the same color grade and other presentation aspects, they end up looking like this.

Spoiler

The sample: Win Rates, WTDP Jan 7

The actual: Win Rates, K2 Jan 4-7
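(If you want to redo this kind of re-grading yourself, a minimal sketch of the idea is below: plot a nation-by-BR win rate table with the neutral color pinned at exactly 50%. The table contents, nation names, and BR steps here are placeholders, not the real data.)

```python
# Minimal sketch: render a nation-by-BR win rate table as a heat map whose
# neutral (middle) color sits at exactly 50%, so both sources share one scale.
# All values, labels, and sizes below are placeholders, not the actual data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import TwoSlopeNorm

win_rates = pd.DataFrame(
    np.random.uniform(0.40, 0.60, size=(5, 4)),      # fake win rates in [0, 1]
    index=[11.7, 11.3, 11.0, 10.7, 10.3],            # BRs, high tier at the top
    columns=["USA", "Germany", "USSR", "France"],
)

norm = TwoSlopeNorm(vmin=0.35, vcenter=0.50, vmax=0.65)   # 50% is the midpoint
fig, ax = plt.subplots(figsize=(6, 4))
im = ax.imshow(win_rates.values, cmap="RdYlGn", norm=norm, aspect="auto")
ax.set_xticks(range(len(win_rates.columns)))
ax.set_xticklabels(win_rates.columns)
ax.set_yticks(range(len(win_rates.index)))
ax.set_yticklabels(win_rates.index)
fig.colorbar(im, ax=ax, label="Win rate")
plt.show()
```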

It can be hard to figure out what’s changed shifting the eye back and forth between two graphics and over so many boxes. A simplified side by side comparison, broken out by rank brackets* and countries, looks like this.

Spoiler

There are definitely some similarities, but some differences, too. Why is that?

To understand this, first we need to know a little more about the two sources of data and how they work.

The K2 scrape is granular down to the level of player and battle. Each player in each replay is given a BR value (and nation value) for that match based on the top BR (and nation) of the vehicles, air and ground, that they actually played. For our purposes, we'll call each match for each player one "flyout", to use the air term. And when you crunch 72 hours of all flyouts that way, what it shows is a ground RB player base significantly focused on the top tier of the game:

Spoiler
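(For anyone digging into the files themselves, here's a rough Python sketch of that flyout bookkeeping. The field names and vehicle entries are invented for illustration; they are not the actual replay format.)

```python
# Sketch of the per-player "flyout" assignment described above: each player in
# each replay counts once, at the nation and top BR of the vehicles they
# actually spawned. Names and numbers below are hypothetical.
from collections import Counter
from dataclasses import dataclass

@dataclass
class VehicleUse:
    name: str
    nation: str
    br: float

def flyout_for_player(vehicles_used):
    """Return (nation, BR) of the player's flyout: taken from their top-BR vehicle."""
    top = max(vehicles_used, key=lambda v: v.br)
    return top.nation, top.br

# Toy stand-in for parsed replays (a replay = list of per-player vehicle lists).
all_replays = [
    [
        [VehicleUse("tank_a", "Germany", 10.0), VehicleUse("jet_b", "Germany", 10.3)],
        [VehicleUse("tank_c", "USA", 9.3)],
    ],
]

flyout_counts = Counter()
for replay in all_replays:
    for vehicles_used in replay:
        flyout_counts[flyout_for_player(vehicles_used)] += 1

print(flyout_counts)   # Counter({('Germany', 10.3): 1, ('USA', 9.3): 1})
```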

It also showed a significant weighting of the player base toward the "big three" countries. Using data bars to show all the ground RB flyouts by nation and BR that were actually played in the K2 scrape period gives us this:

Spoiler

This was of course during a major scored event. To no one's surprise, a lot of players were playing lower-tier rank VII, where the maximum event multipliers resided. This, combined with players favoring the big three nations, meant there were more than twice as many German 10.3 flyouts alone (266,147) as there were all of rank I flyouts combined (117,170).

There were also a number of BR/country combinations with no flyouts at all, or very few. This is at least partly due to lineup holes. The combinations that were left blank in the first table of this data people have seen are shown here in red text. (There are a few others they probably could have removed for really low data, including a few of the most apparently favorable French BRs; I have underlined those sus results on the previous side-by-side comparison chart.)

What is the effect of having a lot of players playing the same thing? Statistically we can show a small correlation between high player numbers in a given BR and country and a lower win rate. The obvious explanation is that players move to the "smaller" countries later in their player lifetime, so they're more experienced going into that part of the tree for the first time, whereas the more heavily populated areas will inevitably contain a higher proportion of less experienced players. We can see how big this effect is by plotting flyouts per BR against win rate for that BR (here on a vertical log scale to make it easier to analyse). It's subtle, but it's definitely there:

Spoiler
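(If you want to replicate that check, something like the sketch below works. The per-cell numbers here are purely illustrative; you'd feed in the real flyout counts and win rates from the K2 tables instead.)

```python
# Sketch: test whether BR/nation cells with more flyouts tend to have lower win
# rates, and plot flyouts on a vertical log scale. Numbers are illustrative only.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

cells = pd.DataFrame({                       # one row per nation/BR combination
    "flyouts":  [250_000, 120_000, 45_000, 8_000, 1_500],
    "win_rate": [0.47, 0.49, 0.51, 0.53, 0.55],
})

r, p = pearsonr(np.log10(cells["flyouts"]), cells["win_rate"])
print(f"correlation of log10(flyouts) vs win rate: r={r:.2f} (p={p:.3f})")

plt.scatter(cells["win_rate"], cells["flyouts"])
plt.yscale("log")                            # vertical log scale, as in the chart
plt.xlabel("Win rate")
plt.ylabel("Flyouts (log scale)")
plt.show()
```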

This would tend to explain a large part of why the first three columns of both the data graphs are redder than the rest, most obviously in this case Germany 10.3 and Russia 10.0: more players playing means a lower win rate. (Some of the rest of this difference is still going to be due to better lineups, with better vehicles, too, of course. Some nations do actually perform better than others, as well.) So it seems absolutely legitimate to conclude that the US really was underperforming at top tier here… just maybe not quite as much as shown.

You may think the K2 scrape data seems more green overall than red. This is also an effect of those larger populations of flyouts on the left side performing worse on average. When you actually factor in the number of flyouts involved, the combined win rate on the entire K2 graph, across all the BR/nation combinations put together, is .5009, which is pretty much what you'd expect if the sample really is complete and the computer analysis of it sound.
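(For anyone checking the arithmetic: that combined figure is a flyout-weighted average, not a simple mean of the cell percentages. A toy illustration of the difference, with invented numbers:)

```python
# The combined win rate is the flyout-weighted mean over all BR/nation cells.
# The three cells below are invented just to show why weighting matters.
import numpy as np

flyouts   = np.array([250_000, 40_000, 5_000])   # games per cell
win_rates = np.array([0.48, 0.52, 0.56])         # win rate per cell

simple_mean   = win_rates.mean()
weighted_mean = np.average(win_rates, weights=flyouts)

print(f"unweighted mean of cells: {simple_mean:.4f}")    # 0.5200, pulled up by tiny cells
print(f"flyout-weighted mean:     {weighted_mean:.4f}")  # 0.4868, dominated by big cells
```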

Anyway, that's the reality. Turning to the WTDP data from the same period, Jan 4-Jan 7, we see a couple of things off the bat. The data seems almost suspiciously smooth and complete. There are fewer drastic shifts from one BR to another. It doesn't include any data at all for BRs 1-1.7. There are no holes, even in areas where we know from the K2 data there were almost no games played. Could we have predicted the reality of K2 from this depiction? In some places… again, US top tier was so bad in this period that the effect pushes through. But conversely, low-tier China-Italy-France seems quite off. Why is this? And can WTDP, knowing what the real data was for this period, correct what they do to make what they're putting out better?

So there are a couple of things going on here. The issue is not, actually, that the subset of Thunderskill players is unrepresentative as players. It lies largely in the imperfect data collection. So we need to understand a little more about how that works.

Thunderskill is largely an abandoned site, it must be said. It's a black box, and no one is in a place to change much about its inner workings. The WTDP people are scraping its data from an open API, but they have no influence over how it's collected. Here's how that works. When you log in to Thunderskill and press the button to update your stats (or someone else logs in, visits your user page and updates your stats for you), it uses its API access to your Gaijin account to scrape any changes off the vehicle stats pages of your in-game player record and adds them to your Thunderskill record and the record for that vehicle. It only does this when you choose to visit the site. The data has no other date on it; all that is recorded is the change in the number of flyouts, wins, and kills (air and ground) for that vehicle since the last time you updated Thunderskill.

After a certain point (the site claims a month) old data is dropped from vehicle page calculations. The games and wins are removed, and the vehicle scores are recalculated. This is important: a player can log in after a year and add a year’s worth of data to a vehicle and that data will change its stats as if that year’s worth of games all took place today. But then roughly a month later all that data is removed and the battles and wins from that period of that vehicle’s record go away.
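(To make that concrete, here's a toy model of the mechanism as I understand it. Thunderskill is a black box, so this is my reading of its behavior rather than its actual code, and every name, date, and number below is invented.)

```python
# Toy model of the Thunderskill retention behavior described above: deltas are
# stamped with the day the player pressed "update", not the day the battles
# were fought, and everything older than ~30 days is dropped at once.
import datetime as dt

WINDOW_DAYS = 30                       # the site's claimed retention window

class VehicleRecord:
    def __init__(self):
        self.updates = []              # list of (update_date, battles, wins) deltas

    def player_update(self, today, battles_since_last, wins_since_last):
        # Even a year of untouched games lands here stamped with today's date.
        self.updates.append((today, battles_since_last, wins_since_last))

    def stats(self, today):
        cutoff = today - dt.timedelta(days=WINDOW_DAYS)
        recent  = [(b, w) for d, b, w in self.updates if d >= cutoff]
        battles = sum(b for b, _ in recent)
        wins    = sum(w for _, w in recent)
        return battles, (wins / battles if battles else None)

rec = VehicleRecord()
rec.player_update(dt.date(2024, 1, 5), battles_since_last=400, wins_since_last=230)
print(rec.stats(dt.date(2024, 1, 7)))    # (400, 0.575): old games look "current"
print(rec.stats(dt.date(2024, 2, 10)))   # (0, None): a month later it all vanishes
```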

WTDP then pulls, by API, the last-month stats for each vehicle as given on the site for that day, and adds another level of data smoothing. They don't capture the change; they just snapshot the data from the Thunderskill site as of that day, every two to three days. So the battles for a vehicle in their data for Jan 4 to 7 aren't the actual battles that were fought Jan 4-7; they're not even the battles that were fought Dec. 7 to Jan. 7; they're all the battles that were recorded on the system by a user checking their stats in roughly the last month, regardless of when they were fought.

WTDP then adds ANOTHER level of data smoothing, by placing that vehicle's stats in the 4 BRs it could have been fighting in (full uptier to full downtier): it adds together the Thunderskill win rates for all ground vehicles that a player could have played at that BR, weighting the final result by the number of Thunderskill games recorded for each vehicle. Because they assume the lowest BR of any game is 2.0, they have no results for BRs 1.0-1.7. The 2.0 bracket contains all vehicles 1.0-2.0, the 2.3 bracket contains all vehicles 1.3-2.3, and so on. Driving out a vehicle at a lower BR than the rest of your lineup is not considered; this method effectively just moves those flyouts to the lower BRs where you "should have" driven it out.
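(A small sketch of that bracketing, as I understand it. This is my reconstruction of the WTDP method, not their code, and the vehicle stats are invented for illustration.)

```python
# Sketch of the WTDP bracketing described above: a vehicle at BR X is counted in
# every bracket from X up to X + 1.0, so the bracket at BR Y is the games-weighted
# win rate of all vehicles with BR in [Y - 1.0, Y]. Vehicle numbers are invented.
from dataclasses import dataclass

@dataclass
class VehicleStats:
    br: float
    games: int
    wins: int

def bracket_win_rate(bracket_br, vehicles):
    """Games-weighted win rate of every vehicle that could appear at this bracket."""
    pool  = [v for v in vehicles if bracket_br - 1.0 <= v.br <= bracket_br + 1e-9]
    games = sum(v.games for v in pool)
    wins  = sum(v.wins for v in pool)
    return wins / games if games else None

vehicles = [VehicleStats(1.0, 500, 260), VehicleStats(1.7, 300, 140), VehicleStats(2.0, 200, 95)]
print(bracket_win_rate(2.0, vehicles))   # 0.495: all three vehicles fall in [1.0, 2.0]
print(bracket_win_rate(2.3, vehicles))   # 0.470: only the 1.3-2.3 vehicles count here
```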

This leads to the artificial smoothing we see. If the vehicles you could have used at a given BR step don't vary between that step and the one above or below it, this will produce the exact same number on the WTDP graph in the next box up. (You can remove a little of this smoothing yourself, and see only the areas where they actually have vehicle data, by changing the BR interval to zero on the site.)

So the important thing to remember is that the number of games that Thunderskill or WTDP puts online for a given date actually covers the last month of new player reports, not the period since the last graph. This inevitably introduces considerable lag in their results. Some of the underlying data is from previous patches, previous updates even. This is why you hear people saying "vehicle x's performance hasn't changed in six months". Given the base data, it can't, not really. Change in WTDP, in addition to being obscured by their spreading results over 4 BRs, lags change in the game by a considerable amount. Combined with the use of a non-standard midpoint to get equal amounts of red and green on their depictions, as mentioned above, this leads to a rather misleading final table.

Another factor is, of course, air vehicles, both helos and airplanes. Those aren't factored into WTDP ground performance at all, because the player service record Thunderskill relies on puts those results in the air tabs, not the ground tabs. This has the double effect of making most Thunderskill and WTDP air data even less useful than the ground vehicle data (because it counts ground and naval CAS use equally with actual air AB/RB games), and of pushing win rates lower for countries in proportion to how much that country relies on its air game. At best you could say it's giving you an estimate of how good that country is at that BR if it were to take away all its air vehicles. Of course, some countries have pretty good air vehicles, so removing them from the calculation entirely is going to have an impact. This is maybe more forgivable for fixed-wing planes, where the data is split in the original Gaijin player record in awkward ways. It's less forgivable for helos, though, which are currently only available in ground modes but still aren't factored into Thunderskill data or WTDP's national win rates by BR in any way.

(Yet another factor, mentioned previously, is players playing under-BR'd vehicles. The wins and losses of those who take a Puma or a BT-5 into a higher-BR game are, in the WTDP projection, effectively moved by this method over to the BRs that vehicle "should" have been playing at.)

Note none of this has anything to do with whether Thunderskill players are a "representative sample" of the player base. If it were actually a representative sample of just hardcore players, that data could still have considerable value: you could adjust down for the "average player" across the board. But it's not. The original Thunderskill method, and its opacity about when battles and wins are being removed from its records, adds a HUGE level of fuzziness even to using it to figure out "hardcore" player metrics. And then the WTDP depiction adds even more fuzziness: it hides that these are snapshots of a black-box database, with a long tail on each data point stretching over an indefinite period into the past; it has to exclude helos and CAS; and it smooths data over the 4 BRs that the vehicle could have been playing at when it got that win or loss.

So what's the bottom line? WTDP's use of Thunderskill data is still able to detect some high-level gross effects (it's remarkable it actually UNDERstated the real poor performance of US top tier in this three-day period, but there's that lag effect again). It does successfully show the effect of large player populations on average win rate. But the comparison of what it was predicting with the reality of the K2 scrape shows it can't really show us anything more granular with any real precision, and a lot of the smoothing they're applying is hiding that uncomfortable fact more than it's adding any analytical value. The results they offer lag game changes, even entire updates. Since it's abandonware, Thunderskill itself is also unlikely to survive any significant change to player records by Gaijin, as has been promised in the roadmap for later this year. And we're unlikely to get another K2 scrape, because Gaijin has locked the player replay system down to prevent it. If players are going to have any real independent data going forward, new approaches will need to be tried.

The K2 scrape itself, however, remains a rich mine. The data has been made publicly available and there’s a lot there. A lot of interesting questions would benefit from being run through that snapshot in time and mode. (An example would be, “is one-death leaving actually a good strategy?” Or “how much of an effectiveness penalty do you suffer if you don’t fly planes?”). It will still be relevant for quite a while. Amateur data scientists should grab a copy if they can.

One final note: Thunderskill is abandonware now. No one is quite sure who has backend access to your account credentials, or who will in the future. If you are on Thunderskill, or planning to sign onto it, please please put two-factor authentication on your game account, as well as the email address you’ve associated to it.

See also my previous post: Data Analysis: what is the actual average player's score per mode? (and how long will it take you to do this event)

*rank brackets are approximate. I’m well aware some vehicles are outside their default rank range. For this I used as rank approximations: Rank I: 1-2.3, Rank II: 2.7-3.7, Rank III: 4-5.3, Rank IV: 5.7-6.7, Rank V: 7-8.3, Rank VI: 8.7-9.7, Rank VII: 10-11.3, Rank VIII: 11.7. It’s just groupings to simplify the results a bit, the exceptions don’t really matter.
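(If you're grouping the data the same way, a tiny helper that implements that approximation:)

```python
# Map a BR to the approximate rank bracket used in the comparison charts above.
RANK_BRACKETS = [            # (max BR of bracket, rank label)
    (2.3, "I"), (3.7, "II"), (5.3, "III"), (6.7, "IV"),
    (8.3, "V"), (9.7, "VI"), (11.3, "VII"), (11.7, "VIII"),
]

def approx_rank(br):
    """Return the approximate rank label for a battle rating."""
    for max_br, rank in RANK_BRACKETS:
        if br <= max_br:
            return rank
    return "VIII"            # anything above the table goes in the top bracket

print(approx_rank(10.3))     # VII
```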
