AB Test Tips

AB Test Design

  • Tuning vs Testing - At a high level, AB Tests are suitable for checking whether an idea is good/bad or works/doesn’t work, i.e., binary results. AB Tests are not great for tuning or optimizing game balance. We rarely see clear winners in AB Tests - in most cases, results are inconclusive, mixed, or, if the test is repeated, point to a different winner. AB Tests sound good in theory but often don’t live up to the hoped-for results in practice. If you see a win - great - go with it. If you find no clear winner, either stay with the Control or go with the condition you feel brings the most value to the game (trust your instincts).
  • Avoid overlapping AB tests - When planning your AB Test roadmap, do not run multiple tests on the same players simultaneously. You want to avoid contamination, which could affect results or their interpretation. If you must run tests concurrently, use segmentation to target them at separate groups of players, but keep in mind that with fewer players per test, you may have to run the tests longer to gather sufficient data to analyze results.
  • Hypothesis - With AB Tests, we are trying to mimic scientific experiments, much like a scientist in a lab. You should start every experiment with a hypothesis - a testable statement rooted in experience. For example, “Increasing the interstitial cooldown will lead to longer sessions.” Next, we need to clearly define what levers in your game (independent variables, e.g., interstitial cooldown timer) you will adjust and what results (dependent variable, e.g., session length) you will measure. With AB Tests, we aren’t fishing for results - we are running a carefully thought-out experiment to understand how things are related in our game.
  • Sample Size Calculator - Use a sample size calculator to determine how many players you will need in each variant - run the test long enough to acquire those players and their data, then stop the test as soon as you do. We need statistically valid results to avoid drawing incorrect conclusions from random chance and to ensure we would likely see the same results if we repeated the experiment; for that, you need enough data. At the same time, AB Tests are time-consuming, expensive to run, and carry an opportunity cost: while testing, you aren’t doing other work. Running a test longer than necessary, paradoxically, won’t give you more information or better results. If you don’t find statistical differences with the calculated sample size, call it inconclusive and move on to your next big idea. (A minimal sketch of the sample-size calculation appears after this list.)
  • Statistically significant vs. meaningful - When running a test, we look for two things: statistically significant results and meaningful differences. With AB Tests, we are hoping to find big changes. For example, a Variant’s D7 retention of 5.02% might be statistically significantly higher than the Control’s 5.00%. While 5.02% is better than 5.00%, do we care, and is it worth our time? Tests that find, at best, marginal improvements are not going to move the needle, and those marginal improvements may disappear as you make other changes to your game or as your player mix shifts.
  • Parallel vs. Serial AB Test - When setting up tests, try using a parallel design where you simultaneously test multiple Variant(s). Sometimes, this isn’t possible, and you must run a serial test where each treatment happens sequentially. The challenge with serial testing is that these take longer, have more risk of market conditions changing, and may overlap releases, holidays, or other factors that can affect results during one test phase. If you must run a serial test, start by collecting your Control data, Variant A, Variant B, … and then rerun the Control. Control data before and after your Variant(s) gives you a baseline for detecting market shifts. If your second Control is higher or lower than the first Control, consider this when analyzing and interpreting the in-between Variant(s).
  • Correlation vs Causation - Correlation means two things move in a related way. Causation means one factor causes another to change. Without AB Tests, we don’t know whether the results we see are spurious, caused by confounding factors, or represent actual differences. With AB Testing, we are searching for causal relationships - if we change one value, we see a specific behavior change. We should avoid drawing too many conclusions, because an AB Test won’t explain why the result is different. Sometimes we don’t care why something happens, only that it does. However, because we don’t know why, there may be confounding factors we didn’t consider that affect the results. Remember that AB Tests aren’t perfect and should only be a general guide that you might be on the right track.
  • Mobile game challenges - AB Tests mimic what a scientist might do in a lab; however, our players aren’t in a lab, and with mobile games, we have some unique challenges to keep in mind:
    • Noisy Data - Players may multitask while playing the game, be distracted, leave the game unattended, answer phone calls mid-level, or do unexpected actions while playing. All these factors contribute to variability in player behaviors, and we see this in the data. This noise can complicate identifying differences between Variant(s) and picking “winners.”
    • Missing/Delayed Data - With mobile games, we typically see event data loss across players. Data loss can be caused by game crashes, internet connection issues, players churning before event data is uploaded, etc. While games usually fire events immediately when a player does something, most games cache those events and upload them to the backend in batches every 1-10 minutes, depending on the backend provider. If a player shuts down their game, the cached events may not upload until the next time they start a new session. As a result, when you analyze a test, you may have missing data.
    • Cheaters - Cheaters in games are a fact of life, and through cheating, players can generate very anomalous data. During analysis, carefully look for outliers in your data and determine whether cheating is a likely cause. Filter out cheaters so they don’t bias your analysis, but be careful: not all outliers are cheaters - they could also be your whales and very engaged players. (A minimal outlier-flagging sketch appears after this list.)
    • Player reinstall, reset, offline mode - Unfortunately, you have little control over your players and their devices. When players reinstall their game, reset progress, or play offline, you may see odd patterns in your data. With reinstalls or game resets, players may appear to have played levels twice. If a player reinstalls, they might be reassigned to a different variant. You might see gaps in their event stream if a player is offline. All these can lead to bias, skew, and challenges when analyzing your data.
    • App releases/updates during AB Test - When you release app updates, you usually have little control over when players update on their devices. When running tests, you will likely need to target your tests to only players with the supported app version installed. If you release any updates during a test, that can affect player behaviors and complicate analysis.
    • Release effect - App releases lead to KPI bumps that have nothing to do with actual changes in your game. These release bumps can easily be misattributed to the feature being tested and contaminate the AB Test analysis.
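
Below is a minimal sketch of the sample-size step described under “Sample Size Calculator” above. It uses statsmodels power analysis; the baseline D7 retention and minimum detectable lift are illustrative assumptions, not recommendations - plug in your own numbers for each KPI you plan to test.

```python
# Sketch: players needed per variant to detect a lift in a proportion KPI
# (e.g., D7 retention). Baseline and target below are illustrative assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05   # Control D7 retention (assumed)
target = 0.06     # smallest lift worth acting on (assumed)

effect_size = proportion_effectsize(target, baseline)  # Cohen's h

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,   # significance level
    power=0.80,   # probability of detecting a real effect of this size
    ratio=1.0,    # equal group sizes
)
print(f"~{round(n_per_variant):,} players per variant")
```

For these example numbers the answer is on the order of 4,000 players per variant; divide by your daily install volume (see the 7-day Rule below) to estimate how long the test must run.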
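
And a minimal sketch of the outlier screening mentioned under “Cheaters” above, assuming a pandas DataFrame with one row per player and a hypothetical revenue column; the 99.9th-percentile cutoff is an arbitrary starting point, and flagged players should be reviewed rather than dropped blindly.

```python
import pandas as pd

def flag_outliers(players: pd.DataFrame, column: str, pct: float = 0.999) -> pd.DataFrame:
    """Flag players above an extreme percentile of the given column.

    Review flagged players before excluding them: the tail contains whales and
    very engaged players as well as cheaters.
    """
    cutoff = players[column].quantile(pct)
    return players.assign(is_outlier=players[column] > cutoff)

# Tiny illustrative frame; in practice this is one row per player in the test.
players = pd.DataFrame({"player_id": [1, 2, 3, 4],
                        "iap_revenue": [0.0, 4.99, 9.99, 2500.0]})
print(flag_outliers(players, "iap_revenue"))
```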

Target Audience

  • Holdout groups - If you run a series of AB Tests, one strategy is to create a “holdout” group that is excluded from all AB tests. The holdout group keeps the older settings and provides a baseline for comparison. As you run each sequential AB test, see how performance compares with the holdout and monitor how the holdout changes over time. This holdout is not your Control; it represents the experience before you started testing. A holdout group can help you account for market shifts, seasonality, etc. If you run a sequence of AB tests and each time think you have a win, compare those wins with your original holdout group. If the holdout has also improved, the “wins” might not relate to your setting changes but result from external factors. If the holdout has declined over time, your market may have shifted, or you have different player types.
  • New players, Veteran Players, or All Players - You almost always want to experiment with only New players. They haven’t seen your game before, so they have no previous biases or expectations that might affect how they react. When shown a new experience, existing players may respond negatively to the change, even if the change is technically a better experience. You do not want to annoy your more valuable veteran players. If you find a win with New players, you can carefully test that change with a small subset of existing players to see how they react. So unless there is a special situation, do not run tests with All players. If you are set up for User Segmenting, you can run AB tests targeting specific user groups. For targeted AB Testing with existing users, use a large Control group and small Variant(s) groups to lessen any negative impacts.
  • Paid vs. Organic - “Paid users” (acquired through User Acquisition campaigns) can have very different quality depending on the source, and typically have a lower value than organic users. When possible, try to run your tests with organic users to remove the outside influence of UA quality. If you must include users from UA campaigns, try to use higher-quality sources or talk with your UA Growth Team to ensure they do not change UA sources during your test.
  • Choosing the Country - Different countries may respond differently to your tests, and with most mobile games, a handful of countries will represent most of your income. Often, we will run AB Tests in alternate countries that are a good proxy for players in our more valuable countries. Find a secondary geo where players have similar behaviors, run your tests there, and then roll the wins out to all countries. Should the test have adverse effects, you aren’t exposing them to your more valuable geos.

Choosing Variant Groups

  • AABB Tests - An often recommended strategy is to run AABB tests. In this test design, we have multiple groups with the same settings. Technically, two groups with the same values should have the same results (revenue, progression, retention, etc.). If two groups with the same settings show very different results, that quickly tells you there is high variability in your audience, and you will need to be cautious in drawing any conclusions. For example, you might set up an experiment with Control (25%), Control (25%), Variant A (25%), and Variant B (25%). First, compare the difference between your two Control groups. In this example, maybe you see a 10% difference in their revenue. When comparing Variant A and B against Control, you should then disregard any revenue differences smaller than the observed Control-vs-Control difference (10%) as noise. This strategy provides a quick “sanity check” on your experiment: if the Control groups are similar, you can better trust any differences observed between the Control and your Variant(s). (A minimal sketch of this sanity check appears after this list.)
  • Impact of different group sizes - Use balanced group sizes across your Variant(s), e.g., three groups with 33%, 33%, and 34%. While you can technically run an experiment with unequal sizes, e.g., 10%, 30%, and 60%, unbalanced sample sizes can affect statistical tests. If your group sizes differ substantially, use statistical tests that are robust to unequal sample sizes.
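
A minimal sketch of the AA “sanity check” described above, comparing the two Control groups on a proportion KPI (D1 retention here, with made-up counts) before trusting any Control-vs-Variant gaps.

```python
from statsmodels.stats.proportion import proportions_ztest

retained = [1480, 1512]   # D1-retained players in Control 1 and Control 2 (made up)
exposed = [5000, 5000]    # players assigned to each Control group

stat, p_value = proportions_ztest(count=retained, nobs=exposed)
print(f"AA check p-value: {p_value:.3f}")
# A small p-value (or a large raw gap) between two identical groups signals high
# variability: treat Control-vs-Variant differences of similar size as noise.
```

For continuous KPIs (revenue, session length) with unequal group sizes, a common choice is Welch’s t-test (scipy.stats.ttest_ind with equal_var=False) or a rank-based test, in line with the notes under Analyzing Results below.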

Choosing Variant Values

  • Test Extreme Values - When defining your Remote Configuration values for each Variant, include some extreme values. For a cooldown, choosing 5, 10, 20, 50, and 100 is better than 5, 10, 15, and 20 - you want to pick values that bound (below and above) the theoretical best value. In this example, maybe the theoretical best value is 30. With the low values (5 … 20), we might conclude that 20 is the winner. With the values spanning 5 … 100, we might still see 20 as the winner, but we might also see that 50 was our second best, telling us that the best value might not be 20, but higher. Extreme values set you up for better follow-up tests.

Integrating your AB Test

  • Ensure instrumentation is correct and AbCohort() called - When using LionAnalytics and Lion Studios tools to analyze results, players’ events must be properly stamped with their experiment and variant names. If you are using LionSDK and Satori, this is automatic. If you are managing your AB Tests with Firebase or another solution, you need to fire the AbCohort() method at the start of every session to flag which experiment(s) the player is participating in and their variant name(s). (An illustrative sketch of the idea appears after this list.)
  • Ensure all remote config variables are hooked up and verified - AB Tests work by issuing different values for existing Remote Configuration variables. Before running a test, ensure your Remote Configuration variables are hooked up in code, work, and support the values you will test.
  • Ensure QA has tested all variants for any problems - During the QA phase, before releasing your game, talk with your QA team - explain the AB Test and show them how to selectively test each Variant to ensure the user experience works as intended. We’ve seen tests where a specific variant was buggy and the problem was only discovered during the analysis phase. That led to wasted time, lost players, and repeating the experiment once the bugs were fixed.
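
The sketch below is illustrative only - it is not the LionSDK, Satori, or Firebase API, and every name in it is hypothetical. It just shows the underlying idea referenced above: resolve cohort assignments once at session start (where you would fire AbCohort()) and stamp every analytics event with the experiment and variant names.

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    ab_cohorts: dict[str, str] = field(default_factory=dict)  # experiment -> variant

def start_session(assignments: dict[str, str]) -> Session:
    # Conceptually where AbCohort() would be fired with the player's cohorts.
    return Session(ab_cohorts=dict(assignments))

def log_event(session: Session, name: str, **params: object) -> dict:
    # Every event carries the cohort stamp so analysis can split by variant.
    event = {"event": name, **params, "ab_cohorts": session.ab_cohorts}
    # analytics.send(event)  # hand off to your analytics SDK here
    return event

session = start_session({"interstitial_cooldown_test": "variant_a"})
print(log_event(session, "level_complete", level=12))
```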

Running your AB Test

  • 7-day Rule - Mobile games show strong weekly seasonality. Install volume varies wildly between Tuesdays (usually a low) and Saturdays (usually a high), and players behave differently depending on which day they install. Weekend warriors (who only play on weekends) can differ greatly from players who play daily. When running AB Tests, include at least 7 days of install cohorts and follow each cohort for a minimum of 7 days. Always use 7-day increments both for how many install cohorts you include and for how many days of post-install data you collect and analyze.
  • Avoid Holidays or other Special Days - Holidays and days with special events might have higher or lower KPIs than expected. Players may show different behaviors on those days, which can skew or bias your results unexpectedly. If a holiday falls within your test, first check if it is anomalous - if yes, then you may need to drop that data during analysis to avoid skewing your results.
  • Understand seasonality - There are well-known trends in player behavior over the week, the month, and the year. For example, IAP revenue usually dips near the end of the month (before payday) and is high at the start of a month (after payday). Over a year, mobile game KPIs tend to be lower early in the year and during the summer; players tend to be more active in spring, fall, and over Christmas. We also typically see quarterly cycles in ad eCPMs and UA costs tied to corporate quarterly spending patterns, which can influence apparent LTV and the cost of players.
  • Run long enough - Before starting a test, use a sample size calculator to estimate how many players, at a minimum, you need to get significant results. Then, determine how long it will take to acquire those players. If you have a marketing budget, you may be able to acquire more players per day during the test. But remember that players acquired through UA may behave differently than your “regular” players.
  • Don’t run too long - More data doesn’t mean better results. Yes, more data can improve detecting small changes between Variant(s), but your goal should be finding meaningful, i.e. large, improvements. If you don’t see statistically significant differences after collecting the planned data based on a Sample Size Calculator, let it go and move on to your next big idea.
  • No peeking - Except for watching to ensure nothing is broken, avoid the temptation to draw early conclusions during the test. You may see what looks like a pattern in early data and make incorrect decisions. Wait until the planned days have elapsed before analyzing or making any decisions. For example, if we flip a coin 100 times, we expect approximately 50% heads and 50% tails. However, we might initially see H, H, H, H, H, H, H, conclude we have a bad coin, and stop flipping - yet had we continued, there may have been a run of tails. Testing requires patience and following the plan. (A small simulation of why peeking misleads appears after this list.)
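
A small simulation of the “No peeking” point above, using synthetic data where the two groups are identical by construction: checking a t-test after every day and stopping at the first p < 0.05 declares far more false “winners” than the planned 5%, while a single test at the scheduled end does not.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
days, players_per_day, runs = 14, 200, 500
false_wins_peeking = false_wins_planned = 0

for _ in range(runs):
    a = rng.normal(size=(days, players_per_day))  # identical groups by construction
    b = rng.normal(size=(days, players_per_day))
    # Peeking: test after every day, stop at the first "significant" result.
    if any(stats.ttest_ind(a[:d].ravel(), b[:d].ravel()).pvalue < 0.05
           for d in range(1, days + 1)):
        false_wins_peeking += 1
    # Planned: one test at the end of the scheduled window.
    if stats.ttest_ind(a.ravel(), b.ravel()).pvalue < 0.05:
        false_wins_planned += 1

print(f"peeking false-positive rate: {false_wins_peeking / runs:.0%}")
print(f"planned false-positive rate: {false_wins_planned / runs:.0%}")
```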

Analyzing Results

  • Assumptions of “normal” - Most statistical tests and common aggregate metrics (e.g., “average” or “mean”) assume your data is normal, i.e., it looks like a “bell curve.” Most of our KPIs in mobile games instead follow a “long-tail” distribution. For example, most players spend very little, but a few players spend a lot; most players have short sessions, but a few have very long sessions. The problem is that we like to summarize data using simple averages - ARPDAU (average revenue per daily active user), ARPPU (average revenue per paying user), average spend, average time-in-app, etc. However, “average” only works well with data that is roughly normal. For “long-tail” data, simple aggregates like the average can be very deceptive and lead to incorrect interpretation. Common statistical tests (t-test, ANOVA, and others) also assume normality; if your data isn’t normal, those tests can mislead you. When analyzing your AB Test results, your analysts need to understand these limitations and use appropriate methods for identifying differences between Variant(s). (A minimal illustration appears after this list.)
  • Which KPIs to consider - Lifetime value (LTV) tends to be a gold standard, but you may also be testing to improve retention, players engaging with specific game features, players progressing further into your game, etc. When designing your test, recall your hypothesis statement and your core objective. A test might affect any number of KPIs, but articulate what is most important and avoid being distracted by too much information. Have a single primary objective in mind and, at most, one or two secondary related objectives. Ultimately, LTV pays the bills, but your test might be focused on a specific game mechanic or feature that isn’t directly linked with LTV. If so, focus on the objectives that make the most sense.
  • Realized vs. predicted values - At the end of your AB Test, you have data on what happened throughout the test. If you make decisions based only on data seen during your test, i.e., realized data, you may miss future trends - if the test had run longer, a different variant might have won. For example, you run a test for one week during which Variant A spends less than Variant B. Based on this one week, Variant B is the “winner.” However, had the test run for 4 weeks, Variant A (which spent less during week 1) might outspend Variant B during weeks 2 through 4, leading to higher overall revenue, and should be the “winner.” Using only realized revenue from the one-week test, we missed the future trend. In reality, we cannot run tests forever and must cut them off at some point. Instead of relying on realized revenue during the test, a better approach is to use predicted future value. Forecasting future value requires deeper analysis of patterns during the test but will help you make better decisions.
  • Statistical Analysis - Always use statistical analysis when looking at and interpreting results. Never look at raw values when picking your winner. When we run AB tests, there is inherent variability - we need to know if it was a random chance that Variant A was better than Variant B or if there was a difference and Variant A is, in fact, better. Statistical tests essentially tell us, “If we ran this same test many times, how confident are we that the same group would be the winner every time?” or “What is the probability our winner was from random chance?”
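
A minimal illustration of the “Assumptions of normal” point above, using synthetic long-tail revenue: a handful of extreme spenders in one group produces a large gap in the means, while the medians and a rank-based (non-parametric) test tell a different story. The distribution shape and numbers are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Heavy-tailed per-player revenue: most players spend almost nothing.
control = rng.lognormal(mean=-2.0, sigma=1.5, size=10_000)
variant = rng.lognormal(mean=-2.0, sigma=1.5, size=10_000)
variant[:5] += 500.0  # a few whales (or cheaters) happen to land in one group

print(f"mean:   control={control.mean():.3f}  variant={variant.mean():.3f}")
print(f"median: control={np.median(control):.3f}  variant={np.median(variant):.3f}")

# A rank-based test is not driven by a few extreme spenders.
p = stats.mannwhitneyu(control, variant, alternative="two-sided").pvalue
print(f"Mann-Whitney U p-value: {p:.3f}")
```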

Closing your Tests

  • Grandfathering - After running a test and choosing a “winner,” the final step is closing your test. All new players will get the winning variant values; that’s obvious. However, what do you do with the players currently in your test? The two main choices are to force everyone to the winning condition or grandfather players into their current conditions. There are pros and cons to each option. If you push everyone to the winning state, you might upset players who had a different experience and don’t like change. For this reason, if your system supports grandfathering, that can be a good choice. If you want to change their experience and force them to the winning conditions, wait until they update their app version. Players expect to see changes when they update, which may mitigate frustration with a change. Conversely, leaving players grandfathered for an extended period can lead to code complexity and more variability with your existing player base.
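
A hedged sketch of what grandfathering can look like at rollout; the config key and variant names below are hypothetical. Players who already have a stored assignment keep it, while new or unassigned players get the winning values going forward.

```python
WINNING_VARIANT = "variant_a"                  # hypothetical winner
CONFIG_KEY = "ab_interstitial_cooldown"        # hypothetical experiment key

def resolve_variant(player_profile: dict) -> str:
    if CONFIG_KEY in player_profile:              # grandfathered: keep their experience
        return player_profile[CONFIG_KEY]
    player_profile[CONFIG_KEY] = WINNING_VARIANT  # everyone new gets the winner
    return WINNING_VARIANT

print(resolve_variant({}))                       # -> variant_a
print(resolve_variant({CONFIG_KEY: "control"}))  # -> control
```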

Limitations of AB Tests

  • Moment in time - Consider the context when you run your AB Test. For a given test, we can only say, “Variant A is likely the winner given our game’s current player base and set of features.” However, if your player base changes over time (different UA sources, changing market, player expectations, etc.), then Variant A might no longer be the winner. As you add new features to your game, you might add or change something such that the Variant A values are no longer optimal. Technically, you would need to rerun your AB test whenever you make substantial changes to your game or an extended period has passed. We treat AB Tests like they give us a permanent, definitive answer, but in reality, games and players are dynamic and constantly changing.
  • Best of Bad Options - When we run AB Tests, we have to pick arbitrary Variant(s) to test. All our variants might be terrible, and we are merely choosing the best of bad options. AB Tests work best for yes/no questions, not tuning. “Does something work?” “Did the spin wheel increase revenue?” AB Tests are not good at helping identify a game balance value, i.e., tuning. For example, “What cooldown do I need between interstitial ads?” Consider a scenario where 87 seconds is the theoretical optimum we hope to identify with an AB Test. AB Tests can only try a handful of Variant(s) at a time. In experiment #1 you try 10s, 30s, 60s, 120s, and 240s. From this, you determine that 60s performed best. If you stop, you won’t discover the better 87s value. So you run a follow-up with 60s, 90s, and 120s. You now find 90s performed best. Do you run another follow-up trying 85s, 90s, and 95s? As you can see, using AB Tests to find a specific optimum can be time-consuming. We are also assuming that the theoretical optimum (87s) is static and not changing out from under us.
  • Simpson’s Paradox - If you run an AB Test with a mix of different user segments, the aggregate results (everyone at once) can point in the opposite direction from the results of analyzing each segment separately, because the segment mix can differ between groups. (A worked numeric example appears after this list.)
  • Release effects - When you release a new version of your game, you will almost always see a bump in KPIs utterly independent of the changes you made. We’ve tested this effect - we made multiple releases where we changed nothing in the game and just labeled each release “Bug fixes” with a new build number. The mere fact that you made a release can have a short-term impact on the KPIs you use for AB Test analysis. It’s safer to release your new version, wait until KPIs return to normal, and then run your test. Likewise, releasing a new version of your app while an AB Test is running can skew or bias results.
  • Mixed Results - When analyzing an AB Test, conflicting shifts in different KPIs might make interpretation more difficult. For example, Variant A has higher revenue but lower retention, while Variant B has lower revenue and higher retention. What do you do? What is more important? You should consider this when designing your experiment. Mixed results could imply more is going on than you thought, and you might need to restructure your tests.
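
A worked numeric example of Simpson’s Paradox with made-up counts: Variant B converts better inside both segments, yet Variant A looks better in the aggregate because the two groups ended up with very different mixes of new and veteran players.

```python
segments = {
    #                 A: (converted, exposed)   B: (converted, exposed)
    "new players": ((10, 100), (150, 1000)),
    "veterans":    ((500, 1000), (60, 100)),
}

totals = {"A": [0, 0], "B": [0, 0]}
for name, ((ca, na), (cb, nb)) in segments.items():
    print(f"{name:12s} A: {ca / na:6.1%}   B: {cb / nb:6.1%}")
    totals["A"][0] += ca; totals["A"][1] += na
    totals["B"][0] += cb; totals["B"][1] += nb

print(f"{'overall':12s} A: {totals['A'][0] / totals['A'][1]:6.1%}   "
      f"B: {totals['B'][0] / totals['B'][1]:6.1%}")
# Per segment, B wins (15% vs 10%, 60% vs 50%); in aggregate, A "wins" (46% vs 19%).
```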
