Do Blue Foxes Get Better Scores? - Data Experiments -
April 13, 2020
Why Analytics Are Useful
Games coupled with analytics are a powerful combination. One of the problems in running experiments, particularly anything moderately complicated involving people, is the difficulty of controlling the variables. This problem is completely turned on its head in virtual environments. The possibilities for enhancing both the study of education and the design of educational tools are limitless.
But there’s a problem. Analytics are powerful tools, and the amount of data they can capture from a game is staggering; it can be overwhelming. I’m always a little worried about the reliability of my games’ measurements. To ease my fears, I’ve conducted a little analytics experiment.
I want to be able to segment my users by different properties, and then see if there’s a difference between the segments’ average final scores. My goal is to see if Google Analytics aggregates my events correctly.
Analytics Terminology
Events: Triggered when a question is answered. Each event has 4 properties: a category, an action, an optional label and an optional value. Event categories, actions and labels can be used as dimensions in tables and graphs. The values are aggregated across all matching events in the segment.
Example Event: {category: ‘math’, action: ‘add’, label: ‘3’, value: 100}
Session: How events are grouped. Roughly, all events sent during one visit belong to one session.
Segment: Filters sessions by their properties. A session can be included or excluded if it includes a specific event.
This allows us to ask questions like, ‘How did the value of event Y change for sessions with event X vs. event Z?’
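To make the three terms concrete, here is a minimal toy model in Python. It is only a sketch of the ideas above, not the Google Analytics API: the names `Event`, `sessions_with_category` and `average_value` are made up for illustration, a session is assumed to be a plain list of events, and the two sample sessions are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    category: str
    action: str
    label: Optional[str] = None  # optional, as described above
    value: Optional[int] = None  # optional, as described above


def sessions_with_category(sessions, category):
    """Toy segment: keep sessions that contain at least one event with this category."""
    return [s for s in sessions if any(e.category == category for e in s)]


def average_value(sessions, category):
    """Average the values of all events with the given category in these sessions."""
    values = [
        e.value
        for s in sessions
        for e in s
        if e.category == category and e.value is not None
    ]
    return sum(values) / len(values) if values else None


# Two invented sessions, each modeled as a list of events (an assumption).
sessions = [
    [Event("fox", "blue", "blue-fox"), Event("add", "3", value=100)],
    [Event("swan", "red", "red-swan"), Event("add", "2", value=0)],
]

fox_sessions = sessions_with_category(sessions, "fox")
fox_average = average_value(fox_sessions, "add")
print(len(fox_sessions), fox_average)
```

In this toy version, the question ‘how did the add value differ for sessions with a fox event?’ becomes a filter over sessions followed by an average over the matching events.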
This is an example of a segment:
My problem? It’s difficult to tell exactly how Google groups these events together for you. I initially thought that if a segment filtered for sessions with a category ‘fox’ and an action ‘blue’, it would return only sessions containing a single event with both the matching category and action. Apparently, that’s not how it works.
The Data Experiment
There are two questions per session.
Question 1: What is your favorite animal?
Event answers:
{category: ‘fox’, action: ‘blue’, label: ‘blue-fox’}
{category: ‘swan’, action: ‘red’, label: ‘red-swan’}
{category: ‘bear’, action: ‘blue’, label: ‘blue-bear’}
Question 2: What is 1 + 2?
Event answers:
{category: ‘add’, action: ‘2’, value: 0}
{category: ‘add’, action: ‘3’, value: 100}
This is the experiment page:
I created segments with different combinations of Question 1 categories, actions and labels to filter the sessions by. Google Analytics returns all sessions matching the segment criteria and aggregates their values. I then analyzed the number of add events per segment, and their average value.
For each session I emitted a different pair of answers, as shown in the diagram below. There were 7 planned sessions in total... then one accidental session a day later, where I emitted 3 blue fox/add-2 combos in a fit of curiosity X( That doesn't show up in the analysis, but it does in the live data report.
This makes it possible to predict each segment’s add event values.
For instance, a segment on the action ‘blue’ should capture 4 add events, with an average value of 75.
The Data Experiment Results
I first segmented by the categories ‘fox’, ‘swan’ and ‘bear’. Because these were unique, they provided accurate results.
Segmenting by actions ‘blue’ or ‘red’ was a little more interesting. Because there was overlap between the ‘blue fox’ event and the ‘blue bear’ event, the ‘blue’ segment captured the sessions of both. As such, its aggregate event value differed from both ‘blue-fox’, with 66.67, and ‘blue-bear’, with 100; it came out at 75, the average over all the add events in the combined sessions.
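The arithmetic can be checked with a few lines of Python. The individual values below are not taken from the report; they are inferred from the stated figures (4 add events averaging 75, made up of a ‘blue-fox’ average of 66.67 and a ‘blue-bear’ average of 100, with each add event worth either 0 or 100), so treat them as an assumption.

```python
# Inferred add-event values (assumption): three blue-fox sessions, one blue-bear session.
blue_fox_values = [100, 100, 0]  # averages 66.67
blue_bear_values = [100]         # averages 100

# The 'blue' segment captures the add events of both groups.
blue_values = blue_fox_values + blue_bear_values

blue_fox_avg = sum(blue_fox_values) / len(blue_fox_values)
blue_bear_avg = sum(blue_bear_values) / len(blue_bear_values)
blue_avg = sum(blue_values) / len(blue_values)

print(round(blue_fox_avg, 2), blue_bear_avg, blue_avg)
```

Note that 75 is the average over the four events, not the midpoint of 66.67 and 100; the three blue fox events weigh more than the single blue bear event.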
The Surprising Result
Segmenting by both category and action was the most interesting to me. As stated above, I assumed Google Analytics would infer that if I filtered by category and action in a segment, I meant to filter for events that had both matching properties. This isn’t how it works. The program will include any session so long as the category and action are found on any of its events, even if they’re on separate events. This resulted in the bogus ‘red fox’ segment registering a hit, despite there being no such event in my code.
One of my sessions involved answering both ‘blue fox’ and ‘red swan’. Because this session had the category ‘fox’ and the action ‘red’, my statistics now register a positive score for an event that doesn’t exist.
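The difference between the two interpretations can be sketched in a few lines of Python. This is a toy model of the behavior I observed, not Google Analytics code; the function names `matches_any_event` and `matches_same_event` are invented for illustration.

```python
def matches_any_event(session, category, action):
    """Observed behavior: category and action may sit on different events."""
    has_category = any(e["category"] == category for e in session)
    has_action = any(e["action"] == action for e in session)
    return has_category and has_action


def matches_same_event(session, category, action):
    """What I expected: one single event must carry both properties."""
    return any(
        e["category"] == category and e["action"] == action for e in session
    )


# The session in which both 'blue fox' and 'red swan' were answered.
session = [
    {"category": "fox", "action": "blue", "label": "blue-fox"},
    {"category": "swan", "action": "red", "label": "red-swan"},
]

red_fox_loose = matches_any_event(session, "fox", "red")
red_fox_strict = matches_same_event(session, "fox", "red")
print(red_fox_loose, red_fox_strict)  # True False
```

The loose check finds ‘fox’ on one event and ‘red’ on another, so the session counts as a ‘red fox’ session even though no such event was ever sent.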
So, segments cannot filter on several properties of the same event at once. How do I overcome this? So far, using the label as a more specific identifier, usually a category-action pair, allows the segments to return accurate results.
By using the label for color-animal pairs (‘blue-fox’, ‘red-fox’, ‘red-swan’ and ‘blue-bear’), I am able to detect sessions with the matching color-animal pair in one condition. Since there was no event containing the label ‘red-fox’, this segment correctly filters all sessions out.
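Here is a minimal Python sketch of the label workaround, under the same toy-model assumptions as before: a session is a list of event dictionaries, and the helper names `make_label`, `answer_event` and `matches_label` are hypothetical, not part of any analytics library.

```python
def make_label(color, animal):
    """Combine action and category into one label, e.g. 'blue-fox'."""
    return f"{color}-{animal}"


def answer_event(animal, color):
    """Build a Question 1 event whose label repeats the color-animal pair."""
    return {"category": animal, "action": color, "label": make_label(color, animal)}


def matches_label(session, label):
    """A single condition on the label: it can only match one whole event."""
    return any(e.get("label") == label for e in session)


# The session that answered both 'blue fox' and 'red swan'.
session = [answer_event("fox", "blue"), answer_event("swan", "red")]

has_blue_fox = matches_label(session, "blue-fox")
has_red_fox = matches_label(session, "red-fox")
print(has_blue_fox, has_red_fox)  # True False
```

Because the pair lives in one property, there is no way for the category of one event and the action of another to combine into a false match.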
The resulting events are ugly to behold, but they seem delightfully effective so far. I’ll be running more experiments in future, just to be sure.
P.S: I added an additional session with 3 blue fox events and 1 add-2 event by accident, which throws the live report values a little off! I'll leave it as an exercise for you to figure out the way this changed the number of total blue fox events, and the average score.
More information about these reports can be found here