Running a data experiment with Google Analytics, to see if a controlled series of events generates the predicted results

Do Blue Foxes Get Better Scores? - Data Experiments

April 13, 2020

Author: Victoria

Why Analytics Are Useful

Games coupled with analytics are a powerful combination. One of the problems in running experiments, particularly anything moderately complicated involving people, is the difficulty of controlling the variables. This problem is completely turned on its head in virtual environments. The possibilities for enhancing both the study of education and the design of educational tools are limitless.

But there’s a problem. Analytics are powerful tools, and the amount of data they can capture from a game is staggering; it can be overwhelming. I’m always a little worried about the reliability of my games’ measurements. To ease my fears, I’ve conducted a little analytics experiment.

I want to be able to segment my users by different properties, and then see if there’s a difference between the segments’ average final scores. My goal is to see if Google Analytics aggregates my events correctly.

Analytics Terminology

Events: Triggered when a question is answered. Each event has 4 properties: a category, an action, an optional label and an optional value. Event categories, actions and labels can be used as dimensions in tables and graphs. The values are aggregated for all matching events in the segment.

Example Event: {category: ‘math’, action: ‘add’, label: ‘3’, value: 100}

Session: How events are grouped.

Segment: Filters sessions by their properties. A session can be included or excluded if it includes a specific event.

This allows us to ask questions like, ‘how did the value of event Y change for sessions with event X vs event Z?’
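The terms above can be sketched in a few lines of Python. This is only an illustrative model, not Google Analytics code: the `Event` class, the `sessions_with` and `average_value` helpers, and the sample sessions are all made up for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    category: str
    action: str
    label: Optional[str] = None  # optional property
    value: Optional[int] = None  # optional property


def sessions_with(sessions, category):
    """Segment: keep sessions containing at least one event in `category`."""
    return [s for s in sessions if any(e.category == category for e in s)]


def average_value(sessions, category):
    """Average the values of all `category` events in the given sessions."""
    values = [e.value for s in sessions for e in s
              if e.category == category and e.value is not None]
    return sum(values) / len(values) if values else None


# Hypothetical sample: each session is a list of events.
sessions = [
    [Event("X", "a"), Event("Y", "score", value=100)],
    [Event("X", "a"), Event("Y", "score", value=50)],
    [Event("Z", "b"), Event("Y", "score", value=0)],
]

# "How did the value of event Y change for sessions with event X vs event Z?"
avg_y_for_x = average_value(sessions_with(sessions, "X"), "Y")
avg_y_for_z = average_value(sessions_with(sessions, "Z"), "Y")
print(avg_y_for_x, avg_y_for_z)
```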

This is an example of a segment: one that requires an event category matching ‘fox’ and an action matching ‘blue’.

My problem? It’s difficult to tell exactly how Google groups these events together for you. I initially thought that, if a segment filters for sessions that have a category ‘fox’ and an action ‘blue’, it would return a session only if it had one event with both the matching category and action. That’s not how it works, apparently.

The Data Experiment

There are two questions per session.

Question 1: What is your favorite animal?

Event answers:

{category: ‘fox’, action: ‘blue’, label: ‘blue-fox’}

{category: ‘swan’, action: ‘red’, label: ‘red-swan’}

{category: ‘bear’, action: ‘blue’, label: ‘blue-bear’}

Question 2: What is 1 + 2?

Event answers:

{category: ‘add’, action: ‘2’, value: 0}

{category: ‘add’, action: ‘3’, value: 100}

This is the experiment page:

the experiment page with the form questions

I created segments with different combinations of Question 1 categories, actions and labels to filter the sessions by. Google Analytics returns all sessions matching the segment criteria, and aggregates their values. I then analyzed the number of add events per segment, and their average value.

For each session I emitted a different pair of answers, as shown in the diagram below. There were 7 planned sessions in total... Then one accidental session a day later where I emitted 3 blue fox/add-2 combos in a fit of curiosity X( That doesn't show up in the analysis, but it does in the live data report.

planned event diagram and the predicted values for different segments

This allows the prediction of each segment’s add event count and average value.

For instance, the action ‘blue’ should have 4 events, with an average value of 75.
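The ‘blue’ prediction can be checked with a short script. The sketch below assumes a session plan reconstructed from the segment results reported in this post; the diagram itself is not reproduced in the text, so the order of the sessions and the `SESSIONS` and `predict` names are assumptions.

```python
# Each planned session: (Question 1 labels answered, value of its add event).
# Reconstructed from the reported segment results; an assumption, not the
# original diagram.
SESSIONS = [
    (["blue-fox"], 100),
    (["blue-fox"], 0),
    (["blue-fox", "red-swan"], 100),
    (["red-swan"], 100),
    (["red-swan"], 0),
    (["red-swan"], 0),
    (["blue-bear"], 100),
]


def predict(color):
    """Return (add event count, average value) for sessions with `color`."""
    values = [value for labels, value in SESSIONS
              if any(label.startswith(color + "-") for label in labels)]
    return len(values), sum(values) / len(values)


blue = predict("blue")
red = predict("red")
print(blue, red)
```

Under these assumptions, `predict("blue")` gives 4 events with an average of 75, in line with the prediction above.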

The Data Experiment Results

I first segmented by the categories ‘fox’, ‘swan’ and ‘bear’. Because these were unique, they provided accurate results.

category segment results: fox = 3 events, average value 66.67, swan = 4 events, average value 50, bear = 1 event, average value 100

Segmenting by actions ‘blue’ or ‘red’ was a little more interesting. Because the ‘blue fox’ and ‘blue bear’ events share the action ‘blue’, this segment captured the sessions of both. As such, the aggregate event value differed from both ‘blue fox’ (66.67) and ‘blue bear’ (100): averaged over all 4 add events, it came to 75.
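Note that 75 is an average over events, not an average of the two segment averages. A quick check, using only the counts and averages reported above (the variable names are just for illustration):

```python
# (event count, average value) for the two categories that share action 'blue'
fox_count, fox_avg = 3, 200 / 3   # 66.67 when rounded
bear_count, bear_avg = 1, 100.0

# Weight each average by its number of events.
total_value = fox_count * fox_avg + bear_count * bear_avg
blue_avg = total_value / (fox_count + bear_count)

# A plain average of the two averages gives a different number.
naive_avg = (fox_avg + bear_avg) / 2

print(round(blue_avg, 2), round(naive_avg, 2))
```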

action segment results: blue = 4 events, average value 75, red = 4 events, average value 50

The Surprising Result

Segmenting by both category and action was the most interesting to me. As stated above, I assumed Google Analytics would infer that if I filtered by category and action in a segment, I meant to filter for events that had both matching properties. This isn’t how it works. The program will include any session so long as the category and action are found on any of its events, even if they’re on separate events. This resulted in the inaccurate ‘red fox’ segment registering a hit, despite there being no such event in my code.

One of my sessions involved answering both ‘blue fox’ and ‘red swan’. Because this session had the category ‘fox’ and the action ‘red’, my statistics now register a positive score for an event that doesn’t exist.
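The gap between the expected and the observed behavior can be sketched as two matching rules. This is a toy model of what the results suggest, not Google Analytics' actual implementation; the function names are hypothetical, and the session is the ‘blue fox’ plus ‘red swan’ one described above.

```python
# The session that answered both 'blue fox' and 'red swan',
# written as (category, action) pairs.
session = [("fox", "blue"), ("swan", "red"), ("add", "3")]


def matches_same_event(session, category, action):
    """Expected rule: a single event must carry both properties."""
    return any(c == category and a == action for c, a in session)


def matches_any_event(session, category, action):
    """Observed rule: each condition may be met by a different event."""
    has_category = any(c == category for c, _ in session)
    has_action = any(a == action for _, a in session)
    return has_category and has_action


print(matches_same_event(session, "fox", "red"))  # False
print(matches_any_event(session, "fox", "red"))   # True
```

Both rules agree on ‘fox and blue’; they only diverge when the category and action come from different events in the same session.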

category and action segment results: fox and blue = 3 events, average value 66.67, fox and red = 1 event, average value 100, swan and red = 4 events, average value 50, bear and blue = 1 event, average value 100

So, segments cannot simultaneously filter by several properties of one event. How do I overcome this? So far, using the label as a more specific measurement, usually a category-action pair, allows the segments to return accurate results.

By using the label for color-animal pairs, ‘blue-fox’, ‘red-fox’, ‘red-swan’ and ‘blue-bear’, I am able to detect sessions with the matching color-animal pair in one condition. Since there was no event containing the label ‘red-fox’, this segment accurately filters all sessions out.

label segment results: 'blue-fox' = 3 events, average value 66.67, 'red-fox' = 0 events, 'red-swan' = 4 events, average value 50, 'blue-bear' = 1 event, average value 100
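Folding both properties into the label can be sketched like this. The `make_event` and `session_has_label` helpers and the dictionary shape are hypothetical; the label format follows the color-animal labels used in the experiment.

```python
def make_event(category, action):
    """Build a Question 1 event whose label combines action and category."""
    return {"category": category, "action": action,
            "label": f"{action}-{category}"}


def session_has_label(session, label):
    """A single label condition can only be satisfied by one event."""
    return any(event["label"] == label for event in session)


# The session that answered both 'blue fox' and 'red swan'.
session = [make_event("fox", "blue"), make_event("swan", "red")]

print(session_has_label(session, "blue-fox"))  # True
print(session_has_label(session, "red-fox"))   # False
```

Because the label is one string on one event, a condition on it cannot be satisfied by mixing properties from two different events.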

The resulting events are ugly to behold, but they seem delightfully effective so far. I’ll be running more experiments in future, just to be sure.

P.S: I added an additional session with 3 blue fox events and 1 add-2 event by accident, which throws the live report values a little off! I'll leave it as an exercise for you to figure out the way this changed the number of total blue fox events, and the average score.

More information about these reports can be found here