Why We’re Writing This
Here at NeuroFlow, we’re all about strengthening the partnership between product design and data analytics. Our business is growing quickly, but it’s still run by a small team; with only 30 people, time is often our most precious resource. This, combined with the highly regulated nature of mental health services, makes it nearly impossible to do extensive user testing before and after every feature release, change, or update. When it comes to product design, we have to be thoughtful about which features to implement, what data to measure, how that data is analyzed, and what to do with our findings. This article will demonstrate our approach to this decision-making process by walking through the implementation and performance analysis of a new feature in our app, “Streaks.”
Helping Users Build Healthy Habits Around Mental Health
According to NeuroFlow’s clinical partners, 71% of patients on our platform report an improvement in their behavioral health symptoms. This is a great baseline, but there is always room for improvement, especially when it comes to bridging the gap between mental and physical health. Our product team is constantly exploring new features aimed at driving user engagement and retention. By encouraging consistent usage, we are ultimately helping people form healthy habits that support their overall wellness.
The “Streak,” a visible counter indicating how many consecutive days a NeuroFlow user completes an activity, seemed like a quick-to-build feature that could provide a significant boost in engagement if implemented properly. It has seen success on popular apps like Snapchat, Duolingo, and Streaks: The Habit-Forming To-Do List, which is ranked #8 in the Apple App Store’s “Health & Fitness” category. Given our focus on helping individuals develop healthy habits in their own lives, we decided that daily engagement with resources such as mood journals, meditations, and mindfulness modules was a positive behavior to promote, as it would likely be a beneficial influence in people’s lives.
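To make the counter concrete, here is a minimal sketch of how a streak could be computed from a list of completion dates. The function name, the sample dates, and the rule that a streak survives until the end of the following day are all assumptions made for illustration; this is not our production logic.

```python
from datetime import date, timedelta

def current_streak(activity_dates, today):
    """Count consecutive days with at least one completed activity.

    Assumption: the streak is still alive if the most recent activity
    was yesterday, so a user who has not logged in yet today keeps it.
    """
    days = set(activity_dates)  # several activities on one day count once
    cursor = today if today in days else today - timedelta(days=1)
    streak = 0
    while cursor in days:
        streak += 1
        cursor -= timedelta(days=1)
    return streak

# Invented completion dates for a single hypothetical user.
completions = [
    date(2019, 5, 1),
    date(2019, 5, 2),
    date(2019, 5, 2),
    date(2019, 5, 3),
    date(2019, 5, 5),
]

print(current_streak(completions, today=date(2019, 5, 3)))  # 3
print(current_streak(completions, today=date(2019, 5, 5)))  # 1
```

The second call shows the gap on May 4 resetting the counter.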
Challenges: Learn Fast, Measure Twice
In an ideal world, we would be able to “A/B test” every new feature before release, including Streaks. That is, the data and engineering teams would release two app variants side by side with one key feature difference (status quo vs. proposed change), decide on metrics to evaluate their relative performance, and then collect data to see which version performs better. Although this sounds simple enough, conducting a truly effective test for a mobile application that can inform product design decisions is not always as easy as it seems. Considerable work is needed to set up valid experiments, and it can take a long time before statistically significant differences between product versions are observed. A company with hundreds of employees and millions of users might be able to make this a regular part of its workflow with ease, but it is a much bigger challenge at a fast-growing technology company where things change every day. In such an environment, it’s important to prioritize development and engineering time for the projects and initiatives that have the largest impact. Helping teams strike this balance is a skill that product managers work on every day.
Data: Evaluating Our Impact
Aside from measuring patient and organization-level information on the Streaks themselves, we wanted to look at specific data that would indicate engagement with NeuroFlow in a way that supported our core company goal of improving patient outcomes. This led us to examine the number of activities and homework (which could range from validated assessments to guided breathing exercises and more) completed by our users. An “activity” in our system is any module completed by a user for any reason, while “homework” refers specifically to activities assigned by a health provider. By looking at two different sources of data, we ensure that our conclusions aren’t tied too closely to the measurement methods of any single metric. These data include information on journals, meditations, and digital CBT modules, as well as mood, sleep, and pain trackers.
It’s important to note that all of this data is anonymous, and everything we do at NeuroFlow is HIPAA compliant. Personal information on our platform is end-to-end encrypted, meaning that only you and your health provider can see it. We can see anonymized activity and homework completions, but anything that could potentially identify a patient just looks like a bunch of scrambled numbers and letters to us (this is by design, and merits a separate blog post on its own). Let’s take a visual look at our data regarding the lengths of Streaks among patients.
Metrics: Measuring What Matters
Using SQL, a query language for retrieving and manipulating data in database tables, we extracted the relevant information to assess changes and calculated three key metrics: Average Number of Activities Completed per Patient, Average Number of Homeworks Completed per Patient, and Homework Compliance Rate. We limited our data to the month before and the month after releasing Streaks. These two periods were the samples we compared to draw our conclusions. Although we can’t show the specifics for everything, one of our comparative metrics indicated that activity completion increased from 37.49 modules/month before the introduction of Streaks to 39.32 modules/month after. The rest of our metrics showed similar results.
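For readers who want to see what such a query might look like, here is a small sketch that runs SQL through Python’s built-in sqlite3 module. Everything in it is assumed for illustration: the table and column names, the May 1 release date, the handful of invented rows, and the definition of compliance as completed homework divided by assigned homework. It is not our actual schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activities (patient_id TEXT, completed_on TEXT);
CREATE TABLE homework (patient_id TEXT, assigned_on TEXT, completed INTEGER);

-- Invented rows for two hypothetical patients.
INSERT INTO activities VALUES
  ('a1', '2019-04-03'), ('a1', '2019-04-10'), ('b2', '2019-04-12'),
  ('a1', '2019-05-02'), ('a1', '2019-05-09'), ('a1', '2019-05-20'),
  ('b2', '2019-05-11');
INSERT INTO homework VALUES
  ('a1', '2019-04-05', 1), ('b2', '2019-04-06', 0),
  ('a1', '2019-05-04', 1), ('b2', '2019-05-07', 1);
""")

def metrics(start, end):
    """Return the three metrics for the half-open period [start, end)."""
    avg_activities = conn.execute(
        """SELECT COUNT(*) * 1.0 / COUNT(DISTINCT patient_id)
           FROM activities
           WHERE completed_on >= ? AND completed_on < ?""",
        (start, end),
    ).fetchone()[0]
    avg_homework, compliance = conn.execute(
        """SELECT SUM(completed) * 1.0 / COUNT(DISTINCT patient_id),
                  AVG(completed)
           FROM homework
           WHERE assigned_on >= ? AND assigned_on < ?""",
        (start, end),
    ).fetchone()
    return avg_activities, avg_homework, compliance

before = metrics("2019-04-01", "2019-05-01")  # month before the assumed release
after = metrics("2019-05-01", "2019-06-01")   # month after the assumed release
print(before)  # (1.5, 0.5, 0.5)
print(after)   # (2.0, 1.0, 1.0)
```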
It seems at first glance that Streaks have provided a slight lift to some of our engagement metrics, woo hoo! The number of activities completed per patient increased in the month post-release relative to the month prior, while homework completion per patient and homework compliance also increased during the period. What else is there to say?
Testing for Differences: Fooled by Randomness?
Although it’s tempting to look at these metrics and declare “victory!”, there is a very real possibility that our observations are the result of pure chance. That is, we got lucky, and there really is no underlying difference in the behavior that generated our “before” and “after” samples. Let’s separate the signal from the noise.
To see if there really was a change, we will run two different statistical tests on our metrics of choice. These tests, when applied to the data, spit out the probability of seeing a difference at least as large as the one we observed if our two samples actually came from the same underlying process. If this probability, or “p-value,” is less than a certain cutoff (usually 0.05), then we can say that the differences between our two samples are unlikely to be caused by randomness alone. That is, we can reject the “null hypothesis” of these two samples being the same at the 5% significance level. In our case, the null hypothesis is that Streaks make no difference in user engagement with NeuroFlow, while the alternative hypothesis is that they do.
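One way to build intuition for what a p-value measures is a permutation test: shuffle the “before” and “after” labels many times and count how often chance alone produces a gap as large as the real one. This is not one of the two tests we ran; it is a sketch for illustration, and the function name and sample values are invented.

```python
import random

def permutation_p_value(before, after, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in means by shuffling labels."""
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(after) - mean(before))
    pooled = list(before) + list(after)
    n_before = len(before)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # pretend the period labels carry no information
        diff = abs(mean(pooled[n_before:]) - mean(pooled[:n_before]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_shuffles + 1)

# Invented monthly activity counts for two small groups of patients.
before = [1, 2, 3, 4, 5, 6, 7, 8]
after = [11, 12, 13, 14, 15, 16, 17, 18]
print(permutation_p_value(before, after))   # small: the gap is hard to get by chance
print(permutation_p_value(before, before))  # 1.0: identical samples show no gap
```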
The reason we decided to use two different tests is to ensure that any conclusions drawn are not fragile to the assumptions underlying a particular statistical method. Thus, the samples were tested assuming both normal and non-normal distributions for our data. Before getting into the tests themselves, I think it’s important to briefly review the concept of a distribution. For those who have not taken a course in statistics, a random variable’s “distribution” is a visual and mathematical description of how often the possible values of that variable appear when measured. It is a tool that is often used to describe processes characterized by some element of uncertainty. Although some people assume that percentages, averages, and medians are the be-all and end-all of data, it’s really the distribution (or lack thereof) that matters. As a conceptual example, here are distributions for the observed temperature of two different machine parts. The x-axis represents temperature, while the y-axis represents how many times that particular temperature appears when measured. The temperature measurements of component (a) are “normally distributed,” meaning their frequencies follow a “bell curve” pattern, while the measurements for component (b) are not.
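In code, a distribution is just a tally of how often each value shows up. The readings below are invented for illustration and are not taken from the components described above.

```python
from collections import Counter

# Invented temperature readings for a hypothetical machine part.
readings = [70, 71, 71, 72, 72, 72, 73, 73, 74]

frequency = Counter(readings)  # maps each temperature to how often it appears
for temperature in sorted(frequency):
    # One '#' per observation gives a rough text histogram.
    print(temperature, "#" * frequency[temperature])
```

The printed bars rise toward the middle value and fall away on both sides, which is the rough shape of a bell curve.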
For our “normal” case, we will be using a Two-Sample T Test, which measures differences between the means of two datasets. Under non-normal assumptions, we will be employing a Wilcoxon Rank Sum test, a rank-based method that is commonly interpreted as measuring differences between the medians of two datasets. Going into the specifics of how these tests work is a bit too technical for this article, but we’ve included some further reading for those who are interested.
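For the curious, here is a rough sketch of both tests using only Python’s standard library. It makes several assumptions that may differ from our actual analysis: the t test is the unequal-variance (Welch) form, both p-values use a normal approximation that is only reasonable for large samples, and the rank sum test skips the usual correction for ties. The function names and sample values are invented; a statistics package would normally handle these details.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def welch_t_test(x, y):
    """Return (t, approximate two-sided p) for a difference in means."""
    standard_error = sqrt(variance(x) / len(x) + variance(y) / len(y))
    t = (mean(y) - mean(x)) / standard_error
    # Normal approximation to the t distribution (assumes large samples).
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p

def rank_sum_test(x, y):
    """Return (z, approximate two-sided p) for the Wilcoxon rank sum test."""
    combined = sorted([(value, 0) for value in x] + [(value, 1) for value in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # tied values share ranks i+1 through j
        rank_sum_x += average_rank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    expected = n1 * (n1 + n2 + 1) / 2
    spread = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction applied
    z = (rank_sum_x - expected) / spread
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Invented samples: "after" is shifted upward relative to "before".
before = [1, 2, 3, 4, 5]
after = [6, 7, 8, 9, 10]
print(welch_t_test(before, after))
print(rank_sum_test(before, after))
```

A negative z from the rank sum function means the first sample tends to hold the lower ranks.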
Results: What We Learned
Our Two-Sample T test indicates a significant difference in activity completion before and after the release of Streaks at a p-value cutoff, or “alpha,” of 0.05. Our Wilcoxon Rank Sum test indicates significant differences across all three metrics between our “before” and “after” periods. The largest p-value for any of these metrics does not exceed 0.005, or 0.5%, which would clear even a generous cutoff of 10%. The rest may as well be 0%.
Wrapping It All Up
We can conclude, with some degree of statistical confidence, that the introduction of Streaks had a positive impact on user engagement. This conclusion seems to hold across both normal and non-normal assumptions. One thing that could throw a wrench in our analysis is potential confounding, where the lift in engagement could be caused by other changes in the UX unrelated to Streaks. One way to resolve this would be to perform an A/B test over a set period of time with Streaks as the only difference between versions, although this would require significant time and resources. We should also consider complications that could arise from counting patients from our activities table. It is possible that a SQL query written for “activities” undercounts the number of patients by pulling user ID from its column in the activities table itself. That is, patients who did not complete any activities had no entries in that table for the period, so they may not be counted. Even so, we can gather insight from the fact that patients who completed at least one activity in a given month completed more activities, on average, after Streaks were introduced than before. This complication does not present itself in the homework dataset, as the assigned activities are still logged even if not completed.
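The undercounting issue is easier to see with a toy example. The sketch below assumes a separate patients table exists alongside the activities table; both tables, their columns, and the rows are hypothetical, and the date filter is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients (patient_id TEXT PRIMARY KEY);
CREATE TABLE activities (patient_id TEXT, completed_on TEXT);

-- Four hypothetical patients, two of whom completed nothing.
INSERT INTO patients VALUES ('a1'), ('b2'), ('c3'), ('d4');
INSERT INTO activities VALUES
  ('a1', '2019-05-02'), ('a1', '2019-05-09'), ('b2', '2019-05-11');
""")

# Denominator taken from the activities table: inactive patients vanish.
active_only = conn.execute(
    "SELECT COUNT(*) * 1.0 / COUNT(DISTINCT patient_id) FROM activities"
).fetchone()[0]

# Denominator taken from the patients table: a LEFT JOIN keeps every patient,
# and COUNT(a.patient_id) skips the NULLs produced for inactive ones.
all_patients = conn.execute("""
    SELECT COUNT(a.patient_id) * 1.0 / COUNT(DISTINCT p.patient_id)
    FROM patients AS p
    LEFT JOIN activities AS a ON a.patient_id = p.patient_id
""").fetchone()[0]

print(active_only)   # 1.5
print(all_patients)  # 0.75
```

The same three activities yield an average that is twice as high when the two inactive patients are silently dropped.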
Well, that’s it! We hope you enjoyed this brief look into our workflow and how our product and data science teams work together to create a captivating user experience and improve health outcomes for tens of thousands of users across the country.
Thanks for reading! Did you like this article? Are you a data scientist, product manager, or UX engineer with scathing criticism of our methods? Are you concerned about user privacy and want to give us a piece of your mind? Regardless, we appreciate feedback and would love to answer your questions, so feel free to reach out to our Sr. Director of Product at firstname.lastname@example.org.
Michael Harrison Lee is a Product Associate at NeuroFlow, where he integrates design-thinking with data-driven research to continually improve the company’s software platform.
Further reading for the curious: