Measuring algorithmic bias to analyze the reliability of AI tools that predict depression risk using smartphone sensed-behavioral data

Data collection

We analyzed data from a U.S.-based, NIMH-funded study conducted from 2019 to 2021 to identify associations between behavioral data collected from smartphones and depression symptoms^{3,25,26,27,28,29}. Smartphone sensed-behavioral data on GPS location, phone usage (timestamps of screen unlocks), and sleep were collected near-continuously from participants across the United States for 16 weeks. The PHQ-8, a self-reported measure of two-week depression symptoms^33,34 frequently used in mental health research^3,5,25,27, was administered multiple times a week every three weeks (on weeks 1, 4, 7, …; these weeks are referred to as weekly reporting periods). Sensed-behaviors were summarized over two weeks to align with collected PHQ-8 depression symptoms for prediction (see Table 1). For example, sensed-behaviors collected during weeks 3 and 4 were summarized to predict PHQ-8 responses collected during week 4.

Table 2 summarizes the data used for analysis. We analyzed 3900 samples from 650 individuals, a large cohort and sample size compared to most studies to date analyzing associations between sensed-behaviors and mental health^4,5,25,35,36. A sample was a set of sensed-behaviors, summarized over 2 weeks, corresponding to the average PHQ-8 response collected during a single weekly reporting period. Of the average self-reported PHQ-8 values, 46% were ≥10, indicating clinically-significant depression (CSD)³³. The majority of participants were relatively young to middle aged (75% were 25 to 54 years old), female (74%), White (82%), middle to high income (61% had an annual family income ≥$40,000), insured (93%), and employed (62%). We focused our results on subgroups with at least 15 participants³⁷. The sensed-behavior distributions across the population for each subgroup can be found in the supplementary materials.

Identifying subgroups where AI models underperform

The PHQ-8 asked participants to self-report depression symptoms experienced over 14 days, and PHQ-8s were delivered multiple times throughout each weekly reporting period. We trained AI models using 14 days of smartphone sensed-behavioral data to predict whether the average PHQ-8 value across each weekly reporting period (days 7 through 14, see Fig. 1c) indicated clinically-significant depression (CSD, PHQ-8 score ≥10³³) symptoms. While the PHQ-8 asks participants to self-report two-week depression symptoms, studies suggest that individual assessments may suffer from recency bias³⁸ or indicate “briefly” elevated depression symptoms³⁹. For this reason, PHQ-8 values were averaged over each weekly reporting period to predict a more stable estimate of self-reported symptoms.
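The labeling step described above can be illustrated with a minimal sketch. The function name and the example responses are hypothetical and are not taken from the study's code; only the averaging and the ≥10 threshold come from the text.

```python
from statistics import mean

CSD_THRESHOLD = 10  # average PHQ-8 >= 10 is treated as clinically-significant depression


def label_reporting_period(phq8_scores):
    """Average the PHQ-8 scores from one weekly reporting period.

    Returns (average_score, is_csd).
    """
    if not phq8_scores:
        raise ValueError("need at least one PHQ-8 response")
    avg = mean(phq8_scores)
    return avg, avg >= CSD_THRESHOLD


# Hypothetical responses from one participant within one reporting period.
avg, is_csd = label_reporting_period([12, 9, 8])
```

In this made-up example a single assessment (12) would cross the threshold, but the period average (about 9.7) does not, which is the kind of brief elevation that averaging is meant to smooth out.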

Model performance was assessed by performing 5-fold cross-validation, partitioning on subjects, and predictions across folds were concatenated to calculate model performance. Similar to prior work^4,6, within each cross-validation split, models were trained using data collected from 80% of the participants (520 participants), and the trained model was applied to predict CSD in the remaining 20% (130 participants). To analyze performance variability due to specific cross-validation splits, we performed 100 cross-validation trials, shuffling participants into different folds during each trial.
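A subject-partitioned, repeated cross-validation scheme of this kind might be sketched as follows. This is an illustration under assumptions, not the study's implementation: the helper names, the seeding scheme, and the participant identifiers are hypothetical.

```python
import random


def subject_folds(subject_ids, n_folds=5, seed=0):
    """Shuffle the unique subjects and deal them into n_folds disjoint folds."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    return [subjects[i::n_folds] for i in range(n_folds)]


def cross_validation_trials(subject_ids, n_trials=100, n_folds=5):
    """Yield (trial, train_subjects, test_subjects) for every fold of every trial."""
    for trial in range(n_trials):
        # A different seed per trial reshuffles participants into new folds.
        folds = subject_folds(subject_ids, n_folds, seed=trial)
        for test in folds:
            held_out = set(test)
            train = [s for fold in folds for s in fold if s not in held_out]
            yield trial, train, test


# Hypothetical identifiers for 650 participants; two trials for brevity.
subjects = [f"p{i:03d}" for i in range(650)]
splits = list(cross_validation_trials(subjects, n_trials=2))
```

Because the split is made over participants rather than samples, all samples from one participant land on the same side of each split, so a model is never evaluated on a person it was trained on.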

AI models output a predicted risk score from 0 to 1 of experiencing CSD. We used the predicted risk to calculate common ranking bias metrics^20,21,22 (Fig. 2) across the subgroups in Table 2. These metrics were based upon the area under the receiver operating characteristic curve (AUC), which measured the probability that models ranked CSD samples higher (in predicted risk) than relatively healthy (RH, PHQ-8 < 10) samples. We first calculated the AUC within each subgroup (the “Subgroup AUC”). Note that equal Subgroup AUCs do not guarantee a high AUC across an entire sample. For example, Fig. 2a shows simulated data where an algorithm correctly predicted CSD risk within subgroups, but younger individuals, compared to older individuals, had a higher overall predicted risk. Thus, across subgroups, healthy younger individuals may be incorrectly predicted to be at higher risk than older individuals experiencing CSD. Two additional performance metrics assessed such errors. Specifically, the background-negative-subgroup-positive AUC, or BNSP AUC (Fig. 2b), measured the probability that individuals experiencing CSD (the “positive” label) from a subgroup were correctly predicted to have higher risk than RH (the “negative” label) individuals from other subgroups (“the background”). The background-positive-subgroup-negative AUC, or BPSN AUC (Fig. 2c), measured the probability that RH individuals from a subgroup were correctly predicted to have lower risk than background individuals experiencing CSD.
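The three metrics can be written directly from their pairwise-ranking definitions. The sketch below is illustrative only: the function names and toy risk scores are hypothetical, ties are counted as half a correct ranking (a common convention that the text does not specify), and the pairwise loop is quadratic, which is acceptable for a few thousand samples.

```python
def auc(pos_scores, neg_scores):
    """Probability that a positive sample is ranked above a negative one."""
    if not pos_scores or not neg_scores:
        return float("nan")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


def ranking_bias_metrics(risk, is_csd, in_subgroup):
    """Subgroup, BNSP and BPSN AUC for one subgroup versus the background."""
    def pick(label, member):
        return [r for r, y, g in zip(risk, is_csd, in_subgroup)
                if y == label and g == member]

    return {
        # CSD vs RH, both inside the subgroup.
        "subgroup_auc": auc(pick(True, True), pick(False, True)),
        # Subgroup CSD (positive) vs background RH (negative).
        "bnsp_auc": auc(pick(True, True), pick(False, False)),
        # Background CSD (positive) vs subgroup RH (negative).
        "bpsn_auc": auc(pick(True, False), pick(False, True)),
    }


# Toy data in the spirit of Fig. 2a: the subgroup (younger) gets higher risk overall.
risk       = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
is_csd     = [True, True, False, False, True, True, False, False]
is_younger = [True, True, True, True, False, False, False, False]
metrics = ranking_bias_metrics(risk, is_csd, is_younger)
```

In this toy case the Subgroup AUC and BNSP AUC are both 1.0 while the BPSN AUC is 0.0: ranking is perfect within the younger subgroup, yet every healthy younger sample is scored above every older sample experiencing CSD.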

**Fig. 2: Measuring algorithmic ranking bias.**

The highest performing AI model (a random forest, 100 trees, max depth of 10, balanced class weights, see methods) achieved a median (95% confidence interval, CI) AUC of 0.55 (0.54 to 0.57) across trials. Note that this low AUC was expected: it is comparable to the cross-validation performance of similar depression symptom prediction tools developed in larger, more diverse populations^4,6,13, and motivates the objective of this work to study the reliability of these tools in larger populations.
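One way to summarize a metric across repeated trials is sketched below. It assumes the interval is taken as the 2.5th and 97.5th percentiles of the per-trial values; the text does not state how the reported interval was computed, so this is an assumption, and the function name and example AUCs are hypothetical.

```python
from statistics import median, quantiles


def summarize_trials(values):
    """Return (median, lower, upper) for a list of per-trial metric values.

    lower and upper are the 2.5th and 97.5th percentiles (an assumed choice).
    """
    # n=40 gives cut points every 2.5%; the first and last bound the central 95%.
    cuts = quantiles(values, n=40, method="inclusive")
    return median(values), cuts[0], cuts[-1]


# Hypothetical AUCs from 101 trials, evenly spaced between 0.50 and 0.60.
trial_aucs = [0.50 + 0.001 * i for i in range(101)]
mid, lower, upper = summarize_trials(trial_aucs)
```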

Figure 3 shows the model results by each metric across subgroups. The Subgroup AUC was lower for males (median, 95% CI: 0.52, 0.49 to 0.55), Black/African Americans (0.50, 0.46 to 0.54), individuals from low income families (<$20,000, 0.46, 0.43 to 0.50), uninsured (0.45, 0.41 to 0.51), and unemployed (0.46, 0.42 to 0.50) individuals, compared to the median Subgroup AUC for each attribute (e.g. “Sex at Birth”) across trials. The BNSP AUC increased with age (from 0.50, 0.46 to 0.52 for 18 to 25 year olds, to 0.67, 0.62 to 0.73 for 65 to 74 year olds), but decreased with family income (from 0.60, 0.58 to 0.63 for individuals from <$20,000 income families, to 0.45, 0.42 to 0.48 for individuals from $100,000+ income families). Individuals who were White (0.49, 0.46 to 0.52), male (0.52, 0.49 to 0.55), insured (0.47, 0.43 to 0.50), employed (0.43, 0.41 to 0.45), or identified with an “Other” type of employment (0.55, 0.52 to 0.59) also had lower BNSP AUC, compared to the median BNSP AUC for each attribute.

**Fig. 3: Measuring bias in predicted depression risk.**

The BPSN AUC findings showed complementary trends: RH individuals who were older (e.g. 65 to 74, 0.46, 0.40 to 0.50), unemployed (0.38, 0.36 to 0.41), uninsured (0.47, 0.43 to 0.50), Black/African American (0.48, 0.45 to 0.50), female (0.52, 0.49 to 0.55), or from lower income families (e.g. <$20,000, 0.42, 0.39 to 0.44) had a lower BPSN AUC. Results were reasonably consistent across different types of models. Within-subgroup base rates (% of samples with PHQ-8 ≥ 10) were sometimes, but not always, associated with the BNSP/BPSN AUC, and subgroup sample size did not appear to be associated with the Subgroup AUC (see supplementary materials).

Isolating the effects of subgroup membership

We wished to account for intersectional identities (e.g. female and employed) and isolate the effect of subgroup membership on model underperformance. For an ideal classifier, the predicted risk would be low for RH subgroups, and high for CSD subgroups. In addition, we would expect subgroups with higher base rates (% of samples with PHQ-8 ≥ 10) to have a higher average predicted risk. We thus modeled expected differences from subgroups with either the lowest (for RH) or highest (for CSD) average risk across trials. Generalized estimating equations (GEE, exchangeable correlation structure)⁴⁰, a type of linear regression, were used to estimate the average effect of subgroup membership on the predicted risk after controlling for all other attributes. GEE were used instead of ordinary linear regression to correct for the non-independence of repeated samples across trials⁴⁰.

The regression results can be found in Fig. 4. The RH individuals with the lowest average predicted risk were 18 to 25 years old, male, White, had a family income of $100,000+, were insured, and employed. The predicted risk was expected to be higher than these subgroups (95% CI lower-bound >0) for RH individuals who were older than 34 (e.g. for 65 to 74 year olds, mean, 95% CI: 0.02, 0.01 to 0.04), identified as Asian/Asian American (0.02, 0.01 to 0.03), Black/African American (0.01, 0.00 to 0.01), came from <$60,000 income families (e.g. for <$20,000, 0.02, 0.01 to 0.03), were unemployed (0.03, 0.03 to 0.04), and/or on disability (0.01, 0.00 to 0.02).

**Fig. 4: Isolating subgroups where models underperformed.**

For individuals who were experiencing CSD, models predicted the highest average risk for 65 to 74 year olds, females, Asian/Asian Americans, individuals who came from families with incomes of $20,000 to $39,999, were insured, and/or retired. The predicted risk for individuals experiencing CSD was expected to be lower (95% CI upper-bound <0) if individuals were 18 to 25 (–0.02, –0.04 to –0.01), male (–0.01, –0.02 to –0.00), Black/African American (–0.02, –0.03 to –0.00), more than one race (–0.02, –0.03 to –0.00), White (–0.02, –0.03 to –0.01), came from any family with an annual income <$20,000 or ≥$40,000 (e.g. $100,000+, –0.03, –0.03 to –0.02), and/or were employed (–0.02, –0.03 to –0.01). Predicted risk distributions often overlapped across subgroups with higher or lower risk, though there were general trends across subgroups (e.g. the predicted risk increased with age and unemployment in RH individuals, and risk decreased with income level for both CSD and RH individuals, see Fig. 4 for more details).

Interpreting sensed-behaviors

We hypothesized that models underperformed because sensed-behaviors predictive of CSD were inconsistent across subgroups. We thus compared how AI tools predicted CSD risk with the relationships between sensed-behaviors and CSD observed within each subgroup. First, we retrained the AI model on the entire dataset and used Shapley additive explanations (SHAP)⁴¹ to interpret how the AI tool predicted CSD risk from sensed-behaviors. We then compared SHAP values with coefficients from explanatory logistic regression models estimating how subgroup membership affected the relationship between each sensed-behavior and depression.

We found different relationships between the SHAP values (Fig. 5a) and sensed-behaviors associated with CSD across subgroups (Fig. 5b; comparisons across each attribute and feature can be found in the supplementary materials). For example, the AI tool predicted that higher morning phone usage (6AM–12PM) was generally associated with lower predicted depression risk. Higher morning phone usage decreased depression risk for 18 to 25 year olds (mean, 95% CI effect on depression, standardized units: –0.77, –1.07 to –0.47), but increased risk for 65 to 74 year olds (0.60, 0.07 to 1.12). Younger individuals, overall, also had higher morning phone use (standardized median, 95% CI 18 to 25 year olds: 0.32, –2.27 to 1.60) compared to older individuals (65 to 74 year olds: –0.62, –1.96 to 0.76).

**Fig. 5: Interpreting the relationships between sensed-behaviors and depression.**

Figure 5a also shows that specific mobility features, including circadian movement (regularity in 24 hour movement), location entropy (regularity in travel to unique locations), and the percentage of collected GPS samples in transition (approximated speed >1 km/h), were often associated with lower predicted CSD risk. Circadian movement decreased CSD risk for employed individuals (–0.16, –0.24 to –0.07), but increased CSD risk for individuals who were on disability (0.44, 0.21 to 0.66). Circadian movement and location entropy also decreased depression risk for individuals from middle income ($60,000 to $99,999) families (circadian movement: –0.21, –0.35 to –0.07; location entropy: –0.34, –0.48 to –0.20), but increased risk for individuals from low income (<$20,000) families (circadian movement: 0.30, 0.09 to 0.51; location entropy: 0.35, 0.14 to 0.57). Finally, a higher percentage of GPS samples in transition decreased depression risk for insured individuals (–0.15, –0.22 to –0.08), but increased risk for uninsured individuals (0.32, 0.11 to 0.52).
