April 11, 2026
Measuring algorithmic bias to analyze the reliability of AI tools that predict depression risk using smartphone sensed-behavioral data

Data collection

We analyzed data from a U.S.-based, NIMH-funded study conducted from 2019–2021 to identify associations between behavioral data collected from smartphones and depression symptoms3,25,26,27,28,29. Smartphone sensed-behavioral data on GPS location, phone usage (timestamp of screen unlock), and sleep were collected near-continuously from participants across the United States for 16 weeks. The PHQ-8, a self-reported measure of two-week depression symptoms33,34 frequently used in mental health research3,5,25,27, was administered multiple times a week every three weeks (on weeks 1, 4, 7, …, known as weekly reporting periods). Sensed-behaviors were summarized over two weeks to align with collected PHQ-8 depression symptoms for prediction (see Table 1). For example, sensed-behaviors collected during weeks 3 and 4 were summarized to predict PHQ-8 responses collected during week 4.

Table 2 summarizes the data used for analysis. We analyzed 3900 samples from 650 individuals, a large cohort and sample size compared to most studies to date analyzing associations between sensed-behaviors and mental health4,5,25,35,36. A sample was a set of sensed-behaviors, summarized over 2 weeks, corresponding to the average PHQ-8 response collected during a single weekly reporting period. 46% of the average self-reported PHQ-8 values were ≥10, indicating clinically-significant depression (CSD)33. The majority of participants were relatively young to middle aged (75% were 25 to 54 years old), female (74%), white (82%), middle to high income (61% had an annual family income ≥$40,000), insured (93%), and employed (62%). We focused our results on subgroups with at least 15 participants37. The sensed-behavior distributions across the population for each subgroup can be found in the supplementary materials.

Identifying subgroups where AI models underperform

The PHQ-8 asked participants to self-report depression symptoms experienced over 14 days, and PHQ-8s were delivered multiple times throughout each weekly reporting period. We trained AI models using 14 days of smartphone sensed-behavioral data to predict if the average PHQ-8 value across each weekly reporting period (days 7 through 14, see Fig. 1c) indicated clinically-significant depression symptoms (CSD, PHQ-8 score ≥10)33. While the PHQ-8 asks participants to self-report two-week depression symptoms, studies suggest that individual assessments may suffer from recency bias38 or indicate “briefly” elevated depression symptoms39. For this reason, PHQ-8 values were averaged over each weekly reporting period to predict a more stable estimate of self-reported symptoms.
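The averaging and labeling step described above can be sketched in a few lines. This is a minimal illustration, not the study's code: the function name and the example scores are hypothetical, and only the threshold of 10 comes from the text.

```python
from statistics import mean

CSD_THRESHOLD = 10  # PHQ-8 cutoff for clinically-significant depression (from the text)


def label_reporting_period(phq8_scores):
    """Average the PHQ-8 totals from one weekly reporting period and label it.

    Returns (average_score, is_csd), where is_csd is True when the average
    is at or above the threshold of 10.
    """
    if not phq8_scores:
        raise ValueError("at least one PHQ-8 response is required")
    average = mean(phq8_scores)
    return average, average >= CSD_THRESHOLD


# Hypothetical responses from a single reporting period.
average, is_csd = label_reporting_period([12, 8, 11])
```

In this example a single response of 8 falls below the cutoff, but the period average does not, which is the kind of brief fluctuation the averaging is meant to smooth over.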

Model performance was assessed by performing 5-fold cross-validation, partitioning on subjects, and predictions across folds were concatenated to calculate model performance. Similar to prior work4,6, within each cross-validation split, models were trained using data collected from 80% of the participants (520 participants), and the trained model was applied to predict CSD in the remaining 20% (130 participants). To analyze performance variability due to specific cross-validation splits, we performed 100 cross-validation trials, shuffling participants into different folds during each trial.
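A subject-wise split of this kind might be set up as follows. The sketch assumes integer participant IDs and uses the trial index as the shuffle seed; both are illustrative choices, and the helper names are hypothetical rather than taken from the study.

```python
import random


def subject_folds(participant_ids, n_folds=5, seed=0):
    """Shuffle participants and deal them into n_folds disjoint folds."""
    ids = list(participant_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]


def train_test_splits(participant_ids, n_folds=5, seed=0):
    """Yield (train_ids, test_ids) pairs, one per fold, split by participant."""
    folds = subject_folds(participant_ids, n_folds, seed)
    for k, test_ids in enumerate(folds):
        train_ids = [pid for j, fold in enumerate(folds) if j != k for pid in fold]
        yield train_ids, test_ids


participants = range(650)  # hypothetical integer participant IDs
# One trial = one shuffle; a different seed per trial gives different folds.
trials = [list(train_test_splits(participants, seed=trial)) for trial in range(100)]
```

Because the split is made on participants rather than samples, all samples from one person land on the same side of the train/test boundary, so a model is never evaluated on someone it was trained on.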

AI models output a predicted risk score from 0–1 of experiencing CSD. We used the predicted risk to calculate common ranking bias metrics20,21,22 (Fig. 2) across the subgroups in Table 2. These metrics were based upon the area under the receiver operating characteristic curve (AUC), which measured the probability that models ranked CSD samples higher (in the predicted risk) than relatively healthy (RH, PHQ-8 < 10) samples. We first calculated the AUC within each subgroup (the “Subgroup AUC”). Note that equal Subgroup AUCs do not guarantee a high AUC across the entire sample. For example, Fig. 2a shows simulated data where an algorithm correctly predicted CSD risk within subgroups, but younger individuals, compared to older individuals, have a higher overall predicted risk. Thus, across subgroups, healthy younger individuals may be incorrectly predicted to be at higher risk than older individuals experiencing CSD. Two additional performance metrics assessed such errors. Specifically, the background-negative-subgroup-positive, or BNSP AUC (Fig. 2b), measured the probability that individuals experiencing CSD (the “positive” label) from a subgroup were correctly predicted to have higher risk than RH (the “negative” label) individuals from other subgroups (“the background”), and the background-positive-subgroup-negative, or BPSN AUC (Fig. 2c), measured the probability that RH individuals from a subgroup were correctly predicted to have lower risk than background individuals experiencing CSD.

Fig. 2: Measuring algorithmic ranking bias.

We considered three metrics from prior work to assess algorithmic ranking bias20,21,22. The predicted risk is the probability, output by the AI tool, that individuals were experiencing clinically-significant depression (CSD). Histograms show simulated example predictions from an AI tool, describing the count of individuals (y-axis) who fell into a predicted risk bin (x-axis). Colors indicate individuals experiencing CSD (orange) versus RH (light-blue). Plots are split by age subgroups (younger/older). The AUC is the area under the receiver operating characteristic curve. The red and dark-blue boxes, and corresponding text color below each plot, highlight the subgroups compared for each metric. a The high Subgroup AUCs show that the predicted risk for individuals experiencing CSD was greater than the predicted risk for relatively healthy (RH) individuals within both age subgroups. However, this AI tool was biased to predict higher risk for younger individuals, overall, than older individuals. This bias is quantified using the (b) Background-Negative-Subgroup-Positive (BNSP) AUC and (c) Background-Positive-Subgroup-Negative (BPSN) AUC, which respectively show that younger individuals with CSD (“positive samples”) were correctly ranked higher (high BNSP) than RH (“negative samples”) samples from all other subgroups (older individuals, the “background”), but RH younger individuals were incorrectly ranked higher (low BPSN) than background samples with CSD. Older individuals show the complementary result (low BNSP, high BPSN). This bias reduces the model AUC when measured across the entire sample (assuming an equal number of older and younger individuals, AUC = 0.75), compared to the AUC in each subgroup (1.00).
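The three metrics can be reproduced on a toy version of the scenario in Fig. 2. The sketch below is illustrative only: the four predicted risks are made up, the function names are hypothetical, and the AUC is computed by direct pairwise comparison rather than with any particular library.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


def ranking_bias_metrics(samples, subgroup):
    """samples: list of (group, is_csd, predicted_risk) tuples."""
    def scores(in_subgroup, is_csd):
        return [r for g, y, r in samples if (g == subgroup) == in_subgroup and y == is_csd]

    return {
        # CSD vs RH, both inside the subgroup
        "subgroup_auc": auc(scores(True, True), scores(True, False)),
        # subgroup CSD vs background RH
        "bnsp_auc": auc(scores(True, True), scores(False, False)),
        # background CSD vs subgroup RH
        "bpsn_auc": auc(scores(False, True), scores(True, False)),
    }


# Made-up risks: younger individuals score higher overall than older ones.
samples = [
    ("younger", True, 0.9), ("younger", False, 0.6),
    ("older", True, 0.4), ("older", False, 0.1),
]
younger = ranking_bias_metrics(samples, "younger")
older = ranking_bias_metrics(samples, "older")
overall = auc([r for _, y, r in samples if y], [r for _, y, r in samples if not y])
```

With these values both Subgroup AUCs are 1.0, the younger subgroup has a BNSP AUC of 1.0 and a BPSN AUC of 0.0, the older subgroup shows the reverse, and the overall AUC is 0.75, matching the pattern described in the caption.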

The highest performing AI model (a random forest, 100 trees, max depth of 10, balanced class weights, see methods) achieved a median (95% confidence interval, CI) AUC of 0.55 (0.54 to 0.57) across trials. Note that this low AUC was expected: it is comparable to the cross-validation performance of similar depression symptom prediction tools developed in larger, more diverse populations4,6,13, and motivates the objective of this work to study the reliability of these tools in larger populations.

Figure 3 shows the model results by each metric across subgroups. The Subgroup AUC was lower for males (median, 95% CI: 0.52, 0.49 to 0.55), Black/African Americans (0.50, 0.46 to 0.54), individuals from low income families (<$20,000, 0.46, 0.43 to 0.50), uninsured (0.45, 0.41 to 0.51), and unemployed (0.46, 0.42 to 0.50) individuals, compared to the median subgroup AUC for each attribute (e.g. “Sex at Birth”) across trials. The BNSP AUC increased with age (from 0.50, 0.46 to 0.52 for 18 to 25 year olds, to 0.67, 0.62 to 0.73 for 65 to 74 year olds), but decreased with family income (from 0.60, 0.58 to 0.63 for individuals from <$20,000 income families, to 0.45, 0.42 to 0.48 for individuals from $100,000+ income families). Individuals who were White (0.49, 0.46 to 0.52), male (0.52, 0.49 to 0.55), insured (0.47, 0.43 to 0.50), employed (0.43, 0.41 to 0.45), or identified with an “Other” type of employment (0.55, 0.52 to 0.59) also had lower BNSP AUC, compared to the median BNSP AUC for each attribute.

Fig. 3: Measuring bias in predicted depression risk.

Bias was assessed by measuring the area under the receiver operating characteristic curve comparing positive (clinically-significant depression, CSD) and negative (relatively healthy, RH) samples within subgroups (Subgroup AUC, left column), subgroup positive samples to negative samples from all other subgroups, called “the background” (background-negative-subgroup-positive, or BNSP AUC, middle column), and subgroup negative samples to background positive samples (background-positive-subgroup-negative, or BPSN AUC, right column)20,22. Point values indicate the median value across trials. Error bars show 95% confidence intervals (2.5 and 97.5 percentiles). Dotted lines and shaded areas show the distribution (median and 95% confidence intervals) of either the median (if >2 subgroups) or highest performing subgroups across trials.
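A percentile-based summary of this kind can be computed directly from the per-trial values. The sketch below uses Python's standard library and linear interpolation between data points; the interpolation rule used in the study is not stated, so that choice, the function name, and the example values are assumptions.

```python
from statistics import median, quantiles


def median_and_ci(trial_values):
    """Return (median, lower, upper) using the 2.5th and 97.5th percentiles."""
    # n=40 yields cut points every 2.5%; the first and last are the CI bounds.
    cuts = quantiles(trial_values, n=40, method="inclusive")
    return median(trial_values), cuts[0], cuts[-1]


# Hypothetical AUCs from 101 trials, evenly spaced for illustration.
trial_aucs = [0.50 + 0.001 * i for i in range(101)]
mid, lower, upper = median_and_ci(trial_aucs)
```

Applied to a subgroup's metric across the 100 cross-validation trials, this gives the point value and error-bar endpoints in the form "median, lower to upper" used throughout the results.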

The BPSN AUC findings showed complementary trends: RH individuals who were older (e.g. 65 to 74, 0.46, 0.40 to 0.50), unemployed (0.38, 0.36 to 0.41), uninsured (0.47, 0.43 to 0.50), Black/African American (0.48, 0.45 to 0.50), female (0.52, 0.49 to 0.55), or from lower income families (e.g. <$20,000, 0.42, 0.39 to 0.44) had a lower BPSN AUC. Results were reasonably consistent across different types of models. Within-subgroup base rates (% samples with PHQ-8 ≥ 10) were sometimes, but not always, associated with the BNSP/BPSN AUC, and subgroup sample size did not appear to be associated with the Subgroup AUC (see supplementary materials).

Isolating the effects of subgroup membership

We wished to account for intersectional identities (e.g. female and employed) and isolate the effect of subgroup membership on model underperformance. For an ideal classifier, the predicted risk would be low for RH subgroups, and high for CSD subgroups. In addition, we would expect subgroups with higher base rates (% of samples with PHQ-8 ≥ 10) to have a higher average predicted risk. We thus modeled expected differences from subgroups with either the lowest (for RH) or highest (for CSD) average risk across trials. Generalized estimating equations (GEE, exchangeable correlation structure)40, a type of linear regression, were used to estimate the average effect of subgroup membership on the predicted risk after controlling for all other attributes. GEE was used instead of linear regression to correct for the non-independence of repeated samples across trials40.
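One way to write down a model of this form is given below. The notation is an illustrative assumption rather than the authors' own: \hat{r}_{ij} is the predicted risk for repetition j in cluster i (assumed here to be the same sample observed across trials), x_{ij,a} is the subgroup for attribute a, and ref(a) is the reference subgroup for that attribute (the one with the lowest average risk for RH, or the highest for CSD).

```latex
% Mean model: identity link, one indicator per non-reference subgroup.
\mathbb{E}\left[\hat{r}_{ij}\right]
  = \beta_0 + \sum_{a} \sum_{k \neq \mathrm{ref}(a)} \beta_{a,k}\,
    \mathbb{1}\left[x_{ij,a} = k\right]

% Exchangeable working correlation within cluster i.
\mathrm{Corr}\left(\hat{r}_{ij}, \hat{r}_{ij'}\right) = \alpha
  \quad \text{for } j \neq j'
```

Under this reading, each coefficient \beta_{a,k} is the expected difference in predicted risk between subgroup k and the reference subgroup of attribute a, holding the other attributes fixed, and a single parameter \alpha captures the correlation between repeated observations in a cluster.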

The regression results can be found in Fig. 4. The RH individuals with the lowest average predicted risk were 18 to 25 years old, male, White, had a family income of $100,000+, were insured, and were employed. The predicted risk was expected to be higher than these subgroups (95% CI lower-bound >0) for RH individuals who were older than 34 (e.g. for 65 to 74 year olds, mean, 95% confidence interval: 0.02, 0.01 to 0.04), identified as Asian/Asian American (0.02, 0.01 to 0.03), Black/African American (0.01, 0.00 to 0.01), came from <$60,000 income families (e.g. for <$20,000, 0.02, 0.01 to 0.03), were unemployed (0.03, 0.03 to 0.04), and/or on disability (0.01, 0.00 to 0.02).

Fig. 4: Isolating subgroups where models underperformed.

For an ideal classifier, the predicted risk would be low for relatively healthy (RH) individuals, and high for individuals with clinically-significant depression (CSD). We thus modeled expected differences from the subgroups with either the lowest (for RH, left) or highest (for CSD, right) average predicted risk across trials. Subgroup effects were calculated using generalized estimating equations (GEE)40, a type of linear model, to analyze the average effect of subgroup membership on the predicted risk, controlling across all attributes. GEE accounted for the non-independence of repeated samples across trials40. Separate regression models were created for each outcome (RH, CSD) to remove the effects of the subgroup base rate. Points represent the GEE coefficient (expected effect), and error bars are 95% confidence intervals around the estimated effect. Dotted vertical lines highlight an expected subgroup effect of 0.

For individuals who were experiencing CSD, models predicted the highest average risk for 65 to 74 year olds, females, Asian/Asian Americans, individuals who came from families with incomes of $20,000 to $39,999, were insured, and/or retired. The predicted risk for individuals experiencing CSD was expected to be lower (95% CI upper-bound <0) if individuals were 18 to 25 (–0.02, –0.04 to –0.01), male (–0.01, –0.02 to –0.00), Black/African American (–0.02, –0.03 to –0.00), more than one race (–0.02, –0.03 to –0.00), White (–0.02, –0.03 to –0.01), came from any family with an annual income <$20,000 or ≥$40,000 (e.g. $100,000+, –0.03, –0.03 to –0.02), and/or were employed (–0.02, –0.03 to –0.01). Predicted risk distributions often overlapped across subgroups with higher or lower risk, though there were general trends across subgroups (e.g. the predicted risk increased with age and unemployment in RH individuals, and risk decreased with income level for both CSD and RH individuals, see Fig. 4 for more details).

Interpreting sensed-behaviors

We hypothesized that models underperformed because sensed-behaviors predictive of CSD were inconsistent across subgroups. We thus conducted an analysis to understand differences between how AI tools predicted CSD risk and the different relationships between sensed-behaviors and CSD across subgroups. First, we retrained the AI model on the entire dataset, and used Shapley additive explanations (SHAP)41 to interpret how the AI tool predicted CSD risk from sensed-behaviors. We then compared SHAP values with coefficients from explanatory logistic regression models estimating how subgroup membership affected the relationship between each sensed-behavior and depression.

We found different relationships between the SHAP values (Fig. 5a) and sensed-behaviors associated with CSD across subgroups (Fig. 5b; comparisons across each attribute and feature can be found in the supplementary materials). For example, the AI tool predicted that higher morning phone usage (6 AM–12 PM) was generally associated with lower predicted depression risk. Higher morning phone usage decreased depression risk for 18 to 25 year olds (mean, 95% CI effect on depression, standardized units: –0.77, –1.07 to –0.47), but increased risk for 65 to 74 year olds (0.60, 0.07 to 1.12). Younger individuals, overall, also had higher morning phone use (standardized median, 95% CI, 18 to 25 year olds: 0.32, –2.27 to 1.60) compared to older individuals (65 to 74 year olds: –0.62, –1.96 to 0.76).

Fig. 5: Interpreting the relationships between sensed-behaviors and depression.

a Shapley additive explanations (SHAP)41 were used to interpret how the AI tool predicted depression risk using sensed-behaviors. Sensed-behaviors are ordered, descending, on the y-axis by their average impact on the predicted risk (the “SHAP value”, x-axis). Only the top 10 sensed-behaviors with the highest average impact are listed, for space. Colors indicate whether a higher sensed-behavior “feature” value (red) is associated with higher or lower predicted risk. For example, higher average (“Avg”) phone unlocks from 6 AM–12 PM were generally associated with lower predicted risk. Averages and deviations summarize sensed-behaviors over 14 days (see Fig. 1c). b Example coefficients (β, 95% CI, standardized units) from explanatory logistic regression models estimating the associations between sensed-behaviors and depression across subgroups, as well as the median and 95% CI of the sensed-behavior distribution. Full coefficients and statistics can be found in the supplementary materials.

Figure 5a also shows that specific mobility features, including the circadian movement (regularity in 24 hour movement), location entropy (regularity in travel to unique locations), and the percentage of collected GPS samples in transition (approximated speed >1 km/h), were often associated with lower predicted CSD risk. Circadian movement decreased CSD risk for employed individuals (–0.16, –0.24 to –0.07), but increased CSD risk for individuals who were on disability (0.44, 0.21 to 0.66). Circadian movement and location entropy also decreased depression risk for individuals from middle income ($60,000 to $99,999) families (circadian movement: –0.21, –0.35 to –0.07; location entropy: –0.34, –0.48 to –0.20), but increased risk for individuals from low income (<$20,000) families (circadian movement: 0.30, 0.09 to 0.51; location entropy: 0.35, 0.14 to 0.57). Finally, a higher percentage of GPS samples in transition decreased depression risk for insured individuals (–0.15, –0.22 to –0.08), but increased risk for uninsured individuals (0.32, 0.11 to 0.52).
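To make the "samples in transition" feature concrete, the sketch below estimates speed between consecutive GPS fixes and reports the percentage above 1 km/h. Only the threshold comes from the text; the haversine distance, the fix format, the function names, and the coordinates are assumptions for illustration, and the study's feature pipeline may differ.

```python
import math

EARTH_RADIUS_KM = 6371.0
TRANSITION_SPEED_KMH = 1.0  # threshold quoted in the text


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def percent_in_transition(fixes):
    """fixes: time-ordered list of (hours, lat, lon). Returns a percentage.

    Each fix after the first is labeled by its speed since the previous fix.
    """
    moving = 0
    labeled = 0
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(fixes, fixes[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order timestamps
        labeled += 1
        if haversine_km(lat0, lon0, lat1, lon1) / elapsed > TRANSITION_SPEED_KMH:
            moving += 1
    return 100.0 * moving / labeled if labeled else 0.0


# Made-up fixes, one every half hour.
fixes = [
    (0.0, 40.0000, -75.0000),
    (0.5, 40.0000, -75.0000),  # stationary
    (1.0, 40.0200, -75.0000),  # roughly 2.2 km in half an hour
    (1.5, 40.0200, -75.0000),  # stationary again
    (2.0, 40.0201, -75.0000),  # roughly 11 m of drift
]
pct = percent_in_transition(fixes)
```

Here one of the four labeled fixes exceeds the threshold, so the feature value is 25%. Small position drift stays below 1 km/h and is not counted as movement.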
