Computer Vision · April 5, 2025 · 6 min read

Boosting Photo App Retention with Face Detection and Clustering

Using DSFD, InsightFace, and DBSCAN to surface the photos people actually care about — and how it moved 30-day retention by up to 150 basis points.

Computer Vision · InsightFace · Clustering · Retention · AWS

The Hypothesis

People don't care about all their photos. They care about photos of people they love. That was the bet behind our retention work for AT&T Photos and Verizon Photos: if we surface the photos that matter most — the ones with familiar faces — users will come back more often.

Detection: DSFD

We needed a face detector that handles real-world photo library conditions: small faces in group shots, side profiles, partial occlusion, wildly varying lighting. We went with Dual Shot Face Detector (DSFD) for its multi-scale feature processing. A single family photo might contain faces ranging from 20 pixels to 200 pixels wide — DSFD handles this range without needing separate detection passes.
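A minimal sketch of that detection step, assuming a DSFD implementation wrapped behind a detector.detect() call (for example one of the open-source DSFD inference packages); the constructor and the exact output format are assumptions and will vary by wrapper:

```python
# Detection-stage sketch. `detector` stands in for whatever DSFD wrapper is
# used; the assumed return value is an (N, 5) array of
# [xmin, ymin, xmax, ymax, confidence] per detected face.
import cv2
import numpy as np

def detect_faces(image_path: str, detector, min_width_px: int = 20, min_score: float = 0.5):
    """Run DSFD once on a photo and keep usable face crops."""
    image = cv2.imread(image_path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    detections = detector.detect(image)  # one pass covers ~20px to ~200px faces

    crops = []
    for xmin, ymin, xmax, ymax, score in detections:
        if score >= min_score and (xmax - xmin) >= min_width_px:
            crops.append(image[int(ymin):int(ymax), int(xmin):int(xmax)])
    return crops
```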

Recognition: InsightFace

Once faces are detected, InsightFace (trained with ArcFace loss) generates 512-dimensional embeddings. Three properties mattered for our use case:

  • High inter-class variance — different people need to land far apart in embedding space
  • Low intra-class variance — the same person across different photos needs to cluster tightly
  • Robustness to aging — family photo libraries span years, and children's faces change dramatically
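As a rough sketch of the embedding step, here is what it looks like with the public insightface Python package; the "buffalo_l" model bundle and attribute names are assumptions based on the package's documented interface, and in production the face crops came from DSFD rather than the bundled detector.

```python
# Embedding-stage sketch using the public insightface package. The bundle
# "buffalo_l" includes an ArcFace-trained recognition model producing
# 512-dimensional embeddings.
import numpy as np
from insightface.app import FaceAnalysis

app = FaceAnalysis(name="buffalo_l")
app.prepare(ctx_id=0, det_size=(640, 640))   # ctx_id=-1 to run on CPU

def embed_faces(image_bgr: np.ndarray) -> np.ndarray:
    """Return one L2-normalized 512-d embedding per face found in the photo."""
    faces = app.get(image_bgr)
    if not faces:
        return np.empty((0, 512), dtype=np.float32)
    return np.stack([face.normed_embedding for face in faces])
```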

We fine-tuned on a diverse dataset to improve accuracy on underrepresented demographics. When your product serves millions of people, equitable performance isn't a nice-to-have.

Clustering: DBSCAN

With embeddings computed, DBSCAN was the natural clustering choice:

  • No cluster count required — we don't know how many people are in someone's library
  • Noise handling — not every face belongs to a meaningful cluster (strangers in backgrounds, faces on magazine covers)
  • Density-based grouping — naturally handles the distribution where some people appear in 500 photos and others in 5
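A minimal clustering sketch with scikit-learn's DBSCAN: because the embeddings are L2-normalized, cosine distance is a natural metric. The eps and min_samples values below are illustrative, not our production settings.

```python
# Group face embeddings into per-person clusters with DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_faces(embeddings: np.ndarray, eps: float = 0.35, min_samples: int = 3) -> np.ndarray:
    """Return a cluster label per embedding; -1 marks noise (background
    strangers, faces on magazine covers) that never becomes a People album."""
    clustering = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine")
    return clustering.fit_predict(embeddings)
```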

Tuning epsilon was the critical knob. Too low and you split one person into multiple clusters; too high and you merge different people. We built a validation set from manually labeled photo libraries and found the sweet spot empirically.
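The sweep itself is straightforward once labeled libraries exist. A sketch, assuming ground-truth person IDs per face are available as true_labels, scored here with adjusted Rand index as one reasonable choice:

```python
# Sweep eps against a manually labeled validation library and report a
# clustering-quality score for each value.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score

def sweep_eps(embeddings, true_labels, eps_grid=np.arange(0.20, 0.60, 0.05)):
    """Return (eps, ARI) pairs so the sweet spot can be picked empirically."""
    results = []
    for eps in eps_grid:
        pred = DBSCAN(eps=float(eps), min_samples=3, metric="cosine").fit_predict(embeddings)
        results.append((round(float(eps), 2), adjusted_rand_score(true_labels, pred)))
    return results
```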

Product Integration

The face clusters powered three user-facing features:

  1. People albums — auto-generated, grouped by person, user-nameable
  2. Memories with faces — "This day last year" surfaces photos of recognized people, not random screenshots
  3. Smart notifications — "You have 12 new photos of [person]" instead of generic "Back up your photos" nudges

Impact

App            | 30-Day Retention Improvement
AT&T Photos    | +30 bps
Verizon Photos | +150 bps

The gap between AT&T and Verizon came down to baseline engagement — Verizon's user base had more room to move.

In a subscription business with millions of users, 150 basis points compounds dramatically. Each retained user represents years of subscription revenue. The infrastructure cost of the CV pipeline was a rounding error compared to the retained lifetime value.
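As a back-of-envelope illustration, 150 basis points is a 1.5 percentage-point lift in 30-day retention, and the retained lifetime value follows directly. All inputs below are hypothetical placeholders, not AT&T or Verizon figures.

```python
# Hypothetical numbers only: convert a retention lift in basis points into
# incremental retained users and lifetime revenue.
def incremental_ltv(monthly_active_users: int, lift_bps: int,
                    monthly_arpu: float, avg_sub_months: int) -> float:
    extra_retained = monthly_active_users * (lift_bps / 10_000)  # 150 bps = 1.5%
    return extra_retained * monthly_arpu * avg_sub_months

# e.g. 5M users, +150 bps, $5/month ARPU, 24-month average subscription
print(f"${incremental_ltv(5_000_000, 150, 5.0, 24):,.0f} in retained revenue")
```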

What Surprised Us

Detection is the bottleneck, not recognition. A missed face can never be embedded or clustered. We spent 60% of optimization time on detection quality — recall mattered more than precision here.

Privacy by design wasn't optional. Face embeddings are stored on-device. We never transmit raw face data to servers. Clustering runs locally via Core ML and TFLite. This was a product requirement from day one, and it shaped every architectural decision downstream.

Clustering evaluation is genuinely hard. Standard metrics like NMI and ARI require ground truth labels. We built a lightweight labeling tool and manually validated cluster quality for a sample of users each week. There's no shortcut here — you have to look at the results.
Venkata Subramanian Srinivasan
Senior Data Scientist at Asurion | Georgia Tech Alumni