Product Building · April 7, 2026 · 6 min read

How I Built Netflix & What Now: An AI TV Companion

Point your phone at the TV and ask anything. How I built an AI companion that identifies shows from camera captures and answers contextual questions — using Gemini vision, TMDB, and voice input.

Gemini · AI Vision · Next.js · PWA · TMDB · Product

The Annoyance That Started It

You know the moment — you are flipping channels and land on something mid-scene. The show looks interesting, but you have no idea what it is. You could squint at the corner of the screen for a network logo, open Google, and try to describe what you are seeing. Or you could just point your phone at the TV.

That is exactly what Netflix & What Now does. Capture your TV screen, and AI identifies the show in under 2 seconds. Then ask any follow-up question — who is that actor, how many seasons are there, is it worth watching?

Architecture: Camera → Gemini → TMDB → Chat

The flow is straightforward:

  1. Camera capture — the phone's camera captures the TV screen
  2. Gemini vision — Google's Gemini model analyzes the image and identifies the show
  3. TMDB lookup — once identified, we pull rich metadata from The Movie Database (cast, ratings, seasons, similar shows)
  4. Contextual chat — all this context is fed to the AI, so follow-up questions get informed answers
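
Step 3, the TMDB lookup, is a plain REST call. A minimal sketch, assuming the v3 TV-search endpoint with an API-key query parameter (the helper names are mine, not the app's actual code):

```typescript
// Build the TMDB TV-search URL for an identified title.
// (Helper names are illustrative; the real app's code may differ.)
export function buildTmdbSearchUrl(apiKey: string, title: string): string {
  const params = new URLSearchParams({ api_key: apiKey, query: title });
  return `https://api.themoviedb.org/3/search/tv?${params}`;
}

// Fetch metadata for the identified show (first search result).
export async function lookupShow(apiKey: string, title: string) {
  const res = await fetch(buildTmdbSearchUrl(apiKey, title));
  if (!res.ok) throw new Error(`TMDB request failed: ${res.status}`);
  const data = await res.json();
  // Each result carries name, overview, vote_average, first_air_date, ...
  return data.results?.[0] ?? null;
}
```
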

The key insight was using Gemini's vision capability for identification rather than traditional image recognition. Gemini can read on-screen text, recognize actors, identify show graphics, and even interpret scene context — far more robust than matching against a database of screenshots.
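
The identification step itself is a single multimodal request. The app goes through the AI SDK; this dependency-free sketch calls Gemini's REST endpoint directly just to show the shape of the request (the model name and prompt wording are my assumptions):

```typescript
// Sketch of show identification against Gemini's REST API.
// The real app uses the AI SDK; model and prompt here are assumptions.
const GEMINI_URL =
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent";

// One text part (the question) plus the camera capture as inline image data.
export function buildIdentifyRequest(imageBase64: string) {
  return {
    contents: [
      {
        parts: [
          { text: "What TV show or movie is on this screen? Reply with only the title." },
          { inline_data: { mime_type: "image/jpeg", data: imageBase64 } },
        ],
      },
    ],
  };
}

export async function identifyShow(apiKey: string, imageBase64: string): Promise<string> {
  const res = await fetch(`${GEMINI_URL}?key=${apiKey}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildIdentifyRequest(imageBase64)),
  });
  const data = await res.json();
  // The first candidate's first text part holds the answer.
  return (data.candidates?.[0]?.content?.parts?.[0]?.text ?? "").trim();
}
```
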

Why PWA?

This is a phone-first experience. You are on the couch, phone in hand. A PWA made perfect sense:

  • Install to home screen — one tap, no app store
  • Full-screen experience — feels native
  • Fast loading — Next.js static shell loads instantly
  • No distribution friction — share a link, not an app store listing
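
In a Next.js App Router project, the installability piece is mostly a manifest. App Router serves a default export from `app/manifest.ts` as the web app manifest; a sketch with illustrative values (not the app's actual config):

```typescript
// app/manifest.ts -- Next.js serves this as the web app manifest,
// which is what enables "install to home screen". Values are illustrative.
export default function manifest() {
  return {
    name: "Netflix & What Now",
    short_name: "WhatNow",
    start_url: "/",
    display: "standalone",       // full-screen, no browser chrome
    background_color: "#000000", // matches the dark, cinematic UI
    theme_color: "#000000",
    icons: [{ src: "/icon-512.png", sizes: "512x512", type: "image/png" }],
  };
}
```
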

Voice Input: Two Paths

I wanted hands-free interaction (you are holding snacks, remember). The app supports two voice backends:

  • VocalBridge AI — premium speech-to-text with better accuracy
  • Web Speech API — free, built into every modern browser, good enough for most queries

Users choose based on their needs. No account required for either — the browser-native option works out of the box.
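
The choice-plus-fallback logic can be sketched as a small pure function (the names and exact fallback order are my assumptions, not the app's source):

```typescript
type VoiceBackend = "vocalbridge" | "web-speech" | "none";

// Resolve the user's preferred backend against what is actually usable.
// At runtime, hasWebSpeech would be something like
// "SpeechRecognition" in window || "webkitSpeechRecognition" in window.
export function resolveVoiceBackend(
  preferred: VoiceBackend,
  env: { hasVocalBridgeKey: boolean; hasWebSpeech: boolean }
): VoiceBackend {
  if (preferred === "vocalbridge" && env.hasVocalBridgeKey) return "vocalbridge";
  if (env.hasWebSpeech) return "web-speech"; // browser-native fallback, no account
  return "none"; // degrade gracefully to typed input
}
```
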

The BYOK Model

Instead of running my own API keys and dealing with billing, rate limiting, and abuse prevention, Netflix & What Now uses a bring-your-own-keys approach. Users enter their free Gemini and TMDB API keys, which are stored locally in the browser. Benefits:

  • Zero infrastructure cost — no API bills to manage
  • No account system — no auth, no database, no user management
  • Privacy — keys never touch a server I run; they stay in the browser and go only to Google and TMDB directly
  • Free for everyone — both Gemini and TMDB offer generous free tiers
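
The storage side of BYOK is tiny. A sketch with the storage interface injected so it runs outside a browser; in the app it would just be `localStorage` (the key names are illustrative):

```typescript
// Keys live only in browser storage; nothing is sent to an app server.
// Storage is injected here so the sketch is testable outside a browser.
interface KeyStore {
  getItem(k: string): string | null;
  setItem(k: string, v: string): void;
}

// Illustrative storage keys, not necessarily the app's own.
const GEMINI_KEY = "nwn:gemini-api-key";
const TMDB_KEY = "nwn:tmdb-api-key";

export function saveKeys(store: KeyStore, gemini: string, tmdb: string): void {
  store.setItem(GEMINI_KEY, gemini);
  store.setItem(TMDB_KEY, tmdb);
}

export function loadKeys(store: KeyStore): { gemini: string; tmdb: string } | null {
  const gemini = store.getItem(GEMINI_KEY);
  const tmdb = store.getItem(TMDB_KEY);
  // Both keys are required before any API call can be made.
  return gemini && tmdb ? { gemini, tmdb } : null;
}
```
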

Tech Stack

  • Next.js with App Router for the framework
  • Google Gemini via the AI SDK for vision and chat
  • TMDB API for show metadata
  • Tailwind CSS + shadcn/ui for a cinematic Netflix-inspired dark UI
  • PWA with service worker for installability
  • Vercel for hosting

What I Learned

  1. Gemini vision is remarkably good at identifying TV content from phone camera captures — even at angles, with reflections, or from across the room.
  2. BYOK removes an entire layer of complexity — no billing, no rate limiting, no abuse. Users invest 2 minutes getting API keys and get a free tool forever.
  3. PWAs are underrated for phone-first tools. The install experience is smooth, and you avoid the app store tax entirely.
  4. Voice input adds more value than expected — once people start talking to their phone while watching TV, they do not want to go back to typing.

The project is open source. Try it at netflix-and-what-now.vercel.app or check out the source on GitHub.

Venkata Subramanian Srinivasan
Senior Data Scientist at Asurion | Georgia Tech Alumni