Product Building · April 7, 2026 · 6 min read

How I Built Netflix & What Now: An AI TV Companion

Point your phone at the TV and ask anything. How I built an AI companion that identifies shows from camera captures and answers contextual questions — using Gemini vision, TMDB, and voice input.

Gemini · AI Vision · Next.js · PWA · TMDB · Product

The Annoyance That Started It

You know the moment — you are flipping channels and land on something mid-scene. The show looks interesting, but you have no idea what it is. You could squint at the corner of the screen for a network logo, open Google, and try to describe what you are seeing. Or you could just point your phone at the TV.

That is exactly what Netflix & What Now does. Capture your TV screen, and AI identifies the show in under 2 seconds. Then ask any follow-up question — who is that actor, how many seasons are there, is it worth watching?

Architecture: Camera → Gemini → TMDB → Chat

The flow is straightforward:

  1. Camera capture — the phone's camera captures the TV screen
  2. Gemini vision — Google's Gemini model analyzes the image and identifies the show
  3. TMDB lookup — once identified, we pull rich metadata from The Movie Database (cast, ratings, seasons, similar shows)
  4. Contextual chat — all this context is fed to the AI, so follow-up questions get informed answers
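
Step 3, the TMDB lookup, is a plain REST call. A minimal sketch, assuming the v3 TV-search endpoint with an API-key query parameter (the helper names are mine, not the app's actual code):

```typescript
// Build the TMDB TV-search URL for an identified title.
// (Helper names are illustrative; the real app's code may differ.)
export function buildTmdbSearchUrl(apiKey: string, title: string): string {
  const params = new URLSearchParams({ api_key: apiKey, query: title });
  return `https://api.themoviedb.org/3/search/tv?${params}`;
}

// Fetch metadata for the identified show (first search result).
export async function lookupShow(apiKey: string, title: string) {
  const res = await fetch(buildTmdbSearchUrl(apiKey, title));
  if (!res.ok) throw new Error(`TMDB request failed: ${res.status}`);
  const data = await res.json();
  // Each result carries name, overview, vote_average, first_air_date, ...
  return data.results?.[0] ?? null;
}
```
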

The key insight was using Gemini's vision capability for identification rather than traditional image recognition. Gemini can read on-screen text, recognize actors, identify show graphics, and even interpret scene context — far more robust than matching against a database of screenshots.
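
The identification step itself is a single multimodal request. The app goes through the AI SDK; this dependency-free sketch calls Gemini's REST endpoint directly just to show the shape of the request (the model name and prompt wording are my assumptions):

```typescript
// Sketch of show identification against Gemini's REST API.
// The real app uses the AI SDK; model and prompt here are assumptions.
const GEMINI_URL =
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent";

// One text part (the question) plus the camera capture as inline image data.
export function buildIdentifyRequest(imageBase64: string) {
  return {
    contents: [
      {
        parts: [
          { text: "What TV show or movie is on this screen? Reply with only the title." },
          { inline_data: { mime_type: "image/jpeg", data: imageBase64 } },
        ],
      },
    ],
  };
}

export async function identifyShow(apiKey: string, imageBase64: string): Promise<string> {
  const res = await fetch(`${GEMINI_URL}?key=${apiKey}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildIdentifyRequest(imageBase64)),
  });
  const data = await res.json();
  // The first candidate's first text part holds the answer.
  return (data.candidates?.[0]?.content?.parts?.[0]?.text ?? "").trim();
}
```
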

Why PWA?

This is a phone-first experience. You are on the couch, phone in hand. A PWA made perfect sense:

  • Install to home screen — one tap, no app store
  • Full-screen experience — feels native
  • Fast loading — Next.js static shell loads instantly
  • No distribution friction — share a link, not an app store listing
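
In a Next.js App Router project, the installability piece is mostly a manifest. App Router serves a default export from `app/manifest.ts` as the web app manifest; a sketch with illustrative values (not the app's actual config):

```typescript
// app/manifest.ts -- Next.js serves this as the web app manifest,
// which is what enables "install to home screen". Values are illustrative.
export default function manifest() {
  return {
    name: "Netflix & What Now",
    short_name: "WhatNow",
    start_url: "/",
    display: "standalone",       // full-screen, no browser chrome
    background_color: "#000000", // matches the dark, cinematic UI
    theme_color: "#000000",
    icons: [{ src: "/icon-512.png", sizes: "512x512", type: "image/png" }],
  };
}
```
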

Voice Input: Two Paths

I wanted hands-free interaction (you are holding snacks, remember). The app supports two voice backends:

  • VocalBridge AI — premium speech-to-text with better accuracy
  • Web Speech API — free, built into every modern browser, good enough for most queries

Users choose based on their needs. No account required for either — the browser-native option works out of the box.
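
The choice-plus-fallback logic can be sketched as a small pure function (the names and exact fallback order are my assumptions, not the app's source):

```typescript
type VoiceBackend = "vocalbridge" | "web-speech" | "none";

// Resolve the user's preferred backend against what is actually usable.
// At runtime, hasWebSpeech would be something like
// "SpeechRecognition" in window || "webkitSpeechRecognition" in window.
export function resolveVoiceBackend(
  preferred: VoiceBackend,
  env: { hasVocalBridgeKey: boolean; hasWebSpeech: boolean }
): VoiceBackend {
  if (preferred === "vocalbridge" && env.hasVocalBridgeKey) return "vocalbridge";
  if (env.hasWebSpeech) return "web-speech"; // browser-native fallback, no account
  return "none"; // degrade gracefully to typed input
}
```
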

The BYOK Model

Instead of running my own API keys and dealing with billing, rate limiting, and abuse prevention, Netflix & What Now uses a bring-your-own-keys approach. Users enter their free Gemini and TMDB API keys, which are stored locally in the browser. Benefits:

  • Zero infrastructure cost — no API bills to manage
  • No account system — no auth, no database, no user management
  • Privacy — keys never touch a server I run; they stay in the browser and go only to Google and TMDB directly
  • Free for everyone — both Gemini and TMDB offer generous free tiers
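
The storage side of BYOK is tiny. A sketch with the storage interface injected so it runs outside a browser; in the app it would just be `localStorage` (the key names are illustrative):

```typescript
// Keys live only in browser storage; nothing is sent to an app server.
// Storage is injected here so the sketch is testable outside a browser.
interface KeyStore {
  getItem(k: string): string | null;
  setItem(k: string, v: string): void;
}

// Illustrative storage keys, not necessarily the app's own.
const GEMINI_KEY = "nwn:gemini-api-key";
const TMDB_KEY = "nwn:tmdb-api-key";

export function saveKeys(store: KeyStore, gemini: string, tmdb: string): void {
  store.setItem(GEMINI_KEY, gemini);
  store.setItem(TMDB_KEY, tmdb);
}

export function loadKeys(store: KeyStore): { gemini: string; tmdb: string } | null {
  const gemini = store.getItem(GEMINI_KEY);
  const tmdb = store.getItem(TMDB_KEY);
  // Both keys are required before any API call can be made.
  return gemini && tmdb ? { gemini, tmdb } : null;
}
```
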

Tech Stack

  • Next.js with App Router for the framework
  • Google Gemini via the AI SDK for vision and chat
  • TMDB API for show metadata
  • Tailwind CSS + shadcn/ui for a cinematic Netflix-inspired dark UI
  • PWA with service worker for installability
  • Vercel for hosting

What I Learned

  1. Gemini vision is remarkably good at identifying TV content from phone camera captures — even at angles, with reflections, or from across the room.
  2. BYOK removes an entire layer of complexity — no billing, no rate limiting, no abuse. Users invest 2 minutes getting API keys and get a free tool forever.
  3. PWAs are underrated for phone-first tools. The install experience is smooth, and you avoid the app store tax entirely.
  4. Voice input adds more value than expected — once people start talking to their phone while watching TV, they do not want to go back to typing.

The project is open source. Try it at netflix-and-what-now.vercel.app or check out the source on GitHub.

Venkata Subramanian Srinivasan
Senior Data Scientist at Asurion | Georgia Tech Alumni