DEV Community

Kevin Lu
Voice-first PC Builder Agent Built on Gemini Live API

This post was written for my submission to the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge

architecture diagram

Inspiration

I've built PCs before, and I know the pain firsthand: squinting at tiny connectors, pausing YouTube every 10 seconds with greasy fingers, and second-guessing whether you're about to break your build. But the real breaking point was trying to help a friend build their first PC remotely over a video call. I was watching a shaky camera feed, trying to describe which cable goes where, while they couldn't find the front panel headers or the parts I was referencing unless I pulled up pictures. I realized what they needed wasn't me on a video call; it was an AI that could see what they see, talk them through it hands-free, and not lose patience on the fifteenth "which one is the 8-pin?"

What it does

BuildBuddy is a hands-free, voice-first AI assistant that guides users through a PC build in real time. You talk to it, it talks back, no typing, no pausing, no touching your phone with thermal-paste fingers. Point your camera at a part or connector you don't recognize, and it identifies it visually. A built-in parts reference shows labeled diagrams for tricky connectors like PSU cables. It tracks your build progress step-by-step and logs every action with timestamps and camera snapshots to a shareable timeline so a friend, mentor, or forum can review exactly what happened and verify the AI's guidance. Because AI can be wrong, and accountability matters.

main screen

How I built it

  • Gemini Live API: real-time bidirectional audio streaming; the user speaks, the AI responds with voice, all over a persistent WebSocket connection
  • Google ADK (Agent Development Kit): agent orchestration, tool definitions, and session management
  • Custom tool calls: build progress tracking (update_part_status) and connector image references (get_connector_image, show_user_part), letting the AI trigger UI updates mid-conversation
  • Cloud Run: deployment with session affinity to keep WebSocket connections pinned to a single instance
  • Cloud Build: automated CI/CD; pushes to the repo trigger a Docker image build and deployment to Cloud Run without manual steps
  • Firestore: logs every build event with timestamps, part status, and notes
  • Google Cloud Storage: stores camera snapshots at each build step
  • Vanilla HTML/CSS/JS frontend: no frameworks, mobile-first, designed for one-handed use with a phone propped up next to your build
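The "tool calls trigger UI updates" contract can be sketched independently of ADK. This is a minimal stand-in (hypothetical names, not the actual BuildBuddy code): a registry maps a model-issued function call to a Python handler, and the server emits a UI event alongside the tool result so the frontend can react mid-conversation.

```python
import json

# Hypothetical tool registry; in the real app ADK manages this,
# but the shape of the contract is the same.
TOOLS = {}

def tool(fn):
    """Register a function as a model-callable tool."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def get_connector_image(connector: str) -> dict:
    # Assumed image catalog; the real app serves labeled PSU-cable diagrams.
    catalog = {"8-pin EPS": "/static/connectors/eps-8pin.png"}
    return {"image_url": catalog.get(connector)}

def dispatch(call: dict) -> tuple:
    """Run a model-issued function call; return (tool result, UI event).

    The UI event is what lets the agent update the frontend mid-conversation:
    the server forwards it to the browser over the same WebSocket.
    """
    result = TOOLS[call["name"]](**call["args"])
    ui_event = {"type": "show_panel", "tool": call["name"], "payload": result}
    return result, ui_event

result, ui_event = dispatch(
    {"name": "get_connector_image", "args": {"connector": "8-pin EPS"}}
)
print(json.dumps(ui_event))
```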

Persisting state on ToolContext across conversation turns is what makes this a stateful agent rather than a stateless function caller.

Here's what that looks like:

# app/tools/update_part_status.py lines 23-60
def update_part_status(
    tool_context: ToolContext,
    part_id: str,
    status: Literal["NOT_STARTED", "IN_PROGRESS", "DONE", "BLOCKED"],
    notes: str = "",
) -> dict:
    build = tool_context.state.get("build_progress", {})
    entry = build.setdefault(part_id, {})  # tolerate a first-time part_id
    entry["status"] = status
    entry["notes"] = notes
    tool_context.state["last_updated_part"] = part_id
    tool_context.state["build_progress"] = build
    ...

Since ADK's after_tool_callback only receives tool_context, we store last_updated_part as a state breadcrumb so the Firestore logger knows which delta to snapshot without storing the full build every time.
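In plain Python, the breadcrumb pattern looks roughly like this (a sketch using a stand-in ToolContext class; the real callback receives ADK's object):

```python
from dataclasses import dataclass, field

@dataclass
class FakeToolContext:
    # Stand-in for ADK's ToolContext: a dict that survives across turns.
    state: dict = field(default_factory=dict)

def log_delta(tool_context) -> dict:
    """What the after_tool_callback does: snapshot only the changed part.

    The callback only sees tool_context, so the tool left
    `last_updated_part` in state as a breadcrumb pointing at the delta.
    """
    part_id = tool_context.state["last_updated_part"]
    entry = tool_context.state["build_progress"][part_id]
    return {"part_id": part_id, **entry}  # small record, not the whole build

ctx = FakeToolContext(state={
    "build_progress": {
        "cpu": {"status": "DONE", "notes": ""},
        "psu": {"status": "IN_PROGRESS", "notes": "routing 8-pin EPS"},
    },
    "last_updated_part": "psu",
})
print(log_delta(ctx))
```

This keeps each Firestore write proportional to one part's worth of data instead of the whole build map.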

Challenges I ran into

Real-time audio was the hardest part of the entire project. The PCM microphone recorder would flood the WebSocket with audio buffer data and crash the connection, and Nvidia Broadcast made it worse with what I suspect were buffer-timing issues. Muting the mic after each utterance and eventually bypassing Nvidia Broadcast mostly resolved the symptoms, but it took real debugging time to isolate. On top of that, camera frames kept streaming during tool call execution, which interrupted the audio data flow and caused the WebSocket to lose its connection entirely. The Gemini Live API and ADK expect very specific WebSocket data timing, so if tool calls and camera frames collide with the audio stream, everything falls apart. The fix required async waits and blocking mechanisms to prevent simultaneous data floods.

# Tool call websocket fix
# app/main.py ~lines 190-310
frames_allowed = asyncio.Event()
frames_allowed.set()

# downstream_task: lock when tool starts
if event.get_function_calls():
    frames_allowed.clear()  # LOCK realtime input

# unlock after turn completes + cooldown
if getattr(event, "turn_complete", False):
    await asyncio.sleep(1.5)  # cooldown in case of consecutive tool calls
    frames_allowed.set()

# upstream_task: gate audio and camera behind the lock
if not frames_allowed.is_set():
    continue  # drop frame/audio while tool is running

We always send the camera feed at 1fps to Gemini, not at the raw capture rate from the frontend, and always send the most recent frame rather than queuing every frame.

  1. Client sends every captured frame to the WebSocket
  2. Server buffers only the latest in latest_image_blob, overwriting constantly
  3. frame_injection_worker forwards the buffered frame to Gemini at a 1s interval, but only if a new frame arrived since the last send (tracked by identity of the raw data bytes, not of the Blob wrapper object)
# Camera feed
# app/main.py
async def frame_injection_worker():
    last_sent_data = None
    while True:
        await frames_allowed.wait() # blocked during tool calls
        await asyncio.sleep(1) # hard limit on 1fps in the backend
        if (latest_image_blob is not None
                and latest_image_blob.data is not last_sent_data):
            live_request_queue.send_realtime(latest_image_blob)
            last_sent_data = latest_image_blob.data
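The "keep only the latest frame" half of that pipeline is a single overwrite slot, not a queue. A simplified sketch of the receive side (no WebSocket, just the buffering semantics):

```python
latest_frame = None  # single slot, not a queue

def on_frame(data: bytes) -> None:
    """Called for every frame the client sends; older frames are dropped."""
    global latest_frame
    latest_frame = data  # overwrite: memory stays O(1) at any input rate

# The client may push frames far faster than 1 fps...
for i in range(30):
    on_frame(b"frame-%d" % i)

# ...but the injection worker only ever sees the most recent one.
print(latest_frame)
```

Dropping stale frames is the right trade-off here: Gemini reasoning about a two-second-old connector shot is worse than skipping it.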

Audio buffering fix

let pcmBuffer = [];
let pcmBufferBytes = 0;

const flushTimer = setInterval(flushBuffer, SEND_INTERVAL_MS); // ~50ms

audioRecorderNode.port.onmessage = (event) => {
    const pcmData = convertFloat32ToPCM(event.data);
    pcmBuffer.push(pcmData); // accumulate
    pcmBufferBytes += pcmData.byteLength;
};

function flushBuffer() {
    if (pcmBufferBytes === 0) {
        return;
    }

    // merge all accumulated chunks into one contiguous buffer
    const merged = new Uint8Array(pcmBufferBytes);
    let offset = 0;
    for (const chunk of pcmBuffer) {
        merged.set(new Uint8Array(chunk), offset);
        offset += chunk.byteLength;
    }

    // reset the accumulator
    pcmBuffer = [];
    pcmBufferBytes = 0;

    audioRecorderHandler(merged.buffer);
}

Together, these fixes (the async gate for the camera feed and audio, plus the PCM batching) stabilized the WebSocket connection.

Cloud Run deployment was its own adventure. The app failed immediately with a single concurrent instance because the frontend has to fetch JavaScript and other static files while the WebSocket holds its connection open; with one instance, we'd get rate-limited on our own requests. The solution was setting a minimum of one warm instance and a maximum of two, plus enabling session affinity to keep WebSocket connections stable. GCS bucket configuration for public image access was also non-obvious: disabling the "prevent public access" setting isn't enough, you also need to explicitly grant the allUsers principal the Storage Object Viewer role. I would have preferred per-object public access, but the bucket's uniform access policy doesn't allow mixed permissions.

Finally, ADK documentation didn't always match the actual behavior of the current version, which meant a lot of trial-and-error to figure out how things actually worked versus how they were documented.

Accomplishments that we're proud of

The thing I'm most proud of is that it actually works end-to-end as a real-time voice agent. You can have a natural conversation with your hands full while building a PC, and the AI genuinely helps. The shareable build timeline turned out to be a surprisingly compelling feature. It reframes AI assistance from "trust the black box" to "here's a reviewable record of everything that happened."

The ADK before/after callbacks power the whole thing:

# app/tools/update_part_status.py
def before_tool_modifier(tool, args, tool_context):
    if tool.name == "update_part_status":
        b64, mime = get_latest_image()
        tool_context.state["pending_blob"] = {"data": b64, ...}

def after_tool_report_log(tool, args, tool_context, tool_response):
    if tool.name == "update_part_status":
        snapshot = _build_snapshot(tool, tool_context)

        # Phase 1: insert with gcs_url="PENDING"
        doc_id = _log_to_firestore(snapshot) 
        blob_url = _upload_blob_to_gcs(tool_context)

        # Phase 2: update with image url from gcs
        _update_firestore_record(doc_id, blob_url) 
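The two-phase write is there for crash-safety: if the GCS upload fails, the Firestore record still exists with gcs_url="PENDING" instead of the build event being silently lost. A sketch of the pattern with in-memory stand-ins (hypothetical, not the real Firestore/GCS clients):

```python
import uuid

fake_firestore = {}  # stand-in for a Firestore collection

def log_event(snapshot: dict, upload) -> str:
    # Phase 1: record the event first, with a sentinel URL.
    doc_id = str(uuid.uuid4())
    fake_firestore[doc_id] = {**snapshot, "gcs_url": "PENDING"}
    # Phase 2: upload the snapshot image, then patch in the real URL.
    try:
        fake_firestore[doc_id]["gcs_url"] = upload()
    except Exception:
        pass  # event survives with gcs_url="PENDING" for later repair
    return doc_id

doc_id = log_event(
    {"part_id": "psu", "status": "DONE"},
    upload=lambda: "https://storage.googleapis.com/bucket/snap.jpg",
)
print(fake_firestore[doc_id]["gcs_url"])
```

Ordering the writes the other way (upload first, then insert) would mean a crash between the two steps leaves an orphaned image with no timeline entry pointing at it.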

report feature

What I learned

This was my deep dive into Gemini's Live API and real-time multimodal streaming. The biggest takeaway is that voice-first interfaces have fundamentally different UX constraints than text-based ones: latency matters more than token count, interruption handling is critical, and you can't show a wall of text to someone whose hands are inside a PC case. I also learned that tool calls in a streaming audio context need careful state management to prevent race conditions between camera frames, audio chunks, and tool executions happening simultaneously. On the deployment side, Cloud Run's session affinity is non-negotiable for WebSocket applications, something that's obvious in hindsight but cost real debugging time.

What's next for BuildBuddy

The current version is scoped as a focused single-user experience, but for production:

  • Multi-user sessions: right now there's no session isolation. Adding proper authentication and per-session Firestore collections would let multiple people use BuildBuddy simultaneously, each with their own build timeline
  • AR connector overlay: use the camera feed to overlay labels directly on the motherboard showing exactly where each cable connects
  • Build templates: pre-loaded guides for popular builds so the AI has part-specific knowledge from the start, with compatibility checking to warn about mismatches before you start building
  • Motherboard manual RAG: use embeddings to index motherboard manuals so the AI can reference exact pin layouts, header locations, and BIOS settings specific to the user's board since the motherboard manual is the real source of truth for every build
  • Community review: let experienced builders comment on shared timelines to catch AI mistakes, suggest better cable routing, or help troubleshoot blocked steps
  • Multi-language support: the voice assistant should support more than just English
  • Noise robustness: the hackathon demo runs in a quiet space, but a real user has fans, tools, and background noise. Proper noise cancellation and wake-word detection would be essential
  • Production security: the hackathon version uses broad permissions and a public GCS bucket for convenience. In production, I'd use presigned URLs with expiration for snapshot access, lock down Firestore rules per user, and apply least-privilege IAM roles for each service account across Cloud Run, Firestore, and GCS

Devpost Submission | Demo Video
