DEV Community

Duchan

Giving an LLM Eyes and Hands on a Mobile Simulator

Mobile QA has a scaling problem.

Unit tests and API tests run in CI automatically. But the thing that actually matters to most users — does tapping this button do the right thing, does this screen look right after this flow, does the deeplink open the correct state — none of that runs automatically. Someone has to open the simulator, walk through the steps, and verify. Every time.

The usual answer is Appium or XCUITest. But those require engineers to write and maintain test code that mirrors the UI, breaks whenever the screen changes, and only runs against builds developers already have locally.

We had a different idea. tapflow already lets humans control a simulator through a browser. What if we gave an LLM the same interface?


The interface a human uses

When a person does QA in tapflow, the loop is:

  1. Look at the simulator screen
  2. Decide what to do (tap, swipe, type)
  3. Do it
  4. Look again

This is exactly the perception-action loop that vision-capable LLMs are built for. The model sees a screenshot, reasons about what it shows, decides what action to take, and calls a tool to execute it.

We didn't need to build a new automation layer. We just needed to expose tapflow's existing WebSocket and REST APIs as MCP tools.


What the MCP server does

@tapflowio/mcp-server connects to a running tapflow relay and registers 13 tools that any MCP-compatible client can call:

list_devices       — see all simulators registered on the relay
connect_device     — join a device session
boot_device        — boot a simulator (waits up to 30s for ready state)
screenshot         — capture the current screen
tap                — tap at a pixel coordinate
swipe              — swipe between two coordinates
type_text          — type into the focused field
press_key          — press a keyboard key (Return, Delete, Escape...)
press_button       — press a hardware button (home, lock)
install_app        — install a build from App Center
launch_app         — launch an installed app
list_builds        — list available builds on the relay
disconnect_device  — end the session

Setup is two environment variables:

TAPFLOW_RELAY_URL=wss://your-relay-url
TAPFLOW_TOKEN=your-pat-token
npx @tapflowio/mcp-server

Add it as an MCP server in your client config, and those tools appear in the model's tool list.


How the tools are implemented

Screenshot — the model's eyes

The screenshot tool calls the REST endpoint we added in v0.3.0 (GET /api/v1/sessions/:id/screenshot), gets back a PNG or JPEG buffer, base64-encodes it, and returns it as MCP image content alongside the pixel dimensions:

return {
  content: [
    { type: 'image', data: buf.toString('base64'), mimeType },
    { type: 'text', text: `Screenshot saved: ${filePath} (${width}×${height}px)` },
  ],
}

The model receives the actual image. It can read text on screen, identify UI elements, notice error states — the same things a human would.
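
The pixel dimensions returned with the image have to be read from the encoded buffer. As a minimal sketch of one way to do that for PNG output (this is an assumption about the approach, not tapflow's actual code, and `readPngDimensions` is a hypothetical helper name), the width and height can be read straight from the PNG header:

```typescript
// Hypothetical helper, not taken from tapflow's source.
// Reads width and height from a PNG's IHDR chunk. JPEG needs a different parser.
const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]

interface Dimensions {
  width: number
  height: number
}

function readPngDimensions(bytes: Uint8Array): Dimensions {
  // Layout: 8-byte signature, 4-byte chunk length, 4-byte "IHDR" tag,
  // then 4-byte width and 4-byte height.
  if (bytes.length < 24) throw new Error('Buffer too short to be a PNG')
  for (let i = 0; i < PNG_SIGNATURE.length; i++) {
    if (bytes[i] !== PNG_SIGNATURE[i]) throw new Error('Not a PNG file')
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)
  // Both values are big-endian unsigned 32-bit integers at offsets 16 and 20.
  return { width: view.getUint32(16, false), height: view.getUint32(20, false) }
}
```

A Node.js `Buffer` is a `Uint8Array`, so the buffer returned by the screenshot endpoint could be passed in directly when the format is PNG.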

Tap and swipe — normalized coordinates

Here's the part that took a few iterations to get right. The simulator's logical coordinate space is different from screenshot pixel coordinates, and it changes with screen resolution, device type, and scale factor.

Rather than exposing logical coordinates (which the model can't reason about without device-specific knowledge), we have the model work entirely in screenshot pixel space. The tap tool takes pixel coordinates plus the screenshot dimensions, then normalizes internally:

// tools.ts
client.tap(sessionId, x / screenshotWidth, y / screenshotHeight)

The model calls screenshot first, reads the dimensions from the response, then uses those same dimensions when calling tap. This means the model can identify "the button is at roughly pixel 200, 450" from the image and tap it directly — no coordinate system translation required.
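
As a rough sketch of that normalization step with input checks added (the function name `toNormalized` and the validation are assumptions for illustration; the original tool may simply divide), a coordinate outside the screenshot is rejected instead of being forwarded to the simulator:

```typescript
// Hypothetical helper: converts screenshot pixel coordinates to the 0..1 range.
interface NormalizedPoint {
  x: number
  y: number
}

function toNormalized(
  x: number,
  y: number,
  screenshotWidth: number,
  screenshotHeight: number,
): NormalizedPoint {
  // Dimensions must be positive, otherwise the division is meaningless.
  if (!(screenshotWidth > 0) || !(screenshotHeight > 0)) {
    throw new RangeError('Screenshot dimensions must be positive')
  }
  // A point outside the image usually means the model used stale or wrong dimensions.
  if (x < 0 || x > screenshotWidth || y < 0 || y > screenshotHeight) {
    throw new RangeError(`Point (${x}, ${y}) is outside ${screenshotWidth}x${screenshotHeight}`)
  }
  return { x: x / screenshotWidth, y: y / screenshotHeight }
}

// Usage: a tap at pixel (200, 450) on a 400x900 screenshot.
const point = toNormalized(200, 450, 400, 900)
// point.x === 0.5, point.y === 0.5
```

Failing loudly here gives the model an error it can react to, for example by taking a fresh screenshot, rather than a tap that silently lands in the wrong place.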

Swipe works the same way. The client splits the gesture into 8 interpolation steps and sends touch:move events at even intervals across the duration to simulate a natural gesture:

// client.ts — swipe interpolation
const STEPS = 8
const interval = durationMs / STEPS

this.send({ type: 'input:touch:start', sessionId, payload: { x: startX, y: startY } })
for (let i = 1; i < STEPS; i++) {
  await delay(interval)
  const t = i / STEPS
  this.send({
    type: 'input:touch:move',
    sessionId,
    payload: {
      x: Math.round(startX + (endX - startX) * t),
      y: Math.round(startY + (endY - startY) * t),
    },
  })
}

Async operations over WebSocket

Several tools involve async operations — booting a device, installing an app — where the relay sends a confirmation back over WebSocket after the operation completes.

The client uses a waitFor pattern: register a predicate against incoming messages, return a promise that resolves when a matching message arrives, and reject if a timeout fires first.

// client.ts — waitFor
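private waitFor(predicate: (msg: RelayMsg) => boolean, timeoutMs: number): Promise<RelayMsg> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      // Remove this waiter on timeout. Guard against -1 so splice cannot
      // remove an unrelated waiter if this one is already gone.
      const idx = this.waiters.findIndex(w => w.resolve === resolve)
      if (idx !== -1) this.waiters.splice(idx, 1)
      reject(new Error('Request timed out'))
    }, timeoutMs)
    // The message handler is expected to clear `timer` when it resolves a waiter.
    this.waiters.push({ predicate, resolve, reject, timer })
  })
}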

boot_device waits up to 30 seconds, and install_app waits up to 60 seconds. Each resolves on the confirmation message or rejects, either with the relay's error payload or with a timeout error.
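
The snippet above shows only the registering half of the pattern. Below is a self-contained sketch of both halves, including the dispatch side that resolves waiters when a message arrives. The class name `MessageWaiter`, the `dispatch` method, and the `device:booted` message type are hypothetical and chosen for illustration; they are not tapflow's actual names.

```typescript
// Minimal stand-in for the relay's message shape (assumption).
type RelayMsg = { type: string; payload?: unknown }

interface Waiter {
  predicate: (msg: RelayMsg) => boolean
  resolve: (msg: RelayMsg) => void
  timer: ReturnType<typeof setTimeout>
}

class MessageWaiter {
  private waiters: Waiter[] = []

  // Resolves with the first message matching `predicate`, or rejects on timeout.
  waitFor(predicate: (msg: RelayMsg) => boolean, timeoutMs: number): Promise<RelayMsg> {
    return new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        // Drop the expired waiter so it can never match a later message.
        this.waiters = this.waiters.filter(w => w.timer !== timer)
        reject(new Error('Request timed out'))
      }, timeoutMs)
      this.waiters.push({ predicate, resolve, timer })
    })
  }

  // Call this for every incoming WebSocket message.
  dispatch(msg: RelayMsg): void {
    const matched = this.waiters.filter(w => w.predicate(msg))
    this.waiters = this.waiters.filter(w => !matched.includes(w))
    for (const w of matched) {
      clearTimeout(w.timer) // stop the timeout from firing after a successful match
      w.resolve(msg)
    }
  }
}

// Usage: wait up to 30 seconds for a (hypothetical) boot confirmation.
const relay = new MessageWaiter()
const booted = relay.waitFor(msg => msg.type === 'device:booted', 30_000)
relay.dispatch({ type: 'device:booted' })
booted.then(msg => console.log('received', msg.type))
```

Clearing the timer on a match matters in practice: without it, every successful wait would leave a pending timeout that keeps the process alive until it fires.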


What a session looks like

A model running a login flow might do this:

1. list_devices → pick a session
2. connect_device
3. list_builds → find the build to test
4. boot_device
5. install_app
6. launch_app
7. screenshot → see the login screen
8. tap(email field coordinates) → focus the input
9. type_text("test@example.com")
10. tap(password field coordinates)
11. type_text("password")
12. tap(login button coordinates)
13. screenshot → verify the home screen loaded
14. disconnect_device

Each screenshot gives the model a chance to verify state before proceeding. If step 13 shows an error message instead of the home screen, the model knows something went wrong.


Where we are: experimental

The version says 0.3.1-experimental.1 for a reason. The tools work, but the layer needs more hardening before we'd call it reliable.

The core issue is consistency. The same sequence of tool calls should produce predictable behavior every time. Right now it doesn't always — there are timing edge cases where an action fires before the UI has fully settled, device state can drift between steps without the model noticing, and error recovery when something unexpected happens mid-flow is rough.
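
To make the timing problem concrete, here is a hedged sketch of one possible mitigation: polling screenshots until two consecutive captures match before acting. This is not something tapflow ships, and `waitForStableScreen` and its `capture` parameter are hypothetical. Exact string comparison is also a simplification, since a blinking cursor or a status bar clock would keep it from ever settling.

```typescript
// `capture` is an injected function returning a comparable screenshot
// representation, for example base64 image data (assumption).
type Capture = () => Promise<string>

interface SettleOptions {
  intervalMs?: number
  maxAttempts?: number
}

async function waitForStableScreen(capture: Capture, options: SettleOptions = {}): Promise<string> {
  const { intervalMs = 250, maxAttempts = 10 } = options
  let previous = await capture()
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    // Give animations and transitions time to progress before the next capture.
    await new Promise<void>(resolve => setTimeout(resolve, intervalMs))
    const current = await capture()
    // Two identical consecutive frames are treated as "settled".
    if (current === previous) return current
    previous = current
  }
  throw new Error(`Screen did not settle after ${maxAttempts} captures`)
}
```

A real implementation would likely need a tolerance-based image comparison or a masked region instead of strict equality.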

These are solvable problems, but we want to solve them before presenting this as something teams should build pipelines on.


Where we're going: CI/CD without a QA script

The direction we're aiming at is using the MCP server as the foundation for LLM-driven smoke tests in CI.

The scenario: a new build passes unit tests and gets uploaded to App Center. A CI step spins up the MCP server, points it at the relay, and gives a model a natural-language test spec:

"Install the latest build. Log in with test credentials. Navigate to the cart, add an item, and confirm the checkout screen shows the correct total. Take a screenshot at each step."

The model does the steps, captures evidence, and reports what it saw. No automation code to write. No selectors to maintain when the UI changes. The spec is just a description of what a human would do.

This isn't production-ready yet. The stability work comes first. But the pieces — browser-controllable simulators, screenshot REST endpoint, MCP tool layer — are in place. The question is whether the model can run a flow reliably enough to be trusted in CI without a human verifying each run.

We think it can. That's what we're building toward.


Try the MCP server (experimental)

npm install -g @tapflowio/mcp-server@experimental

You'll need a running tapflow relay and a personal access token (PAT) with viewer scope. Configure it in your MCP client:

{
  "mcpServers": {
    "tapflow": {
      "command": "npx",
      "args": ["@tapflowio/mcp-server"],
      "env": {
        "TAPFLOW_RELAY_URL": "wss://your-relay-url",
        "TAPFLOW_TOKEN": "your-pat-token"
      }
    }
  }
}

If you try it and hit rough edges, open an issue — that feedback is exactly what's shaping the stability work.

Top comments (2)

Harjot Singh

"We didn't need a new automation layer, we just exposed the existing WebSocket/REST APIs as MCP tools" is the quietly important lesson: the value wasn't reinventing automation, it was giving the model a clean perception-action surface over what already worked. The screenshot-reason-act loop maps perfectly to vision LLMs, and mobile QA is a great fit because the human loop is already exactly that. The thing I'd watch in practice is the perception half. Vision models are confident readers of UI, but they misread state more than people expect (a disabled button that looks enabled, a toast that has already been dismissed), so the failure mode is the agent acting on a screen it misperceived. The guardrail that earns its keep is verifying that the post-action state matches the intent before moving on: assert the screen changed the way it should, and don't assume the tap landed. That perceive-act-verify discipline is exactly how I think about agent automation in Moonshift. With 13 tools, are you feeding the model raw screenshots, or also an accessibility/structure dump so it isn't relying purely on pixels?

Duchan

Appreciate this — you said the core lesson better than the post did: the win was a clean perception-action surface over what already worked, not a new automation layer.

To your question: right now it's raw screenshots only. The model works in screenshot pixel space — reads the image, picks a coordinate, and the tool normalizes it (x / screenshotWidth). No structure dump yet. So you've named the exact soft spot: pixels can't disambiguate "disabled-but-looks-enabled" or "toast already gone," and the agent is one confident misread away from acting on the wrong screen.

Where I want to take it is an optional structure layer beside the pixels — with a fun wrinkle for tapflow. The human touch path deliberately avoids WebDriverAgent (raw HID injection — no signing, survives restarts). But the agent path is exactly where something WDA-like earns its keep: an opt-in extension that pulls the iOS accessibility / Android view tree, giving the model element identity + bounds + enabled state. That cuts the misread mode (cross-check what it sees vs what the tree says) and lets it tap by element instead of a guessed pixel. It also sharpens your perceive-act-verify point — the post-action assertion can check the tree, not just diff pixels.

Opt-in is the point, though: the base path's appeal is no WDA/signing, so structure would be an extension you flip on for agents, not a QA dependency. Honest status — direction, not shipped; the MCP layer is still experimental and consistency comes first.

How do you handle it in Moonshift — structure alongside pixels, or pixel-first with hard verification?