From brief to regression suite in minutes — not weeks.

Paste a sentence about your app. The bench crawls it, builds a feature tree, generates a regression suite grounded in real selectors, runs it on a schedule, and turns every failure into a structured diagnosis your team can act on.


Screenshot: InsightTestBench QA Command Center dashboard. Tiles show tests today, pass rate, open failures, and runs in flight; a recent-runs table lists per-run status and persona; a top-open-failures column shows flake / env-issue / bug classifications with confidence scores.

Why regression suites rot

Every team starts with the same plan: ship the feature, then add automation. By the third release, the test suite is a graveyard of flaky cases referencing selectors that don't exist. The bench fixes the loop end-to-end — from generating cases that match real selectors to explaining failures when they happen.

The usual path · sprint after sprint

Write tests → app changes → tests rot → trust dies

QA opens the app, writes a Playwright test with hand-picked selectors.
Test passes locally. Get it in CI. Celebrate.
Dev refactors the page — selectors change, button text changes.
Half the suite goes red. Triage takes a day.
Some failures are real bugs; most are "selector wrong" or "wait too short." Nobody can tell which without re-running locally.
Team disables the flaky ones. The "skip" list grows every sprint.
Six months in, half the regression suite is `@skip`'d and the rest only catches the bugs nobody was making anyway.
Manager schedules another "test stabilization sprint" that gets deprioritized for feature work.

The InsightTestBench path · ongoing

Brief → suite → scheduled runs → diagnosed failures → fixed

Paste a sentence about the app. Bench crawls + builds a feature tree.
Per-feature cases are generated from actual screenshots + DOM of the live page — selectors that exist, button text you can see.
Every case ends with explicit assert_* verbs (text, url, aria_invalid, no_console_errors) instead of fuzzy expected-text matching.
Schedule the suite to run nightly against staging.
When a case fails, the RCA agent reads the case spec + execution log + page state and classifies: test_design vs product_bug vs environmental.
Webhook hits Slack only on regressions, with the suggested fix already attached.
When a feature changes, click Vision regen on that plan to update cases from the new page — no manual rewrite.
Trust comes back. The skip list shrinks. Releases ship without anyone holding their breath.

Pillar 1 · Bootstrap

Brief in. Test suite out.

Paste a sentence about your app and the URL it lives at. The bench logs in, crawls every menu and sub-page, captures a screenshot of each, builds a feature tree, derives the personas, and generates a per-feature regression suite — plus a responsive-design pass and a UX audit. You watch it happen on the project page.

Live crawl — real Playwright sessions, real screenshots, real DOM
Auto feature tree — mermaid diagram of every page + sub-surface
Per-feature plans — happy / edge / error / permissions, with realistic personas
Cross-cutting plans — responsive design across desktop / tablet / phone + UX audit
Progressive results — exploration guide, feature tree, and plans appear as each step completes

Bench · create from brief

▶ describe what to test
"Test suite for the underwriting portal at https://uw-stg.acme.com. Cover policy creation, broker lookups, claims triage, and the reinsurance workbench. Validate forms thoroughly."

Crawled: 12 pages · 4 sub-surfaces · 3 personas inferred (underwriter, broker, admin)

Generated: 10 feature plans · 95 cases · regression suite · responsive pass · UX audit

Ready to run — schedule against staging or kick a one-off from the UI

A vision-grounded case · login feature

// generated from a screenshot + DOM of the live login page
{
  "id": "TC-VIS-002",
  "name": "Admin logs in with valid credentials",
  "steps": [
    "Navigate to https://uw-stg.acme.com/auth",
    "Fill in email field with admin email",
    "Fill in password field with admin password",
    "Click the Sign In button",
    "assert_url contains \"/dashboard\"",
    "assert_visible `[data-testid=kpi-row]`",
    "assert_no_console_errors",
    "assert_response_time_under 2000"
  ]
}

The agent saw the page. Every selector it referenced exists. Every button label matches what a real user sees. When the page changes, hit Vision regen to refresh the case from the new state — no manual rewrite.
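As a rough illustration of what "ends with explicit assert_* verbs" could mean in practice, the sketch below loads a case shaped like the one above and checks that its assertions come last. The helper names `split_steps` and `ends_with_assertions` are hypothetical and not part of the bench; only the case shape and the `assert_` prefix come from the example.

```python
import json

# The login case from above, embedded as a raw string so the \" escapes survive.
CASE_JSON = r"""
{
  "id": "TC-VIS-002",
  "name": "Admin logs in with valid credentials",
  "steps": [
    "Navigate to https://uw-stg.acme.com/auth",
    "Fill in email field with admin email",
    "Fill in password field with admin password",
    "Click the Sign In button",
    "assert_url contains \"/dashboard\"",
    "assert_visible `[data-testid=kpi-row]`",
    "assert_no_console_errors",
    "assert_response_time_under 2000"
  ]
}
"""


def split_steps(steps):
    """Return (actions, assertions); assertions are steps starting with 'assert_'."""
    actions = [s for s in steps if not s.startswith("assert_")]
    assertions = [s for s in steps if s.startswith("assert_")]
    return actions, assertions


def ends_with_assertions(steps):
    """True when there is at least one assertion and no action follows an assertion."""
    seen_assert = False
    for step in steps:
        if step.startswith("assert_"):
            seen_assert = True
        elif seen_assert:
            return False
    return seen_assert


case = json.loads(CASE_JSON)
actions, assertions = split_steps(case["steps"])
print(len(actions), "actions,", len(assertions), "assertions")
```

A check like this could run as a lint step on generated cases, so that a case with no trailing assertions is flagged before it ever reaches a scheduled run.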

Pillar 2 · Cases that don't lie

Cases grounded in what's actually on the page.

Many flaky generated tests reference selectors that never existed: the agent guessed plausibly but wrong. The bench's vision-grounded generator takes a real screenshot + DOM extract of every feature page and only emits selectors it can see. Every case ends with explicit assert_* verbs so failures point at the real problem, not at a fuzzy text-match miss.

Vision-grounded — multimodal LLM sees the page, references real elements
Phase 6 assertion verbs — assert_text, assert_visible, assert_url, assert_aria_invalid, assert_no_console_errors, assert_response_time_under, …
Form profile sweeps — 1 happy + N negatives + per-dropdown-option, automatic
Re-grounding — Vision regen on any feature refreshes its cases from the current page
Append or replace — keep what you've manually tuned, or wipe and start over
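The form profile sweep in the list above (1 happy + N negatives + per-dropdown-option) can be pictured as a simple expansion. The sketch below is one way it might work, not the bench's actual generator: the `FormProfile` shape, the field names, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FormProfile:
    # Hypothetical shape: one known-good submission plus the variations to try.
    happy: dict
    negatives: list = field(default_factory=list)   # (label, field_name, bad_value)
    dropdowns: dict = field(default_factory=dict)   # field_name -> list of options


def sweep(profile):
    """Yield (label, payload) pairs: 1 happy + N negatives + one per dropdown option."""
    yield "happy", dict(profile.happy)
    for label, name, bad_value in profile.negatives:
        # Each negative changes exactly one field of the happy payload.
        yield f"negative:{label}", {**profile.happy, name: bad_value}
    for name, options in profile.dropdowns.items():
        for option in options:
            yield f"option:{name}={option}", {**profile.happy, name: option}


profile = FormProfile(
    happy={"email": "broker@example.com", "region": "EU"},
    negatives=[("empty-email", "email", ""), ("bad-email", "email", "not-an-email")],
    dropdowns={"region": ["EU", "US", "APAC"]},
)
variants = list(sweep(profile))
print(len(variants))  # 1 happy + 2 negatives + 3 dropdown options
```

Changing one field at a time keeps each negative case attributable: if the form accepts the payload, the single altered field is the one whose validation is missing.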

Four pillars, one bench

Generation, execution, diagnosis, automation — the bench closes the loop that usually leaks effort at every step.

1 · Bootstrap from brief

Paste a brief, get a feature tree + regression suite + responsive + UX-audit plans. Crawl-grounded, not made up.

2 · Vision-grounded cases

Multimodal LLM sees the page, references real selectors, emits explicit assert_* verbs. Re-ground anytime.

3 · Failures explain themselves

RCA agent classifies every failure — test_design / product_bug / environmental — with a confidence score and a suggested fix.
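To make the shape of such a diagnosis concrete, here is a minimal sketch of a record carrying the three categories, a confidence score, and a suggested fix. The `Diagnosis` class, the 0.0 to 1.0 confidence range, and the `min_confidence` cutoff are assumptions for illustration; the source does not specify how the bench stores or thresholds RCA results.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    TEST_DESIGN = "test_design"
    PRODUCT_BUG = "product_bug"
    ENVIRONMENTAL = "environmental"


@dataclass
class Diagnosis:
    case_id: str
    category: Category
    confidence: float      # assumed to lie between 0.0 and 1.0
    suggested_fix: str


# Each category points at a different owner action.
NEXT_ACTION = {
    Category.TEST_DESIGN: "fix the test",
    Category.PRODUCT_BUG: "fix the app",
    Category.ENVIRONMENTAL: "check infra",
}


def next_action(diagnosis, min_confidence=0.6):
    """Map a diagnosis to an action; low-confidence results fall back to manual triage."""
    if diagnosis.confidence < min_confidence:
        return "triage manually"
    return NEXT_ACTION[diagnosis.category]


d = Diagnosis("TC-VIS-002", Category.TEST_DESIGN, 0.82, "Update the Sign In selector")
print(next_action(d))
```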

4 · Regression watchdog

Schedule runs, compare against baseline, webhook on regressions with the RCA already attached. Real automation, not a cron job.

Built for every role on the QA loop

Each role gets the surface it needs and never has to read someone else's YAML.

QA manager

Bootstraps from a brief, reviews the auto-generated feature tree, edits cases by chatting with the agent, watches scheduled runs roll in.

Developer

Sees regressions PR-style: this build vs. the last green build. The RCA tells them whether to fix the test, fix the app, or check infra. No more spelunking.

On-call / SRE

Gets a Slack ping only when a regression actually breaks the user-facing path. Each ping carries the failed case, screenshot, and RCA.

IT / Platform

Docker-compose install. JWT auth out of the box. Per-env config: different URLs, creds, and behavior per staging/QA/prod. SSO when you're ready.

The engine inside

TestBench runs on InsightWorker

Every bench operation — crawl, case generation, test execution, RCA — runs through an InsightWorker agent on the same machine. The bench is a thin orchestrator on top: it decides which bundle to run with which inputs and persists the results. The heavy lifting (Playwright sessions, multimodal LLM calls, deterministic shell steps) is the engine's job.

That separation is why every primitive — playwright_steps, vision_generate_cases, failure-rca, test-suite-bootstrap — is a normal §15 InsightWorker bundle you can read, edit, and extend.

insightworker.com → Bench architecture

The bundles inside

test-suite-bootstrap — crawls the app, builds the feature tree + regression / responsive / UX-audit plans.

vision-case-generator — per-feature multimodal regen from screenshot + DOM.

test-executor: one browser, one login, walks every case sequentially in a single session.

failure-rca — per-failure diagnosis: design vs bug vs env, with suggested fix + confidence.

Each bundle is plain YAML + a manifest. Read them. Fork them. Add new ones for your own QA workflows.

Built for serious use, not a demo

The bench was designed for teams that actually rely on regression coverage — multi-environment, real credentials, real reports, real history. Not a screencast prop.

Multi-environment

Per-env base URLs, login URLs, post-login selectors, and per-persona credentials. Promote a suite from staging to production without rewriting cases.
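One way to picture per-environment configuration is a small record per environment, with cases referring to paths rather than hosts. This is a sketch under stated assumptions: the `EnvConfig` fields mirror the items listed above, but the field names, the production URL (`https://uw.example.com`), and the production login path are invented for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urljoin


@dataclass(frozen=True)
class EnvConfig:
    base_url: str
    login_path: str
    post_login_selector: str   # element expected to be visible after login


ENVS = {
    "staging": EnvConfig("https://uw-stg.acme.com", "/auth", "[data-testid=kpi-row]"),
    # Hypothetical production values, not taken from the source.
    "production": EnvConfig("https://uw.example.com", "/login", "[data-testid=kpi-row]"),
}


def login_url(env_name):
    """Resolve the login URL for an environment from its base URL and login path."""
    env = ENVS[env_name]
    return urljoin(env.base_url, env.login_path)


print(login_url("staging"))
print(login_url("production"))
```

With the host kept out of the case itself, promoting a suite becomes a matter of picking a different key in `ENVS`.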

Credentials stay yours

Bench stores only env-var NAMES. Actual secrets live in your environment — Vault, AWS Secrets Manager, Kubernetes secrets, plain env, whatever you already use.
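The idea of storing only variable names can be sketched as follows. The config layout and the names `UW_ADMIN_USER` and `UW_ADMIN_PASSWORD` are hypothetical; the point is only that the stored config holds names, and the values are read from the environment at run time.

```python
import os

# Hypothetical stored config: persona -> NAMES of the env vars holding its credentials.
PERSONA_CREDENTIALS = {
    "admin": {"username_env": "UW_ADMIN_USER", "password_env": "UW_ADMIN_PASSWORD"},
}


def resolve_credentials(persona, environ=os.environ):
    """Look up the actual secrets by name at run time; fail loudly if any are missing."""
    names = PERSONA_CREDENTIALS[persona]
    missing = [n for n in names.values() if n not in environ]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(sorted(missing)))
    return environ[names["username_env"]], environ[names["password_env"]]


# A plain dict stands in for the process environment in this example.
fake_env = {"UW_ADMIN_USER": "admin@example.com", "UW_ADMIN_PASSWORD": "s3cret"}
user, password = resolve_credentials("admin", fake_env)
print(user)
```

Because the lookup happens at run time, whatever populates the environment (a secrets manager sidecar, Kubernetes secrets, a shell export) can change without touching the stored config.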

Compare runs side-by-side

Every run is comparable to the last baseline for the same plan / persona / env. Regressions, fixes, still-passing, still-failing — classified automatically.
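The four-way classification described above reduces to comparing each case's status in the baseline with its status in the current run. A minimal sketch, assuming each run is a mapping from case id to pass/fail; the `new` and `removed` buckets are an addition for cases that appear in only one run and are not mentioned in the source.

```python
def compare_runs(baseline, current):
    """Classify each case by its status in the baseline run and in the current run.

    Both arguments map case id -> True (passed) or False (failed).
    """
    buckets = {
        "regression": [], "fixed": [], "still_passing": [],
        "still_failing": [], "new": [], "removed": [],
    }
    for case_id, passed in current.items():
        if case_id not in baseline:
            buckets["new"].append(case_id)
        elif baseline[case_id] and not passed:
            buckets["regression"].append(case_id)     # was green, now red
        elif not baseline[case_id] and passed:
            buckets["fixed"].append(case_id)          # was red, now green
        elif passed:
            buckets["still_passing"].append(case_id)
        else:
            buckets["still_failing"].append(case_id)
    buckets["removed"] = [c for c in baseline if c not in current]
    return buckets


baseline = {"TC-1": True, "TC-2": False, "TC-3": True, "TC-4": False}
current = {"TC-1": False, "TC-2": True, "TC-3": True, "TC-4": False, "TC-5": True}
result = compare_runs(baseline, current)
print(result["regression"], result["fixed"])
```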

Schedules + webhooks

Daily-at or interval scheduling. Webhook on every run, only on failures, or only on regressions. Slack-shaped JSON payload with RCA already attached.
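The three webhook policies and the Slack-shaped body might look roughly like this. The mode names (`every_run`, `on_failure`, `on_regression`), the plan name, and the message layout are assumptions; the only external fact relied on is that a Slack incoming webhook accepts a JSON body with a `text` field.

```python
import json


def should_notify(mode, failed, regressed):
    """Decide whether to fire the webhook for a finished run (mode names are assumed)."""
    if mode == "every_run":
        return True
    if mode == "on_failure":
        return bool(failed)
    if mode == "on_regression":
        return bool(regressed)
    raise ValueError(f"unknown webhook mode: {mode}")


def slack_payload(plan, env, regressed):
    """Build a minimal Slack-style JSON body with one line per regressed case."""
    lines = [f"{plan} on {env}: {len(regressed)} regression(s)"]
    for item in regressed:
        lines.append(
            f"- {item['case_id']}: {item['category']} "
            f"({item['confidence']:.0%}), fix: {item['suggested_fix']}"
        )
    return json.dumps({"text": "\n".join(lines)})


regressed = [{
    "case_id": "TC-VIS-002",
    "category": "test_design",
    "confidence": 0.82,
    "suggested_fix": "Update the Sign In selector",
}]
if should_notify("on_regression", failed=["TC-VIS-002"], regressed=regressed):
    payload = slack_payload("login-plan", "staging", regressed)
    print(payload)
```

Sending the payload is left out on purpose; in practice it would be an HTTP POST of this body to the configured webhook URL.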

Evidence on every case

Per-step screenshots on pass AND fail, full execution log with timing, and the page's console errors at the moment of failure. Triage in one click.

Self-hosted

Docker-compose for a laptop or a single VPS. JWT auth + first-user-is-admin out of the box. Bring your own MySQL and provider keys.

Stop chasing flaky tests. Start shipping with confidence.

Most regression suites die slowly — selectors rot, waits get bumped, skip-lists grow. InsightTestBench keeps the suite grounded in the real app, explains every failure, and pings you only when something actually broke. Self-hosted in your environment.

