- contact@verticalserve.com
Paste a sentence about your app. The bench crawls it, builds a feature tree, generates a regression suite grounded in real selectors, runs it on a schedule, and turns every failure into a structured diagnosis your team can act on.
Every team starts with the same plan: ship the feature, then add automation. By the third release, the test suite is a graveyard of flaky cases referencing selectors that don't exist. The bench fixes the loop end-to-end — from generating cases that match real selectors to explaining failures when they happen.
assert_* verbs (text, url, aria_invalid, no_console_errors) instead of fuzzy expected-text matching.Paste a sentence about your app and the URL it lives at. The bench logs in, crawls every menu and sub-page, captures a screenshot of each, builds a feature tree, derives the personas, and generates a per-feature regression suite — plus a responsive-design pass and a UX audit. You watch it happen on the project page.
Most flaky tests reference selectors that never existed — the agent guessed plausibly but wrong. The bench's vision-grounded generator takes a real screenshot + DOM extract of every feature page and only emits selectors it can see. Every case ends with explicit assert_* verbs so failures point at the real problem, not at a fuzzy text-match miss.
assert_text, assert_visible, assert_url, assert_aria_invalid, assert_no_console_errors, assert_response_time_under, …Generation, execution, diagnosis, automation — the bench closes the loop that usually leaks effort at every step.
Paste a brief, get a feature tree + regression suite + responsive + UX-audit plans. Crawl-grounded, not made up.
Multimodal LLM sees the page, references real selectors, emits explicit assert_* verbs. Re-ground anytime.
RCA agent classifies every failure — test_design / product_bug / environmental — with a confidence score and a suggested fix.
Schedule runs, compare against baseline, webhook on regressions with the RCA already attached. Real automation, not a cron job.
Each role gets the surface they need — and never has to read someone else's YAML.
Bootstraps from a brief, reviews the auto-generated feature tree, edits cases by chatting with the agent, watches scheduled runs roll in.
Sees regression PR-style: this build vs last green build. RCA tells them whether to fix the test, fix the app, or check infra. No more spelunking.
Gets a Slack ping only when a regression actually breaks the user-facing path. Each ping carries the failed case, screenshot, and RCA.
Docker-compose install. JWT auth out of the box. Per-env config: different URLs, creds, and behavior per staging/QA/prod. SSO when you're ready.
Every bench operation — crawl, case generation, test execution, RCA — runs through an InsightWorker agent on the same machine. The bench is a thin orchestrator on top: it decides which bundle to run with which inputs and persists the results. The heavy lifting (Playwright sessions, multimodal LLM calls, deterministic shell steps) is the engine's job.
That separation is why every primitive — playwright_steps, vision_generate_cases, failure-rca, test-suite-bootstrap — is a normal §15 InsightWorker bundle you can read, edit, and extend.
The bench was designed for teams that actually rely on regression coverage — multi-environment, real credentials, real reports, real history. Not a screencast prop.
Per-env base URLs, login URLs, post-login selectors, and per-persona credentials. Promote a suite from staging to production without rewriting cases.
Bench stores only env-var NAMES. Actual secrets live in your environment — Vault, AWS Secrets Manager, Kubernetes secrets, plain env, whatever you already use.
Every run is comparable to the last baseline for the same plan / persona / env. Regressions, fixes, still-passing, still-failing — classified automatically.
Daily-at or interval scheduling. Webhook on every run, only on failures, or only on regressions. Slack-shaped JSON payload with RCA already attached.
Per-step screenshots on pass AND fail, full execution log with timing, and the page's console errors at the moment of failure. Triage in one click.
Docker-compose for a laptop or a single VPS. JWT auth + first-user-is-admin out of the box. Bring your own MySQL and provider keys.
Most regression suites die slowly — selectors rot, waits get bumped, skip-lists grow. InsightTestBench keeps the suite grounded in the real app, explains every failure, and pings you only when something actually broke. Self-hosted in your environment.
Bootstrap from brief · Vision-grounded cases · Self-explaining failures · Scheduled runs · Multi-env · Docker-compose deploy