Hey! Some ideas from Windows-MCP that might be useful here #1

@kundeng

Labels: enhancement, discussion


Hey there!

I've been designing some features for Windows-MCP (that Windows automation project with 2M+ users on Claude Desktop) and realized these patterns could be really valuable for CUP too.

Thought I'd share since you're building the cross-platform version of this: these ideas could work across all your target platforms, not just Windows. Here are three things I'm working on:

  1. WaitFor tool — so agents don't have to spam snapshots or use dumb fixed sleeps
  2. Advanced selector queries — fuzzy matching, regex, and smarter element disambiguation
  3. Return state on actions — cuts agent round-trips in half (with some caveats)

None of this is Windows-specific — it's all stuff that could work on macOS, Linux, Web, etc. Just thought it might be worth considering for CUP's roadmap.


Why this might be interesting

The problem

Right now CUP does full tree re-traversal on every snapshot() call, which is expensive (like 50-300 COM calls per window on Windows, similar costs on other platforms). Agents end up either:

  • Spamming snapshots after every action to see what changed
  • Using fixed wait(2000) sleeps and hoping the UI settled
  • Struggling to target the right element when there are multiple "Submit" buttons

What I'm designing for Windows-MCP

I'm working on specs for a bunch of stuff to fix these issues:

  • Condition-based waits so agents can say "wait until this button appears" instead of polling
  • Smart selectors with fuzzy matching and better disambiguation
  • Actions that optionally return the new state so you don't always need a separate snapshot call

(I originally explored event-driven incremental snapshots too, but tabled that — the UI tree changes too fast and too unpredictably for reliable cache invalidation. Turns out full re-traversal is the pragmatic choice for now.)

The cool part: Windows-MCP has 2M+ users already, so I'm designing this based on real production pain points, not theoretical problems.


The ideas (in more detail)

1. WaitFor Tool

The problem: Agents currently do this nonsense:

action("click", ...)
wait(2000)  # Hope the modal appeared?
snapshot()  # Check if it's there
wait(1000)  # Still loading?
snapshot()  # Check again...

My approach: Just let agents say what they're waiting for:

wait_for("element_exists", {"name": "Login", "control_type": "Button"})
wait_for("window_active", {"window": "Spotify*"})
wait_for("element_gone", {"name": "Loading..."})  # Wait for spinner to disappear

Uses events to wake up fast when the condition is met, with polling as a fallback. Works on any platform that has event systems (which is all of them).
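To make the "events first, polling as a fallback" idea concrete, here's a minimal sketch of how the wait loop could look. Everything in it is an assumption for illustration, not existing CUP or Windows-MCP code: the `wait_for` signature takes a plain callable instead of the condition-name strings above, `element_exists` is a hypothetical helper, and `snapshot` is assumed to return a flat list of dict nodes.

```python
import threading
import time
from typing import Callable, Optional


def wait_for(condition: Callable[[], bool],
             timeout_s: float = 10.0,
             poll_interval_s: float = 0.25,
             wake: Optional[threading.Event] = None) -> bool:
    """Return True once condition() holds, or False when the timeout expires.

    `wake` is a hypothetical hook: a platform event listener would set it when
    the UI changes, so the loop re-checks right away instead of sleeping out
    the full poll interval. Without it, this is plain polling.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        pause = min(poll_interval_s, remaining)
        if wake is not None:
            wake.wait(pause)  # returns early if a UI event fired
            wake.clear()
        else:
            time.sleep(pause)


def element_exists(snapshot: Callable[[], list], **attrs) -> Callable[[], bool]:
    """Build a condition that is true when any node has all the given attributes."""
    def check() -> bool:
        return any(all(node.get(k) == v for k, v in attrs.items())
                   for node in snapshot())
    return check
```

An `element_gone` condition would just be the negation of `element_exists`. The timeout is measured with `time.monotonic()` so a system clock adjustment can't stretch or shorten the wait.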


2. Advanced Selector Query Schema

The problem: "Click the Submit button" — which one? There are three on this page.

My approach: Let agents be way more specific:

{
    "name": "Submit",
    "name_re": "Submit|Send|OK",      # Regex for variations
    "control_type": "Button",
    "automation_id": "btnSubmit",     # Stable ID when available
    "window": "Spotify*",             # Scope to a specific window
    "fuzzy": 0.8,                     # "Submitt" typo? Still matches
    "index": 0                        # First match if multiple remain
}

If exact match fails, auto-fallback to fuzzy matching and tell the agent what it found — including the closest matches so the agent can self-correct. This is basically how UiPath and Power Automate do it — proven pattern.

You can also scope searches to a specific window (glob matching), use automation_id for stable targeting when available, or use index as a last-resort positional disambiguator.
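Here's a rough sketch of how that matching order could work as pure Python filtering over a snapshot. It's illustrative only: `find_elements`, the flat list-of-dicts node shape, and the result keys (`match_kind`, `closest`) are all assumptions, and I'm assuming `name` and `name_re` are alternatives (either one matching is enough), which the schema above doesn't pin down. Fuzzy scoring here uses `difflib.SequenceMatcher` from the standard library as a stand-in for whatever similarity measure ends up being used.

```python
import re
from difflib import SequenceMatcher
from fnmatch import fnmatchcase


def _similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_elements(nodes: list, query: dict) -> dict:
    """Filter nodes by the query; fall back to fuzzy name matching if needed."""
    candidates = list(nodes)
    if "window" in query:  # glob-scoped to a window title
        candidates = [n for n in candidates
                      if fnmatchcase(n.get("window", ""), query["window"])]
    for key in ("control_type", "automation_id"):  # strict equality filters
        if key in query:
            candidates = [n for n in candidates if n.get(key) == query[key]]

    name, name_re = query.get("name"), query.get("name_re")

    def exact(node: dict) -> bool:
        node_name = node.get("name", "")
        if name is None and name_re is None:
            return True
        if name is not None and node_name == name:
            return True
        return name_re is not None and re.fullmatch(name_re, node_name) is not None

    matches = [n for n in candidates if exact(n)]
    kind = "exact"
    if not matches and name is not None and "fuzzy" in query:
        matches = [n for n in candidates
                   if _similarity(n.get("name", ""), name) >= query["fuzzy"]]
        kind = "fuzzy"

    closest = []
    if not matches:
        kind = "none"
        if name is not None:  # diagnostic feedback so the agent can self-correct
            ranked = sorted(candidates, reverse=True,
                            key=lambda n: _similarity(n.get("name", ""), name))
            closest = [n.get("name", "") for n in ranked[:3]]

    if matches and "index" in query:  # last-resort positional disambiguator
        i = query["index"]
        matches = [matches[i]] if 0 <= i < len(matches) else []

    return {"matches": matches, "match_kind": kind, "closest": closest}
```

Reporting `match_kind` alongside the matches lets the agent see when it got a fuzzy hit instead of an exact one, and `closest` gives it something to retry with when nothing matched.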


3. Return State on Actions

The problem: Every agent loop is like:

action("click", ...)  # Round-trip 1
snapshot()            # Round-trip 2 to see what happened
action("type", ...)   # Round-trip 3
snapshot()            # Round-trip 4...

My approach: Optionally return the state with the action result:

action("click", query={"name": "Login"}, return_state=True, settle_ms=300)
# Returns: {"status": "ok", "state": "... snapshot here ..."}

Cuts round-trips in half for actions with immediate effects. Optional parameter so it's backward compatible.

Caveat though: Some actions take unpredictable time to settle — opening an app, loading a page, waiting for a modal. A fixed settle_ms won't always cut it. For those cases you'd pair this with WaitFor instead. So return_state is best for quick actions (click a button, type text, toggle a checkbox) where you're confident the UI settles fast. For anything heavier, WaitFor is the right tool.
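On the server side this can be a thin wrapper around the existing action and snapshot paths. A minimal sketch, assuming hypothetical `perform` and `snapshot` callables standing in for whatever the platform adapter actually exposes (`run_action` is an illustrative name, not a real CUP function):

```python
import time
from typing import Any, Callable, Dict


def run_action(perform: Callable[[], None],
               snapshot: Callable[[], Any],
               return_state: bool = False,
               settle_ms: int = 0) -> Dict[str, Any]:
    """Run an action and optionally attach a fresh snapshot to the result.

    With return_state=False the result is just {"status": "ok"}, so existing
    callers see no change.
    """
    perform()
    result: Dict[str, Any] = {"status": "ok"}
    if return_state:
        if settle_ms > 0:
            time.sleep(settle_ms / 1000.0)  # fixed settle delay before capturing
        result["state"] = snapshot()
    return result
```

The settle delay is still a fixed sleep, which is exactly why this only suits quick actions; anything slower should leave `return_state` off and follow up with WaitFor.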


How this could work for CUP

You could do this in phases so each piece delivers value independently:

Phase 1: WaitFor tool (2-3 weeks)

  • Add wait_for() to the platform adapter interface
  • Implement for Windows (polling-based, with event hooks where feasible)
  • Add the MCP tool
  • Biggest bang for the buck — agents immediately stop wasting turns on fixed sleeps

Phase 2: Selector queries (2-3 weeks)

  • Extend cup/search.py with the advanced query fields
  • Add query parameter to action tools (alongside element_id)
  • Fuzzy matching + window scoping + diagnostic feedback

Phase 3: Return state (1 week)

  • Add optional return_state and settle_ms parameters to action tools
  • Document when to use return_state vs WaitFor

Total: ~6 weeks if you wanted to do all of it. But each phase works independently.


Why this could be cool for CUP

For users:

  • Agents get way faster (fewer wasted turns on polling and redundant snapshots)
  • More reliable (no more race conditions from fixed sleeps)
  • Easier to target elements (fuzzy matching, window scoping, diagnostic feedback)

For the project:

  • These are patterns designed against real production pain points (2M+ user base)
  • Cross-platform from day one — nothing Windows-specific in any of this
  • WaitFor alone would be a major differentiator vs other automation protocols

Prior art (this isn't new, just not in CUP yet)

| Feature | Windows-MCP (planned) | UiPath | Power Automate | Playwright | pywinauto |
|---------|-----------------------|--------|----------------|------------|-----------|
| WaitFor conditions | ✅ | ✅ | ✅ | ✅ | ✅ |
| Fuzzy matching | ✅ | ✅ | ✅ | ❌ | ✅ |
| Return state | ✅ | ❌ | ❌ | ❌ | ❌ |

Basically borrowing the best ideas from enterprise RPA (UiPath, Power Automate) and modern web automation (Playwright) and making them cross-platform.


Potential concerns

"Breaking changes?"
Nah, everything's additive — new tools, optional parameters. Existing code keeps working.

"This sounds Windows-specific"
It's really not. At its core, WaitFor is just polling + condition checking, which works on any platform; event hooks are an optional speed-up where a platform offers them. Selectors are pure Python filtering on the tree. Return state is just "call snapshot after the action." None of this requires platform-specific hooks.

"Return state might be stale"
Yep, that's real. Some actions take time (opening apps, loading pages). That's why return_state is best for quick actions, and WaitFor handles the slow ones. They complement each other.

If this seems interesting, let me know. Just thought these ideas were cool and might be useful for what you're building!

Cheers
