Category: Agents

  • AWS Transform is the wrong tool for the job you actually have

    First off, I want to say I love Amazon Web Services, Kiro, and any effort that makes migrations from legacy to modern tech stacks easier. But I also like a good counterargument.

    AWS launched AWS Transform as “a collaborative enterprise IT transformation workbench powered by expert agents.” It promises to modernize your .NET apps 5x faster, shrink mainframe projects from “years to months,” and automate VMware migrations end to end. The marketing page claims 4.5 billion lines of code analyzed and 1.69 million hours of manual effort saved in the last twelve months.


    I’ve read the pitch. I’ve watched a couple of demos. I think most teams considering it should walk away, and I want to explain why, and what to do instead.

    What AWS Transform actually is

    Strip the agentic-AI gloss off and Transform is three things bundled together:

    1. A discovery and assessment layer that scans your existing estate (codebases, VMs, dependencies).
    2. A set of pre-built “agents” that perform specific transformations: .NET Framework → .NET on Linux, COBOL → Java, VMware → EC2, Java 8 → Java 21, and so on.
    3. A workbench (web console plus Kiro IDE integration) where humans review and approve agent output.

    It’s a continuation of a lineage: Migration Hub, MGN, App2Container, the old Microsoft Workloads tooling, the original CodeWhisperer transformation features. AWS keeps reshuffling these into new umbrella brands. Transform is the 2026 wrapper.

    That history matters, because it tells you something about the half-life of the product you’re betting on.

    My core objection: the output is shaped like AWS

    When you let an agent translate a COBOL batch job into Java, or a .NET Framework service into .NET on Linux, you don’t just get “modern code.” You get code that looks the way AWS’s agent decided modern code should look. The data access patterns it picks, the logging conventions, the way it splits modules, the runtime targets it assumes — all of that is now baked into your codebase, and none of it was a decision your team made.


    This is fine if you’re a hands-off shop that’s going to run whatever comes out the other end. It’s a disaster if you intend to own and evolve the system afterward. You will spend the next three years asking “why is it like this?” and the answer will be “because an agent decided in May 2026.”


    There’s a deeper version of this problem with mainframe and VMware work. The agent doesn’t just translate code — it picks the AWS-native destination. Step Functions instead of your existing scheduler. DynamoDB instead of “let’s think about whether this data actually fits a KV store.” Network conversion that assumes you want VPC-native everything. These are not neutral technical choices; they are commercial decisions made on AWS’s behalf, inside your repo.

    The metrics are a vendor pitch, not a forecast

    “5x faster.” “70% lower operating costs.” “Years to months.”


    Every modernization vendor has said versions of these numbers for twenty years. They are real for the case study they came from. They are almost never what you will personally experience, because:

    • The 70% cost reduction usually compares licensed Windows Server + SQL Server Standard on owned hardware against Linux + an open-source database on Graviton. Most of the savings come from switching the license model, not from anything Transform does. You can capture that yourself.
    • The “5x faster” number is measured against a baseline of “team that has never done this before, doing it manually.” If your team has done a .NET migration once, your real multiplier is closer to 1.5x.
    • The mainframe-to-Java case studies almost always involve workloads that were already partially decomposed. The genuinely tangled mainframes, the ones where modernization is actually hard, are not the ones that ship as case studies.
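
    To make the first bullet concrete, here is a back-of-the-envelope sketch. Every dollar figure in it is a made-up placeholder, not AWS pricing and not data from any case study. The only point is how to split a headline saving into the part that comes from dropping licenses and the part that comes from everything else.

```python
# All dollar amounts below are hypothetical placeholders for illustration only.
from dataclasses import dataclass


@dataclass
class AnnualCost:
    licenses: float    # OS and database licensing
    compute: float     # hardware or instance spend
    operations: float  # people and tooling needed to run it

    @property
    def total(self) -> float:
        return self.licenses + self.compute + self.operations


def savings_breakdown(before: AnnualCost, after: AnnualCost) -> dict:
    """Return the total saving and the share of it that is purely licensing."""
    total_saved = before.total - after.total
    license_saved = before.licenses - after.licenses
    return {
        "total_saved_pct": round(100 * total_saved / before.total, 1),
        "license_share_of_savings_pct": round(100 * license_saved / total_saved, 1),
    }


before = AnnualCost(licenses=600_000, compute=250_000, operations=150_000)
after = AnnualCost(licenses=0, compute=180_000, operations=120_000)
print(savings_breakdown(before, after))
```

    With placeholder inputs like these, the headline percentage looks impressive while most of it is the license line. Running the same split on your own numbers tells you how much of the pitch you could capture without the workbench.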

    If a vendor’s headline metric needs four asterisks to be accurate, it isn’t really a metric.

    Modernization that skips understanding is just translation

    The thing I dislike most about Transform, and about agentic modernization in general, is that it lets you finish a project without anyone on your team understanding what they now own.


    When a senior engineer spends six months untangling a COBOL system to port it, the porting is half the value. The other half is that, at the end, someone in the building understands the system. They know where the landmines are. They can answer questions in incident review. They can tell product what’s safe to change.

    If an agent does the port, you get code on the other side and an organization that is no smarter than it was before. Worse: you now have a Java codebase that nobody wrote and nobody fully grasps, sitting on top of business logic nobody re-derived. The first production incident will be ugly.

    This is the same complaint people have about outsourced rewrites, and it applies cleanly to agent-driven ones.

    The Kiro and tooling lock-in

    Transform leans on Kiro, AWS’s IDE, with “pre-built playbooks.” Adopting Transform meaningfully means asking your engineers to learn Kiro, to work inside AWS’s review workflow, and to accept handoffs in a format that’s optimized for AWS’s agents to re-enter later.


    That’s a real switching cost. Two years from now, when AWS rebrands Transform into whatever comes next, those playbooks and that workflow knowledge depreciate fast.

    What to do instead

    I’m not arguing for “do nothing” or “stay on the mainframe forever.” Modernization is often the right call. But the right shape is almost always:

    1. Do the assessment yourself, or with a consultancy you’d hire anyway. The discovery piece of Transform is the least controversial part — but it’s also the part you most want to own. Knowing your own estate is a permanent capability. Renting it from an agent is not.
    2. Use general-purpose coding agents under human direction, not vertical modernization agents. Claude Code, Cursor, Copilot in agent mode: these are genuinely useful for the grunt work of a migration (rewriting a thousand similar files, fixing a known refactor pattern, translating tests). The difference is that your engineer is driving, deciding the target architecture, and reading the output. The agent is a force multiplier, not a contractor.
    3. Small wins, not big-bang. Pick the highest-pain module. Modernize it. Run it alongside the old system. Cut over. Repeat. This is slower on paper than “let the agent do it all,” but it produces a team that understands the new system at each step. And you can stop whenever the remaining legacy stops costing you money — which, for a lot of mainframe workloads, is the honest answer.
    4. Separate the license/runtime change from the architecture change. If most of your savings come from leaving Windows + SQL Server, do that migration as its own project. Don’t let it get bundled with a re-architecture, because the re-architecture is where the risk lives and you want it isolated.
    5. Be honest about workloads that shouldn’t move. Some legacy systems are stable, cheap to run, and changed once a quarter. Modernizing them is a status project, not a value project. Transform’s marketing will never tell you this; a good architect will.
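
    The “run it alongside the old system” part of step 3 can be sketched as a shadow comparison: feed the same inputs to the legacy module and its rewrite, and only cut over once the mismatch list stays empty. This is a minimal sketch, not a full parallel-run framework. `legacy_invoice_total` and `modern_invoice_total` are hypothetical stand-ins for whatever module you pick first.

```python
from typing import Any, Callable, Iterable


def shadow_compare(
    legacy: Callable[[Any], Any],
    modern: Callable[[Any], Any],
    inputs: Iterable[Any],
) -> list[dict]:
    """Run both implementations on the same inputs and collect mismatches."""
    mismatches = []
    for item in inputs:
        expected = legacy(item)       # the old system stays the source of truth
        try:
            actual = modern(item)
        except Exception as exc:      # a crash in the new code counts as a mismatch
            actual = f"error: {exc!r}"
        if actual != expected:
            mismatches.append({"input": item, "legacy": expected, "modern": actual})
    return mismatches


# Hypothetical stand-ins for one module and its rewrite.
def legacy_invoice_total(cents: list[int]) -> int:
    return sum(cents)


def modern_invoice_total(cents: list[int]) -> int:
    return sum(c for c in cents if c > 0)   # deliberately drops credits


report = shadow_compare(
    legacy_invoice_total, modern_invoice_total, [[100, 250], [500, -50]]
)
print(report)
```

    The rewrite here silently ignores negative line items, which is exactly the kind of buried business rule a team learns about by running the comparison themselves.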

    TLDR:

    AWS Transform is well-engineered. The agents work. The demos are real. None of that is the question.


    The question is whether you want to end a multi-year modernization with a codebase shaped by AWS’s opinions, a team that didn’t learn the system, and a tooling dependency on a product line AWS will rename in two years.

    For most teams I’ve worked with, the answer is no. Use agents (yours, under your control) to make your own engineers faster. Keep the architectural decisions in the building. Skip the workbench.

    Questions? Let me know.

  • An agentic Kanban Workflow

    I had this idea the other day and started building it into a project I’m working on. We all hate Jira, but the idea of Kanban is still a useful way to track projects.

    As we think about this in the AI era, we could easily integrate the Jira MCP into a workflow but, once again, we hate Jira. So that led me down a path of reinventing the wheel, mostly for my own purposes. I came up with this simple diagram:

    We will forever want to keep a human in the loop, so a web interface is still likely necessary. However, a text-based interface could also be cool. Maybe in the future!

    What we end up with is a system of three agents:

    1. Developer Agent – This agent writes and builds code in a sandboxed container, based on specs written by a human or on bugs found by the QA Agent.
    2. Build Agent – This agent monitors our build pipelines. If there is a failure, it diagnoses why and opens a bug for the Developer Agent to fix.
    3. QA Agent – Arguably the most important agent. This one executes tests that simulate human interaction with the software as closely as possible. When it finds bugs, it logs them back into the Kanban for the Developer Agent.

    Now we have a full DevOps life cycle with three agents. If I build this out, I become the user who simply enters specs as features or bugs for the developer agent to work through. The code is still stored in a Git-based repository, and build failures can use my existing Build Failure Agent. Claude Code or Codex could function as the developer agent, or we could run the whole thing on Amazon Bedrock.
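
    As a rough sketch of how the board could route work between the three agents, here is a small in-memory model. The column names, the `OWNER` mapping, and the `Board` class are all my own assumptions for illustration, not the design from the project.

```python
from dataclasses import dataclass, field
from enum import Enum


class Column(Enum):
    BACKLOG = "backlog"
    IN_PROGRESS = "in_progress"
    BUILD = "build"
    QA = "qa"
    DONE = "done"


# Which actor owns a card in each column (the human owns backlog and done).
OWNER = {
    Column.BACKLOG: "human",
    Column.IN_PROGRESS: "developer_agent",
    Column.BUILD: "build_agent",
    Column.QA: "qa_agent",
    Column.DONE: "human",
}


@dataclass
class Card:
    title: str
    kind: str                      # "feature" or "bug"
    column: Column = Column.BACKLOG
    history: list = field(default_factory=list)


class Board:
    def __init__(self):
        self.cards = []

    def add(self, title, kind="feature"):
        card = Card(title, kind)
        self.cards.append(card)
        return card

    def move(self, card, column):
        card.history.append(card.column)   # keep an audit trail for the human
        card.column = column

    def report_bug(self, found_by, title):
        """Build and QA agents file bugs as new cards for the developer agent."""
        bug = self.add(f"[{found_by}] {title}", kind="bug")
        self.move(bug, Column.IN_PROGRESS)
        return bug

    def queue_for(self, agent):
        return [c for c in self.cards if OWNER[c.column] == agent]


board = Board()
spec = board.add("Add CSV export")
board.move(spec, Column.QA)
bug = board.report_bug("qa_agent", "Export button does nothing")
print([c.title for c in board.queue_for("developer_agent")])
```

    Each agent only ever asks the board for its own queue, which keeps the human free to reorder or veto cards from the web interface.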

    Proof of concept coming some day when I have time!

  • The state of SEOScoreAPI – An OpenClaw Project

    If you remember, I did an experiment with OpenClaw. I let it have access to build anything it wanted with a single EC2 instance. It built https://seoscoreapi.com, which is a fantastic tool for checking the status of your website’s SEO.

    Initially I thought I would scrap the project and just let the domain expire. But I started doing some manual promotion after OpenClaw spent over $200, and got the first paid user! Since then I’ve put in a few hours a day promoting it and adding some features.

    What’s New

    The last few months have been the most productive stretch since launch. Here’s what shipped:

    ADA Accessibility Audits

    This one came from watching the news. ADA website lawsuits hit over 4,000 in 2025 and they’re still climbing. Small businesses, e-commerce sites, local restaurants — everyone’s a target.

    We built a full WCAG 2.1 AA compliance endpoint that injects axe-core (the industry-standard accessibility engine) into a headless browser and scans the rendered page. It returns a compliance score, a lawsuit risk assessment, category breakdowns across 10 areas (color contrast, forms, keyboard navigation, ARIA, etc.), and specific fix suggestions for every violation.
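
    For a feel of how raw scan output can become a single number, here is a minimal sketch. The `violations`, `impact`, and `nodes` fields are part of the results object axe-core returns; the weights, the per-rule cap, and the formula are assumptions of mine and not how SEOScoreAPI actually scores. The sample input is hand-written, not a real scan.

```python
# Impact levels are the ones axe-core reports; the weights are assumptions.
IMPACT_WEIGHT = {"minor": 1, "moderate": 3, "serious": 6, "critical": 10}


def compliance_score(axe_results: dict) -> int:
    """Turn an axe-core results object into a 0-100 score (hypothetical formula)."""
    penalty = 0
    for violation in axe_results.get("violations", []):
        weight = IMPACT_WEIGHT.get(violation.get("impact"), 1)
        # Penalize per affected element, capped so one rule cannot zero the score.
        penalty += min(weight * len(violation.get("nodes", [])), 25)
    return max(0, 100 - penalty)


# Hand-written sample shaped like axe-core output.
sample = {
    "violations": [
        {"id": "color-contrast", "impact": "serious", "nodes": [{}, {}]},
        {"id": "link-in-text-block", "impact": "serious", "nodes": [{}]},
        {"id": "landmark-complementary-is-top-level", "impact": "moderate", "nodes": [{}]},
    ]
}
print(compliance_score(sample))
```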

    We ran it on our own site first. Scored a 71. Found contrast issues, links that weren’t distinguishable from body text, a misused aside element. Fixed everything. Now we score 100. That’s the point — even developers who care about accessibility miss things.

    Available on all paid plans. Starter gets 5/month, Ultra gets 500.

    GEO (Generative Engine Optimization)

    Traditional SEO gets you into Google. GEO gets you into ChatGPT, Claude, Perplexity, and every RAG pipeline pulling from the web.

    The GEO audit checks 26 factors across four categories: crawl accessibility, structural markup, content extractability, and AI discoverability. It answers questions like: Do you have an llms.txt file? Is your content chunked in a way that RAG systems can ingest? Do you have freshness signals? Are AI crawlers even allowed in your robots.txt?
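
    The last of those questions is easy to check yourself with the Python standard library. This is a small sketch, not the audit’s implementation. The crawler names in `AI_CRAWLERS` are an illustrative subset, and `https://example.com/` is a placeholder URL.

```python
from urllib.robotparser import RobotFileParser

# Illustrative subset of AI crawler user-agent tokens; extend as needed.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot"]


def ai_crawler_access(robots_txt: str, url: str = "https://example.com/") -> dict:
    """Report which AI crawlers may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in AI_CRAWLERS}


robots = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(ai_crawler_access(robots))
```

    A site with a rule like the one above is invisible to one crawler while staying open to the rest, which is easy to do by accident when copying a robots.txt template.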

    This is becoming more relevant every month. Traffic from AI systems is growing and most sites aren’t optimized for it at all.

    Competitive Audits

    You can now audit your site against competitors in a single request. The response shows a side-by-side comparison with score differentials, category-by-category breakdowns, and which specific checks you’re winning or losing on. Useful for agencies pitching prospects and for anyone doing competitive analysis.

    What’s Next

    Honestly I’m not sure. I would love to get more people using the service. Possibly some more integrations, a WordPress plugin?

    Try It

    If you’ve read this far, go audit your site: seoscoreapi.com

    The demo on the homepage doesn’t require a signup. Type in your URL, see your score, read the priorities list. If you want API access, the free tier takes 30 seconds to set up — just an email and a verification code.

    If you’re a developer, the docs have everything. Python and Node.js SDKs are on PyPI and npm. The GitHub Action is at SeoScoreAPI/seo-audit-action.

    If you have questions or feedback, I’m at aaron@seoscoreapi.com.

  • Evolution of my build failure agent

    I’ve written in the past about troubleshooting build pipelines with AI. While all of this is a great step in speeding up your development and reducing the amount of troubleshooting the DevOps team needs to do in the enterprise, it is NOT the end goal.

    The end goal would be to have the AI fix the problem for you.

    I’m rebranding my Jenkins Sentinel to just be Sentinel. This workflow allows you to automate remediation for your pipelines while still retaining human-in-the-loop security.

    The other primary feature is storing your build failures and remediations in a database that you can view, update, and analyze for custom model training.

    Originally we had the dispatch layer that would notify us of build failures and possible resolutions. The new addition is the cluster of “workers”. Running on AWS Fargate, this team of developers works with the LLM on Bedrock to resolve the failure.

    1. The task spins up in the cluster.
    2. The build logs identify the repository and branch.
    3. The repository is cloned and the branch is checked out.
    4. The code fix is implemented.
    5. The task generates its reasoning and updates the database accordingly.
    6. The code is committed to a new branch and a pull request is opened.
    7. The task cleans up and shuts down.
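
    Steps 2, 3, and 6 can be sketched as plain Git and GitHub CLI commands. This is an illustration under assumptions, not Sentinel’s code: the log patterns assume Jenkins Git-plugin style lines, and the `sentinel/fix-…` branch name, the `workdir` folder, and the incident id are all hypothetical.

```python
import re


def identify_source(build_log: str) -> tuple[str, str]:
    """Step 2: pull the repository URL and branch out of the build log.

    Both patterns assume Jenkins Git-plugin style lines; adjust for your CI.
    """
    repo = re.search(r"git config remote\.origin\.url (\S+)", build_log)
    branch = re.search(r"\(refs/remotes/origin/([^)]+)\)", build_log)
    if not (repo and branch):
        raise ValueError("could not find repository or branch in build log")
    return repo.group(1), branch.group(1)


def plan_commands(repo: str, branch: str, incident_id: str) -> list[list[str]]:
    """Steps 3 and 6 as argv lists; step 4 (the LLM edit) happens in between."""
    fix_branch = f"sentinel/fix-{incident_id}"   # hypothetical naming scheme
    return [
        ["git", "clone", "--branch", branch, "--depth", "1", repo, "workdir"],
        ["git", "-C", "workdir", "checkout", "-b", fix_branch],
        # ... the worker applies the model's patch inside workdir here ...
        ["git", "-C", "workdir", "add", "-A"],
        ["git", "-C", "workdir", "commit", "-m", f"Fix build failure {incident_id}"],
        ["git", "-C", "workdir", "push", "-u", "origin", fix_branch],
        # gh is meant to be run with workdir as the current directory.
        ["gh", "pr", "create", "--base", branch, "--head", fix_branch,
         "--title", f"Sentinel fix for {incident_id}", "--body", "See incident record."],
    ]


log = """\
 > git config remote.origin.url https://example.com/acme/infra.git # timeout=10
Checking out Revision 1a2b3c (refs/remotes/origin/main)
"""
repo, branch = identify_source(log)
for argv in plan_commands(repo, branch, "42"):
    print(" ".join(argv))
```

    Returning the commands as data rather than running them keeps the plan reviewable, which fits the human-in-the-loop goal: the pull request is the approval gate, not the worker.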

    Dispatch remains the same and the developer is notified accordingly. I need to implement developer-specific notifications so that channels are not flooded or email lists abused.

    The other major thing I wanted to see was the cost per fix.

    This screenshot is from the dashboard, which shows the compute spend and the LLM spend. For this simple Terraform fix you can see the cost was around $0.02. If your code bases are more complex, this value could increase accordingly.
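
    The arithmetic behind a per-fix figure is just compute time plus token usage. The sketch below shows the shape of that calculation. The unit prices and the usage numbers are placeholders chosen only to exercise the math, not current Fargate or Bedrock pricing and not the values from my dashboard.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Placeholder unit prices; look up current Fargate and Bedrock pricing."""
    vcpu_hour: float
    gb_hour: float
    input_per_1k_tokens: float
    output_per_1k_tokens: float


def cost_per_fix(rates, vcpu, memory_gb, runtime_seconds, input_tokens, output_tokens):
    """Return (compute, llm, total) spend in dollars for one remediation task."""
    hours = runtime_seconds / 3600
    compute = hours * (vcpu * rates.vcpu_hour + memory_gb * rates.gb_hour)
    llm = (input_tokens / 1000) * rates.input_per_1k_tokens
    llm += (output_tokens / 1000) * rates.output_per_1k_tokens
    return round(compute, 4), round(llm, 4), round(compute + llm, 4)


# Hypothetical rates and usage, chosen only to exercise the arithmetic.
rates = Rates(vcpu_hour=0.04, gb_hour=0.005,
              input_per_1k_tokens=0.003, output_per_1k_tokens=0.015)
compute, llm, total = cost_per_fix(
    rates, vcpu=1, memory_gb=2, runtime_seconds=180,
    input_tokens=4000, output_tokens=500,
)
print(compute, llm, total)
```

    In a split like this the LLM side grows with the size of the logs and code sent to the model, so that is the term to watch as repositories get bigger.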

    I also included a stats page which shows the totals for the entire organization.

    This is all real data from my testing project. The build agent is successfully troubleshooting pipelines for:

    • Python
    • Terraform
    • Java
    • TypeScript
    • Docker
    • Kubernetes
    • Go
    • CloudFormation

    I plan to continue to add more supported platforms and languages as time allows. The other major integration that I am working on is support for GitHub Actions. Once I complete that integration and put this into all of my pipelines I expect that my troubleshooting and development time will decrease rapidly.

    Other future plans include:

    • Ingestion of bugs through sources like Jira, Todoist (my favorite), or another ticketing system.
    • Discord Dispatching
    • Teams Dispatching – although this is really hard to develop for without a paid account
    • Custom model – using the build failure data to train a model

    Anyway, this project has been super fun. If you want to implement it on your own infrastructure feel free to reach out!

    PS: the featured image was generated and set up through my Nano Banana WordPress plugin

  • Why I built my own WordPress Platform

    I’ve been building websites for a long time. I remember learning HTML and Microsoft FrontPage as my first website builder. It was such a fun time to be creating horrific looking websites back in the early 2000s. As the internet progressed so did my skills and back in 2016 I formed my company 45Squared to build websites for small businesses. My whole goal is to be your trusted resource when it comes to being online.

    When I started the company I built WordPress websites of various shapes and sizes, but they always ran on AWS. This helped me expand my AWS skills as well as provide robust infrastructure for my clients’ websites to live on. I managed the website and the underlying infrastructure for a small monthly cost that beat the competition. The result: a bunch of paying customers and a decent side hustle.

    As time went on, selling became harder and the race to zero on cost was apparent. So, with the AI boom on, I decided that it was time to automate the site building process.

    I started documenting how I would want this to work: fully automated website deployments, design, content, custom domains, a good SEO base, and all of it deployed FAST!

    Enter https://ai.45sq.net. This platform is fully automated. The customer can provide inputs and descriptions of what they want as well as photos or other graphical content. The workflow takes all of the inputs and builds a fully functional WordPress website hosted on AWS. The user can easily point their own domain to the server and set up automatic payments. They then get full administrative access to their website so they can expand and add features just like any other WordPress site.

    So why did I build this?

    If you contact a web designer now, you will have to pay them to build the initial design, work within their timelines, and end up with something that needs revisions. Your time to go live will be measured in weeks, not minutes.

    The platform I built for 45Squared eliminates the need for the initial design fees and focuses on getting you online quickly. It’s great for small businesses who are just getting started.

    So now when I get a request to build a site I can tell the customer that I have two options. First, fully custom. I’m still willing to sit with you and build out the picture perfect website. Or, two, you can launch your own and I will still support the website and help you with your online presence.

    So that’s it. An easy to use WordPress website launcher. Running on enterprise grade cloud. With content, design, layout and all the rest handled by the magic of Claude Opus.

    Try it out: https://ai.45sq.net. No contracts. No weird fees. Get online today.

  • I built a WordPress Plugin For Generating Images With Nano Banana

    An image of a blogger who takes himself way too seriously for his own good.

    AI is everywhere. Accept it. Anyway, I had a random thought last night about having a WordPress plugin that allows you to generate images on the fly for your posts. Pictures increase engagement on posts, so what if we just inline Nano Banana directly into Gutenberg?

    This morning I built this plugin which is a simple API call to Google’s Gemini AI Studio through a Gutenberg block.

    1. Type your prompt
    2. Choose your model
    3. Hit generate
    4. Insert

    Simple!

    Nano Banana Image Generator block

    Once the image is inserted into the post, it turns the block into a standard image block so it’s as easy to manage as any other image.

    I submitted the plugin to the official WordPress repository but it takes a while to get approved. So, if you want to add it to your own WordPress instance feel free to message me and I’ll give you access to the repository!

  • How I’m Using AI Agents to Run My Side Projects

    I think the promise of AI is to handle the workflows that maybe you don’t have time for. Or maybe something that’s slightly out of your realm of expertise.

    I read an article the other day that had this quote:

    I keep waiting for someone to walk into my office and tell me what problems I should be solving with AI. Nobody’s come. (link)

    This mindset, in my opinion, is fundamentally incorrect. If you are looking for problems to solve, there are a vast number of them and you’ll be drowned in possibility. For me, AI has always been about gaining efficiency or adding capability.

    If you haven’t noticed, I have a lot of side projects. I hope some day one of them hits the jackpot and allows me to retire to an island with all my family and friends. It’s unlikely, but hey, I’m allowed to dream.

    I realized long ago that I’m not great at marketing. I don’t understand lead generation, I’m not super amazing at SEO (so I built a tool for it: https://seoscoreapi.com), and I’m not in the least bit artistic (that went to my brother: https://mrbenny.co). But what I am good at is problem solving and process creation.

    I’ve been working on the concept of a second brain for a while, but I realized even a second brain needs tools. I started thinking about how to manage all of my side projects and how to interface with them through my preferred platform of Slack. (Sponsor me?)


    I’ve come up with a business development agent that can handle the things that I don’t particularly specialize in. On the backbone of Claude Sonnet 4.6, I created an API Gateway that takes input from my Slack instance and can handle a variety of tasks. 

    • SEO – Using my SeoScoreAPI it can handle generating SEO reports
    • Lead Generation – Using a variety of third-party APIs, I have it looking for businesses that don’t have websites so I can pitch them design services. Or, if their SEO is bad, I can help fix it
    • Lead nurturing – From the above leads, I get reminded that “Hey – you should connect with this person”
    • AWS Monitoring – It has read access to my AWS organization’s bills so that it can tell me if I’m overspending or just give me weekly overviews
    • WordPress MCP – Each of my managed WordPress instances has the MCP connected to it with read access, so that if any of the sites have plugin upgrades, connectivity issues, errors, or anything else, I can quickly resolve them
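
    The dispatch shape behind a setup like this can be sketched in a few lines. A real agent would let the model decide which tool to call; this sketch swaps that for a plain keyword lookup so the routing is easy to see. The handler names, the command words, and the reply strings are all hypothetical stubs.

```python
from typing import Callable


# Each handler is a hypothetical stub; real ones would call the tools listed above.
def seo_report(arg: str) -> str:
    return f"SEO report queued for {arg}"


def find_leads(arg: str) -> str:
    return f"Searching for leads in {arg}"


def aws_costs(arg: str) -> str:
    return "Weekly AWS cost overview requested"


ROUTES: dict[str, Callable[[str], str]] = {
    "seo": seo_report,
    "leads": find_leads,
    "costs": aws_costs,
}


def handle_message(text: str) -> str:
    """Route '<task> <argument>' from a Slack message to the matching handler."""
    command, _, argument = text.strip().partition(" ")
    handler = ROUTES.get(command.lower())
    if handler is None:
        return f"Unknown task '{command}'. Try: {', '.join(sorted(ROUTES))}"
    return handler(argument)


print(handle_message("seo example.com"))
print(handle_message("dance"))
```

    Adding a capability is then one new function and one new entry in `ROUTES`, which is roughly what makes the agent feel like hiring another worker.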

    The result of this is quite simple: I’ve added another “worker” to my organization that can help me grow. These aren’t necessarily problems I was having. They are simply areas of work that I struggle with, and they have become easier thanks to AI.

    If you think this concept is cool, I’ll be setting up a Terraform module soon over at https://aiopscrew.com

    Sign up for the mailing list to be the first one to get the details!

  • Building In Public – Designing an automated WordPress deployment engine with Terraform

    If you didn’t know, I built a lot of technical experience by building out WordPress websites. I started a business around it. I did quite well. Lately, I’ve been trying to get back into building websites for small businesses. I find that the space of AI has made some of the bigger players too complex even though they claim to be easy.

    My goal is to create a single web page where a user can describe their website and have it auto-provision a full WordPress instance built upon core AWS services. The end result will be a fully managed WordPress instance running on highly available AWS infrastructure. It will be fully automated and take around 10 minutes to produce a fully developed website.

    Current Architecture Plans:

    I’m utilizing AWS to handle everything (no surprise). Originally I was going to utilize Step Functions as the provisioner but as I started building I ended up hitting too many roadblocks and restrictions from a timing perspective.

    When dealing with Bedrock, the response times can vary. So I made the switch to an ECS approach. The control plane/signup page is built there, so the provisioner should also just be another task.

    Features:

    • Custom Domains
    • Automated SSL
    • Load balancers
    • Auto healing (through auto scaling)
    • Monitoring

    Essentially all the standard features you would expect from a web host. Just without the design portion.

    This will be a new ongoing series for you all to read about. If you’re interested in following along subscribe to my mailing list!

  • Letting OpenClaw control an AWS EC2 Instance

    If you are even remotely interested in the AI space you know about OpenClaw. If you aren’t familiar, OpenClaw is an AI Agent that runs on your computer. It can access various tools and you can communicate with it via a multitude of channels like WhatsApp or Slack.

    It was notorious for either deleting people’s information or being highly insecure.

    Anyway, as someone who is highly skeptical about all things new and AI, I watched closely but did not engage. Until now!

    I carefully installed OpenClaw in an isolated virtual machine on my home lab. I set it up so that I can engage with it via Slack. It has its own channel where we engage and it sends updates about what it’s doing.

    I gave it a system prompt that explained who it was and what its goal was. I then created a new EC2 instance inside of a new untouched AWS account. No IAM role. No security group rules except to allow inbound SSH and outbound HTTPS.

    The goal? OpenClaw needs to reach a monthly recurring revenue of $100. It has explicit instructions that if it needs anything to achieve this goal it has to show me its plan and accounting.

    So what did it do? It toyed around with some ideas for a while about building different APIs or data scrapers. I told it to find a niche that it could really work with. It landed on building an SEO scoring tool that is API based. OpenClaw then requested I buy it a domain, which I did: https://seoscoreapi.com. The domain is pointed at its EC2 instance where the API lives.

    So now what? OpenClaw built a fully functional API. It works great. You can use it for free. But how does this make money? Since I left to go to sleep last night, OpenClaw has been quietly emailing Web Developers and SEO Agencies while sending them their scores. It has created an entire scoreboard of site rankings based on its own API.

    It has also been adding itself to SEO tool lists on GitHub, opening pull requests on its own.

    So, it’s been 24 hours since OpenClaw launched https://seoscoreapi.com. How much revenue has it created?

    None.

    How much have I spent in Claude Opus API Credits? $90.56.

    Was this experiment worth it? Absolutely. The power of AI is real. Configured properly and monitored it can build some really cool things.

  • Troubleshooting Jenkins Pipelines with AI

    Do you love or hate Jenkins? I feel like a lot of the DevOps world has issues with it, but this post and system could easily be adapted to any CI/CD tool.

    One thing I do not enjoy about Jenkins is reading through its logs and trying to find out why my pipelines have failed. Because of this I decided this is a perfect use case for an AI to come in and find the problem and present possible solutions for me. I schemed up this architecture:

    ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
    │    Jenkins      │     │   API Gateway   │     │   Ingestion     │
    │  (Shared Lib)   │────▶│    /webhook     │────▶│    Lambda       │
    └─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                             │
                                                             ▼
    ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
    │   SQS Queue     │◀────│   Analyzer      │◀────│   SQS Queue     │
    │  (Dispatcher)   │     │    Lambda       │     │   (Analyzer)    │
    │     + DLQ       │     │   (Bedrock)     │     │     + DLQ       │
    └────────┬────────┘     └─────────────────┘     └─────────────────┘
             │
             ▼
    ┌─────────────────┐     ┌─────────────────┐
    │   Dispatcher    │────▶│   SNS Topic     │────▶ Email/Slack/etc. 
    │    Lambda       │     │ (Notifications) │
    └─────────────────┘     └─────────────────┘

    A simple explanation is that when a pipeline fails we are going to send the logs to an AI and it will send us the reasoning as to why the failure occurred as well as possible troubleshooting steps.
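
    The first hop in the diagram, the Ingestion Lambda, can be sketched like this. It is a minimal illustration under assumptions, not the deployed function: the `log` field in the payload and the `ANALYZER_QUEUE_URL` environment variable are hypothetical names, and the `sqs` parameter exists only so the handler can be exercised with a stand-in client instead of AWS.

```python
import json
import os


def handler(event, context, sqs=None):
    """Ingestion Lambda: accept the Jenkins webhook and enqueue it for analysis."""
    if sqs is None:                 # real invocation: build the client lazily
        import boto3
        sqs = boto3.client("sqs")

    payload = json.loads(event.get("body") or "{}")
    missing = [k for k in ("job_name", "build_number", "log") if k not in payload]
    if missing:
        return {"statusCode": 400, "body": json.dumps({"missing": missing})}

    sqs.send_message(
        QueueUrl=os.environ.get("ANALYZER_QUEUE_URL", ""),  # hypothetical variable
        MessageBody=json.dumps(payload),
    )
    return {"statusCode": 202, "body": json.dumps({"queued": True})}


class FakeSQS:
    """Stand-in client so the handler can run without AWS credentials."""

    def __init__(self):
        self.sent = []

    def send_message(self, **kwargs):
        self.sent.append(kwargs)


fake = FakeSQS()
event = {"body": json.dumps(
    {"job_name": "java-test-project", "build_number": 69, "log": "exit code 255"}
)}
response = handler(event, None, sqs=fake)
print(response["statusCode"], len(fake.sent))
```

    Keeping this function to validate-and-enqueue is the reason for the queues in the diagram: the webhook returns quickly while the slower Bedrock analysis runs behind it, with the DLQ catching anything that fails.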

    Fine. This isn’t that interesting. It saves time which is awesome. Here is a sample output into my Slack:

    This failure is because I shut down my Docker Swarm as I migrated to K3s.

    Here is the same alert via email from SNS:

    So why build this? Well, this weekend I worked on adding “memory” to this whole process in preparation of two things:

    1. MCP Server
    2. Troubleshooting Runbook(s)

    Jenkins already has an MCP server that works great in Claude Code. You can use it to query jobs, get logs, have Claude Code troubleshoot, resolve and redeploy.

    Unless you provide Claude Code with ample context about your deployment, its architecture, and the application, it might not do a great job fixing the problem. Or it might change some architecture or pattern in a way that does not match your organization’s or your personal standards. This is where my thoughts about adding memory to this process come in.

    If we add a data store to the overall process, log each incident, and give it unique identifiers, we can begin to see patterns and ultimately help the LLM make better decisions about solving problems within pipelines.

    Example:

    {
     "PK": "FP#3315b888564167f2f72185c51b3c433b6bfa79e7b0e4f734e9fe46fe0df2d8c6",
     "SK": "INC#66a6660f-6745-468f-b516-41c51b8d0ecf",
     "build_number": 69,
     "category": "environment",
     "confidence_score": 0.65,
     "created_at": "2026-02-09T14:50:26.322599+00:00",
     "fingerprint": "3315b888564167f2f72185c51b3c433b6bfa79e7b0e4f734e9fe46fe0df2d8c6",
     "incident_id": "66a6660f-6745-468f-b516-41c51b8d0ecf",
     "job_name": "java-test-project",
     "relevant_log_lines": [
      "✗ Deployment to Docker Swarm failed",
      "ERROR: script returned exit code 255",
      "Stage \"Verify Deployment\" skipped due to earlier failure(s)",
      "unset SSH_AGENT_PID; echo Agent pid 2897883 killed; [ssh-agent] Stopped."
     ],
     "requires_human_review": true,
     "reviewed_by": null,
     "root_cause": "Docker service deployment to Swarm failed with exit code 255 during SSH command execution",
     "status": "suspected",
     "suggested_fixes": [
      "SSH into the Docker Swarm manager node manually and run: `docker service ls` and `docker node ls` to verify Swarm is operational and the manager node is reachable",
      "Check SSH connectivity and credentials by running: `ssh -i <ssh-key> <swarm-manager-host> 'docker info'` to ensure the Jenkins agent can authenticate and execute Docker commands",
      "Review the SSH agent logs immediately before the failure (lines truncated in output) - the actual error message from the `docker service create` or `docker service update` command is not visible in the provided logs",
      "Verify the Docker image `192.168.1.23:5000/test-java-app:69` is accessible from the Swarm manager node by running: `docker pull 192.168.1.23:5000/test-java-app:69` on the target host",
      "Check Docker Swarm manager node resources (disk space, memory) with: `df -h` and `free -m` as resource exhaustion can cause service creation failures with exit code 255"
     ],
     "updated_at": "2026-02-09T14:50:26.322599+00:00"
    }
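
    One way to produce keys shaped like the `PK` and `SK` in that record is to hash a normalized view of the failure, so that repeats of the same problem land under the same fingerprint. The sketch below is an assumption about how such a fingerprint could be derived, not the scheme the record was actually generated with. The normalization rule (masking digit runs of four or more) is my own choice.

```python
import hashlib
import re
import uuid


def fingerprint(job_name: str, category: str, log_lines: list[str]) -> str:
    """Hash a normalized view of the failure so repeats share one key."""
    normalized = []
    for line in log_lines:
        # Mask long digit runs (pids, timestamps) but keep short exit codes.
        line = re.sub(r"\d{4,}", "<n>", line.lower())
        normalized.append(" ".join(line.split()))
    material = "\n".join([job_name, category, *sorted(normalized)])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def incident_keys(fp: str) -> dict:
    """Partition on the fingerprint, sort on a fresh incident id."""
    incident_id = str(uuid.uuid4())
    return {
        "PK": f"FP#{fp}",
        "SK": f"INC#{incident_id}",
        "fingerprint": fp,
        "incident_id": incident_id,
    }


a = fingerprint("java-test-project", "environment",
                ["ERROR: script returned exit code 255", "Agent pid 2897883 killed"])
b = fingerprint("java-test-project", "environment",
                ["Agent pid 1111 killed", "ERROR: script returned exit code 255"])
print(a == b, incident_keys(a)["PK"][:11])
```

    With keys like these, every occurrence of one failure pattern sits in a single partition, so counting repeats is one query per fingerprint.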

    The record above is the log from the same incident I showed before, with the same information now stored. If we hook up an MCP server to our AI and ask it to find patterns, we get something like this:

    Here's the summary of your 18 incidents across 7 jobs, all from today and yesterday:
    
      ---
      Category Breakdown
      ┌───────────────┬───────┬─────┐
      │   Category    │ Count │  %  │
      ├───────────────┼───────┼─────┤
      │ environment   │ 11    │ 61% │
      ├───────────────┼───────┼─────┤
      │ configuration │ 6     │ 33% │
      ├───────────────┼───────┼─────┤
      │ code          │ 1     │ 6%  │
      └───────────────┴───────┴─────┘
      Most Common: environment (11 incidents)
    
      The dominant failure pattern is Kubernetes deployment rollout timeouts — pods failing to become ready after kubectl apply or
      kubectl rollout status. This hits nearly every job:
    
      - accounting-cloud (4 incidents) — rollout timeout + health check timeout
      - oovoo-javer-bot (1) — rollout timeout, also noted k3ks-3 node is NotReady
      - dividend-tracker (3) — rollout timeout + health check wait timeout
      - aws-cost-optimizer (1) — rollout timeout
      - Password Generator (1) — rollout timeout
      - homelab-proxy (1) — deployment timeout
    
      The root cause is consistently the same: pods don't start in time. The AI analysis flags likely causes as ImagePullBackOff,
      CrashLoopBackOff, resource constraints, and the NotReady node (k3ks-3).
    
      Second Most Common: configuration (6 incidents)
    
      These are more varied:
    
      - oovoo-javer-bot (3) — missing Jenkins credential oovoo-javer-bot-env, missing K8s secret oovoo-javer-bot-secrets, empty
      withCredentials block
      - homelab-proxy (2) — Traefik CRD API version mismatch (traefik.containo.us/v1alpha1 vs traefik.io/v1alpha1)
      - dividend-tracker (1) — PVC resize forbidden (can't shrink PVC below current capacity)
      - food-cost-app-pipeline (1) — Terraform state lock held by another process
    
      Key Takeaway
    
      Your k3s cluster is the bottleneck. 11 of 18 failures are pods not starting in time. The k3ks-3 NotReady node suggests a
      cluster health issue that's likely causing scheduling failures or resource pressure across all your deployments. Fixing that
      node (or removing it from the cluster) would likely resolve the majority of these incidents.
    

    Ignore the fact that many of these are the same error. I did a lot of this on purpose.

    What you SHOULD imagine is how this would run inside your environment(s) and what data you would collect. If you think about it, you would find the bottlenecks of your own deployments. You would find the spots where your developers are getting stuck. You can then create solutions to those issues and hopefully bring that trend line down.
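
    The category table in that summary does not need an LLM at all; it is a simple aggregation over the stored records. Here is a minimal sketch, using hypothetical records reduced to just the `category` field and mirroring the counts from the summary.

```python
from collections import Counter


def category_breakdown(incidents: list[dict]) -> list[tuple[str, int, int]]:
    """Return (category, count, percent) rows, most common first."""
    counts = Counter(item["category"] for item in incidents)
    total = sum(counts.values())
    return [(name, n, round(100 * n / total)) for name, n in counts.most_common()]


# Hypothetical records shaped like the stored incident, reduced to one field.
incidents = (
    [{"category": "environment"}] * 11
    + [{"category": "configuration"}] * 6
    + [{"category": "code"}]
)
for row in category_breakdown(incidents):
    print(row)
```

    Cheap aggregates like this can run on a schedule, leaving the model for the part it is actually good at: explaining why one category dominates.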

    Next steps.

    We need a human-in-the-loop element. I’m going to start crafting a web interface where these issues are presented to a human engineer. That engineer could add notes or better steps for resolution. With that data added to the memory, the troubleshooting agent can follow the best practices of your organization or home lab.

    So, stay tuned for the web interface. If you’re interested in setting this up for yourself shoot me a message and I’ll give you access to the repository.