Transcript
Wes Reisz: My current project, I am on what is called an AI-first software delivery project for Thoughtworks. What that means is we're shifting AI left in everything we're doing with the SDLC. What I'm going to be talking about today is primarily focused on coding agents and how we're leveraging them. Before I get into that story, I should back up a bit. Thoughtworks is a consultancy. We typically go into companies and partner with teams. We use their infrastructure, we use their environment, and we build software there. I'm challenged with shifting AI left with this team that we're on. The team's about 16 people, so it's not a very large team. I'm using the infrastructure that's present with the client. I'm using a lot of the systems that are there. A lot of times we have to uplift the client.
All the time, I'm constantly asked why I'm not using a full multi-agentic approach with how we're delivering software. What this talk is really about is how I answer that question. Specifically in the context of meeting the client where they're at, meeting customers where they're at, understanding our level of domain knowledge with the environment that we're operating in and the tools that we're working on. This talk is specifically about that particular journey and how I try to answer that question.
Background
My name is Wes Reisz. I am a technical principal, a technical partner at Thoughtworks. Again, what that means is I get to change clients all the time. It's a lot of fun. I usually go about 9 to 12 months with a client. About three months ago, we picked up a new client. This particular one is a large state in the U.S. What we're doing is building a knowledge graph. We're looking at different rules and regulations, ingesting them, and building a knowledge graph with a deep research agent so that some of their systems can interact with a modern, AI-type UI. That's what we're building. To do that, we're using an AI-first software delivery approach. We're shifting AI as far left as we can when it comes to building. What that means is we're specifically using Claude Sonnet 4.5 with Cursor on my team, with an approach that's very AI first. I'll go through some of that here today.
This is the entire talk. AIFSD, AI-first software delivery is not a one-size-fits-all. Choose your approach based on a few things. The two things that I'm going to talk about in particular are code longevity and automated verification. These are two axes within a two-by-two that I'll use to help answer this question of why I use a supervised approach with coding agents and not an unsupervised multi-agentic approach with the team that I'm at for the client that I'm working with now.
The second key takeaway is use a structured approach when working with LLMs. Has anybody ever heard of RIPER-5 by chance? RIPER-5 is an approach where you put the LLM into a partnering mode by giving it different instructions. I'm going to go through RIPER-5, and I'm going to specifically show how we're partnering, how our teams are working with the LLM, and not stepping back and waiting until after the code has been written and only looking at a PR. We're using the practices that Thoughtworks believes in, like pair programming. We're using continuous delivery. We're partnering with the LLM in this RIPER-5 approach. I'll talk about that. Then, last but not least, AI doesn't replace engineering discipline. You've heard that through talks throughout the entire day. What it does, though, is amplify your practices. If you have bad underlying foundations, AI will absolutely amplify those. If you have good ones, it'll amplify those as well. These are the key takeaways. This is what I'll be talking about today.
Outline
The agenda is specifically about considerations for AIFSD. This is that two-by-two model that I was talking about. It's how I try to answer that question of, why aren't you using this approach, or why are you using that approach? My goal for this talk is, regardless of where you're at in your journey with AI, for you to be able to leave, go back to your shop, and apply the techniques that I'm talking about, whether you haven't touched agentic development or whether you're building agentic solutions today. I want you to be able to take some things back to your shop. I will talk about engineering rigor, engineering discipline coupled with AI. Then I'm going to put it all together and show implementing an MCP server using the RIPER-5 approach that I specifically talked about.
Considerations for AIFSD
Considerations for AIFSD. You cannot be anywhere right now in the software industry and not see slides like these. This is an IDC slide that says by 2029, 26% of the worldwide IT spend will be spent on agentic AI. 26% is $1.3 trillion. I don't know about you, but I can't get my mind around what $1.3 trillion looks like. I just can't actually think of something that size. The way I try to put this in perspective: if you think of a $100 bill, just a simple U.S. $100 bill, it's about 6 inches long.
If you were to take $1.3 trillion in $100 bills and lay them end to end, they would circle the earth 50 times. That's a lot of money being spent on agentic AI. Have you seen MIT's report that says 95% of AI projects fail to deliver on ROI? We're spending $1.3 trillion on it, yet 95% are failing? What's going on? What is happening if this is the case? What I would propose is that we're not always mapping our use of agentic AI to where we are as a company; we're not meeting our client, our company, where they're at when we're applying these solutions. What I did is I put together a two-by-two model to answer that question that I described before.
I'm going to show that, but before I do, I want to acknowledge the fact that when you look up here, you will disagree with something. I promise you, you will. I disagree with something that's up here. That's ok, because models are nothing but a map for us to have a conversation, to be able to establish common dialogue. My map, my model that I use is this. This is a simple two-by-two, and on the y-axis you see longevity. How long will this code live? Is it short-lived, or is it long-lived? Is it going to be in production? Is it going to be facing the customer for some period of time? Down on the x-axis, I talk about degree of automated verification. Specifically, can you verify the thing that you're writing? If you can't, we need to do things in a more supervised approach.
Once you've established domain knowledge, you can do things in a more unsupervised approach. Let's walk through these different quadrants here. In the bottom left is exploratory development, aka vibe coding, the misquoted term that came from early 2025. It seems like it's been around forever. In February of 2025, Andrej Karpathy posted this now famous tweet that gets misquoted, but it talks about vibe coding. I vibe every single day.
If I'm talking to a client, and they're telling me about a problem, I let them talk to my product person for a few minutes, and then I start trying to put a solution together. Then I go, is this what you meant? The best answer they can give me is, "No, not at all. That's completely wrong. Please don't ever do that again". Great, I'll put it away, and I won't do it again. I got great information about it. I was able to learn immediately what they're looking for, or not looking for, in that particular case by just doing a simple experiment. POCs, doing a small bit of R&D, all these things I use just to be able to understand the scope of something that I'm doing. I do it every single day. Do I put it in production? No, of course not. I use it as a way of gathering information, looking at data maybe, or presenting some information in a different way. It's a valid approach for me to gather information, but it's not how I go about delivering software in a client environment.
In this case, it's short-lived, and I lead it. I'm very human-centric. That's that lower left quadrant. Let's move to the right across that bottom, along that degree-of-automated-verification x-axis, into domain sensing. These are things where I use deep research agents. We've all gone out there and used one of the LLMs to do deep research. I can do those outside the firewall. I can also do them inside the firewall to understand legacy codebases, to understand a problem when I'm coming into a client cold, to understand what I'm walking into. What are their boundaries? How do they think of domain-driven design? Where are their seams? Where can I introduce seams? What types of patterns do they use? I can use background domain sensing agents in those particular cases to understand the environment that I'm working in so that I can begin to make choices. This is information for me. It's not necessarily going to be around a long time, but it still needs to be safe, so we want a high degree of automated verification.
Moving to that top left quadrant, this is the area that I'm going to primarily be talking about today in this talk. This is about supervised and unsupervised coding agents in particular. On the team that I'm working on now, we're specifically focused on supervised coding agents. Why? Because three months ago, I put this team together. We came to a client to solve a particular problem. We know very little about the domain. We have to understand how to evaluate the domain before we can start to really build autonomous coding agents that go off and do things. We have to understand how to do it safely, how to put guardrails around them, how to establish identity for those things. All of that has to be done, so we take a very supervised approach. As we've now started to gain domain experience, we're moving to the right with unsupervised agents. We're beginning to do background tasks with simple code validators, doing things like checking our specifications, which I'll show. We're starting to build these things.
As we go forward, we'll be able to move more and more into unsupervised agents. The key point that I'm trying to make here is that AIFSD, how we approach delivering software, isn't a one-size-fits-all solution. Just because everything isn't an unsupervised agent doesn't mean you're not using AIFSD. I put this quadrant together to understand the different tools and techniques that you may have in each one of these particular areas. This is how I try to answer the question about why we aren't using a full agentic loop in building our software today. I answer it this way: because I don't yet have enough domain knowledge to build that degree of automated verification, and I need to do it safely. As I gain that, I'm able to do more of it. The bottom line, the TL;DR here, is AIFSD is not a one-size-fits-all solution. You fit the solution to the domain that you're working in. Choose your approach. These are the two things that I use; there are others that you could potentially use. As I said, you may use other things on the x and y-axis, but understand what the things are that you want to do. What are the options that you have? Then pick the approach that works best for your particular environment.
Engineering Discipline and AI
The next thing I wanted to talk about is engineering discipline and AI. One of the things that's missing when we work with LLMs is predictability. To be quite honest, if we had predictability, if we had determinism, why would we use them in the first place? The whole point is that it is not giving us the same thing every time. If it was, we'd call it a pure function. We'd give it something, and we'd just get the same output every time. The very fact that an LLM can give us something that is non-deterministic in nature is the power of what we're doing. We need to wrap that in a structured way so we can deliver something that is repeatable, even if non-deterministic.
The first thing that I'll talk about is spec-driven development. Everybody's coalesced around spec-driven development these days. A colleague of mine, Birgitta Bockeler, out of Germany, wrote a blog post on martinfowler.com where she dove into some of the spec-driven kits that are out there, like Kiro, Spec Kit, Tessl. These are all definitely ones that you can adopt. What I'm going to show today is just simple Markdown files with a specification that you can use at your shop Monday morning. There's nothing special you need. You absolutely can use tools like these, but you don't have to in order to get started on an AIFSD approach.
In this blog post, she also pointed out that when we say spec-driven development, we actually mean different things. There are three different approaches: spec-first, spec-anchored, and spec-as-a-source. Spec-first is where you create a spec, generate the code from it, and then get rid of the spec. Spec-anchored is where you create a spec that informs the creation, and you keep the spec around to inform the code that you're writing. Then with spec-as-a-source, the code is a byproduct; you only actually update the spec as the source. I haven't had tremendous success personally with spec-as-a-source. I use spec-first, where we generate the spec and then code is the thing that we produce. That's the approach that my team currently uses.
I did a podcast with a couple of engineers that are speaking here at QCon AI, and Prince Valluri from LinkedIn made a comment about the spec that really resonated with me. He said, a spec is the contract between the developer and the LLM. That made total sense to me. If you're going to be interacting with an LLM to generate code, the spec allows you to define that contract: clearly lay out the boundaries, what you're looking for, what you're trying to get from this particular execution with the LLM. Good contracts need good rules. What do you put in a spec to make it good? This goes back to December of last year. These are a couple of the founders of Databricks, Ion Stoica and Matei Zaharia. Matei actually spoke at QCon AI a few years back, the last one that we had. What they talk about in this particular paper is some of the things that should go into a specification to make it really useful. There are five things in particular that I try to focus on. I don't use this on every single spec, but I try. This is what I attempt to do with each of my specifications.
The first one is proof-carrying output. What can you put in a spec so that it can verify itself, so that it can say that it is correct? For this, I think of things like end-to-end tests. When I create something, I'll have a test that gets created that validates end-to-end what the actual specification did. Next is step-by-step verification. The first part is step-by-step. One of the things we know as software developers is to break problems down into smaller pieces. What this tells me is break things down into smaller problems and then verify as you go through. What I read when I see this is steps and then BDD. I introduce behavior-driven development into our specification so that we can actually test, before we write the code, the behavior that we're looking for. Why do I say it that way? You've all worked with an LLM and you've had it generate tests. When you have it generate tests after the code is written, it tends to fit the tests to the code and they become incredibly brittle.
If you start with things like BDD up front in your specification, you've done that before. It's test first. This is the approach that we're trying to use, to some effect. Execute and verify is the verify step. Once you create the code, verify your outputs are doing what they're supposed to be doing. Pre-conditions and post-conditions are very familiar to us. What must be true for this to start? What must be true for it to stop? Then statistical verification: these things are non-deterministic. We've all heard the stories, even today, that if you run it six times, you'll get six different outputs. That is true. How do we still deal with validating what we have when we get this type of output? Statistical verification is something we can add there. That's one that I don't tend to do a lot of. The first four, I do quite a bit of. These are things that I try to put in our specifications to give us more repeatable behavior when using an LLM to help us in AI-first software delivery that generates code with coding agents.
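To make the BDD-first idea concrete, here is a minimal sketch of an acceptance test written before any code exists. The module and function names (qcon_rag, search_talks) are hypothetical stand-ins for whatever the spec actually names; the point is that the behavior is pinned down first and the generated code has to fit it, not the other way around.

```python
# Behavior written from the acceptance criteria, before implementation exists.
from qcon_rag import search_talks  # hypothetical module named in the spec


def test_returns_relevant_talks_for_a_topic():
    # Given a transcript index that contains platform engineering talks
    # When I ask for talks about "platform engineering"
    results = search_talks("platform engineering", limit=3)

    # Then I get at most 3 results, each with a title and an excerpt
    assert 0 < len(results) <= 3
    for talk in results:
        assert talk["title"]
        assert talk["excerpt"]


def test_unknown_topic_returns_empty_list_not_an_error():
    # Given a topic that appears in no transcript
    # Then the tool degrades gracefully instead of raising
    assert search_talks("underwater basket weaving") == []
```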
You've all written code where you wanted to do something really simple and you're working with an LLM. Because of your rules files, because of maybe other things that you have in place, it goes really deep. You get layers and layers of abstraction, and you're just trying to understand the problem space or just trying to do something really simple. It's taking you too far. What's missing is being able to put the LLM into the mode that you're in, the mental model that you're in, when you're working with it, to put you both on the same level when you're interacting with it. What we use to help us with that is something we call RIPER-5. Actually, we didn't name it. It comes from a blog post from earlier this year, March 2025; someone named robotlovehuman, whoever that is, posted it on the Cursor forums. This is something we picked up. What RIPER stands for is Research, Innovate, Plan, Execute, and Review. What we do with this approach is we take a specification like the one we just talked about. We put in it those well-defined things that I talked about, the five elements at the bottom that we want to do.
Then we go into a research mode which is basically a way of passing instructions to the LLM that says your goal is to gather context right now, read files, but don't code. Your job is to really focus on understanding. Ask me clarifying questions about what needs to be done so that I can provide feedback and I update the specification for it. That's research. From the research phase, it goes into innovate, and that's where you say, give me three options. How might I go about implementing this? Then you use those three options to be able to pick one, add that back into your specification.
From that, I go into plan. Plan is where we take that specification and break it down into individual tasks, just like you would prior to AI on a scrum team with your team, saying, here's a story, what are the different things I need to do to implement this so that anybody could pick up that story? You task it out, but you use the LLM to help you do that. Once you've tasked those out, you plan each of those tasks, you execute, and then you review. That's the RIPER-5 model. I'm talking about doing this in a supervised mode. I'm specifically talking about that upper left quadrant where developers are pairing with the LLM to do this. This process could absolutely be used when we shift to the right into an unsupervised approach. This is the model that we follow in putting the LLM into the mode that we're operating in, whether that's with a background agent or whether that's in the foreground.
Just as important as what you can do with RIPER-5 are the things that you're forbidden to do. This is actually a really big, important point. When you're in the research stage and you're gathering information, you're understanding the codebase, you're having questions asked of you so you can provide feedback, you also want to tell it what it's forbidden to do. It cannot suggest, it cannot plan, and it cannot code. At this point, all you're doing is research. In innovate, it's forbidden to plan, make decisions, or code. In planning, it's forbidden to implement code. It's only when you get to execute that you actually write the code. At that point, it's forbidden to deviate. What's interesting about this is you see the tools today like Cursor actually adopting these too. They've introduced debug mode.
About a month before that, you formally saw plan mode: plan, then act. That's what this is doing. It's putting the LLM in that mindset, so that it's working with you as you're interacting with it. It's not jumping ahead and creating that fully abstracted solution where you've lost, in the weeds, the solution that's actually being implemented. Then, finally, in review, it's forbidden from skipping checks. This is where you validate that what was created is what the plan actually was. This allows you to look for drift and say, this is what you wanted to do, that was your contract, this is what actually got implemented, and what's the difference? That's what review gives us. That's the RIPER-5 mode that we leverage on our team.
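To make the allowed/forbidden structure concrete, here is a minimal sketch of how the five modes could be encoded and turned into per-mode instructions for the LLM. The wording is illustrative; it is not the exact text of our rules files.

```python
# A sketch of RIPER-5 modes with their goals and forbidden actions.
RIPER_MODES = {
    "research": {
        "goal": "Gather context: read files, ask clarifying questions.",
        "forbidden": ["suggesting solutions", "planning", "writing code"],
    },
    "innovate": {
        "goal": "Propose roughly three implementation options.",
        "forbidden": ["planning", "making decisions", "writing code"],
    },
    "plan": {
        "goal": "Break the spec into atomic tasks and plan each one.",
        "forbidden": ["implementing code"],
    },
    "execute": {
        "goal": "Implement exactly what the plan says.",
        "forbidden": ["deviating from the plan"],
    },
    "review": {
        "goal": "Compare what was built against the plan and flag drift.",
        "forbidden": ["skipping checks"],
    },
}


def mode_prompt(mode: str) -> str:
    """Build the instruction block that gets prepended for the chosen mode."""
    m = RIPER_MODES[mode]
    forbidden = ", ".join(m["forbidden"])
    return f"You are in {mode.upper()} mode. {m['goal']} You must not: {forbidden}."
```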
What does that look like in practice? It starts off with the rules file. Again, for this particular project, we decided to standardize in Cursor. The examples you'll see here are mostly in Cursor, but these completely apply to Claude Code or to Windsurf, or any of the other tools that are out there. It just may be AGENTS.md versus the rules file that are here. Up here, this is my rules, my commands. This is in a submodule that I have and I basically have the team check it out. It's available so that we can share it. You can also share these things inside of the Cursor environment, but we didn't want to specifically do that because we might have wanted to shift at some point to something else. This is where we set things up.
The next part is where we develop the specification that we're going to actually implement. From there, we go to that research step that I showed you. Research is where we ask questions. Innovate is where we go through different options for the implementation.
Then plan and task is where we break down that specification into individual tasks and then plan those tasks. Then from there, we execute and then we review. That's the process that we follow with RIPER-5. Nate Schutta is one of my colleagues at Thoughtworks. I showed him this while doing prep for the talk. He said, you know what really resonates with me about that is the process that you just showed. I don't know that it has anything to do with AI. That's how I work anyway. That's exactly how I work when I'm writing a specification. That's exactly how I work when I'm writing code. I understand the story. I break it down. I do some research. I do some innovation on how I might do this. It may not be a formal step, it's just something that I'm doing.
Then I break it down into what I need to do, and then I execute it and I check it. That's how I work. What I really like about RIPER-5 is it codifies the way we already work. It allows us to work with an LLM, or even build multi-agentic systems, that leverage the same type of approach.
We talked about RIPER-5. We talked about the four quadrants, about understanding where you're at. The next piece is what I said at the very beginning: AI is a powerful tool. It amplifies your practices, good or bad. You need to make sure that your foundation is solid, and if it's not, you can cause a lot of trouble and issues. For example, we can generate code much faster than we can actually review it. Defects can scale exponentially if we're not putting engineering rigor into our process. Architecture boundaries can leak, because maybe we don't have enough evaluations in the system to verify that we're keeping good boundaries between the systems that we're building, and we get leaky abstractions. We need to understand enough of the domain to make sure we're building the system right before we just build purely based on the outputs. You can get into traceability issues in regulated environments, things like that. You want to have guardrails in place. I work for Thoughtworks.
One of the things that we practice is something called our sensible defaults. Sensible defaults is whenever we set up one of these accounts and we go in with a new contract, we start with these defaults. This is our starting point. We may not use all of this because we're meeting a client where they're at. They may have some kind of environment that doesn't allow us to do continuous deployment. This is our sensible defaults. This is where we start from. Everybody shares these same beliefs. There are things like continuous integration, test-driven development, pair development. Let's talk about pair development. I get asked a lot, why are you pairing? The LLM is your pair. No, the LLM generates output based on the average of what it's trained on. The LLM is a tool that we use to help us. Also, what we're finding is that when we generate code, it is exhausting reviewing the amount of code that can get generated. The specifications are like 5 to 1 in the amount of specification and planning that goes into the code that actually gets written.
If you're doing all of that, it's exhausting reviewing it all. Pairs help us get through that, to make sure that we're reviewing things correctly and we're getting the right output. We also use trunk-based development. We're shifting that quality left and making sure we commit to main. Those types of practices, like pair programming, are what we continue to practice.
Other things there: building security in and shifting it left, automating builds and deployment pipelines, I mentioned continuous delivery, managing debt, and of course, building for production is our goal. All of that gives us fast feedback, repeatability, simplicity, and ultimately ties directly to business metrics like the DORA metrics that we're all familiar with: MTTR, deployment frequency, lead time, and change failure rate. The second takeaway here is that AI doesn't replace engineering discipline, it amplifies it. What are your sensible defaults? Whatever those are, we have to continue to invest in them. We have to make sure, whether it's a supervised or unsupervised agentic workflow, that we're building on solid foundations so that we amplify those foundations.
Put it Together: Implementing
Let's put this together into an example. This is a workshop that I did at QCon London back in March. What this basically did is, after every one of the talks during the conference, it would upload the talk into S3 and run a set of Step Functions that would transcribe the talk, break it down, and put it into a vector database. It basically created naive RAG. You could ask questions. It was just a simple tool call. In this case I used ChatGPT and created a simple tool call where you could go in and say, what were the key takeaways from the platform engineering talk that was earlier today? You can see here, it makes a simple call. It's just a simple tool call, a simple API call. All it does is go back to a dense retriever. It sends this thing that you're seeing here off to the server, it comes back, passing email, API keys, and things like that, dumps that back into the LLM, and gives me back a response. Very simple little tool call. It works great, but it's a one-shot. It does a single call against the vector database. Very nice thing for QCon. What I wanted to do for this little demo is take this and put it into an MCP server.
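For illustration, here is a minimal sketch of that one-shot pattern: retrieve once from the dense retriever, then stuff everything into a single prompt. The endpoint and environment variable names are hypothetical, not the workshop's actual service.

```python
import os

import requests


def retrieve_context(question: str, top_k: int = 5) -> str:
    """One-shot retrieval: a single call to a dense retriever over the transcripts."""
    resp = requests.post(
        os.environ["QCON_RETRIEVER_URL"],  # hypothetical retriever endpoint
        headers={"x-api-key": os.environ["QCON_API_KEY"]},  # hypothetical auth
        json={"query": question, "top_k": top_k},
        timeout=30,
    )
    resp.raise_for_status()
    return "\n---\n".join(hit["text"] for hit in resp.json()["results"])


# Everything the model will ever see is assembled into a single prompt; there is
# no opportunity for it to ask for more context. That is the limitation the MCP
# server version removes.
question = "What were the key takeaways from the platform engineering talk?"
prompt = f"Answer using only this context:\n\n{retrieve_context(question)}\n\nQuestion: {question}"
```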
As an MCP server, now we get multi-turn. Now I can actually give a tool over to the LLM and say, here is a tool that you can use to get information about QCon to answer questions. Let's do that. I'm going to use this RIPER-5 approach to walk through this and show you what that might look like. First off, this is just the README file. It has a ton of information because it's a shared public repo. It has how to set up the MCP server. You can see in Cursor how to configure that. Here at the bottom, it talks about how to set up a dev environment. There are some AIFSD principles that we try to follow. This talks about the RIPER-5 stuff that I just showed you a few minutes ago.
Then, here, we'll go down and show how inside Cursor you can set this up, so that Cursor will operate with these RIPER-5 modes. It gives you that planning type mode, but with each of the five steps. It also goes through here and talks about how you can do that submodule, set up rules, commands. These are things that, again, you probably saw in some of the talks about how to configure some of these things. This walks through some of the particular steps. That is the README file.
Let's jump ahead now and actually look at a specification. If we go into the specification, I used a free Jira account and named it SCRUM-5, whatever. This is the specification that I created. You can see architecture components. You can see acceptance criteria. Notice I did not use BDD here. Definition of done. Implementation tag. This is information that when I went through the research and innovate stage, I added to the specification based on the questions that were asked to me. All this is, is a simple Markdown file. That's all this is. It's capturing what I intend to do and establishing the contract of what I actually want to build. From here, now I want to go through and break this down into the tasks that I want to actually do. I've done research. I've done innovate. Now I want to go into plan.
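Because the spec is just Markdown, it's also easy to verify automatically. Here is a minimal sketch of that idea, assuming hypothetical section headings that mirror the ones described above; real specs would name their sections however the team prefers.

```python
import sys
from pathlib import Path

# Hypothetical required headings for a spec file.
REQUIRED_SECTIONS = [
    "## Architecture Components",
    "## Acceptance Criteria",
    "## Definition of Done",
]


def missing_sections(spec_path: str) -> list[str]:
    """Return the required headings that are absent from the spec Markdown file."""
    text = Path(spec_path).read_text()
    return [heading for heading in REQUIRED_SECTIONS if heading not in text]


if __name__ == "__main__":
    gaps = missing_sections(sys.argv[1])
    if gaps:
        print("Spec is missing sections:", ", ".join(gaps))
        sys.exit(1)
    print("Spec has all required sections.")
```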
The first step in planning is to break the specification down into tasks. If you look here, it's just a file that lists each of the tasks. There's configuration. There's setup for the API. There's MCP server setup. There's testing. There's documentation. You see some of the dependencies for each one of these steps. There are some different milestones that it created, a critical path, success criteria. All of this is the stuff we do intuitively when we're actually writing this. This is just breaking it down into a process that can be followed by an LLM and then ultimately by an agentic loop.
Once we've done that, and we've broken these down into all these individual tasks that we want to do, you go into a planning stage. That planning stage plans out what's going to be done. Remember, I said that in this stage we're using a supervised approach, which means my developers review this, verify what's here, and that first step up there in task 1, for example, this was a Python project. It didn't do a virtual environment. That wasn't part of anything that we specified. Our developers adjust the plan and say, no, I want a virtual Python environment before we get started and then have it start again. This allows us to put feedback and control and own each of the steps or the tasks along the way, even though we're using an LLM to be able to generate the code. From here you see the code that's actually generated.
If you look at the code that's there, there are about 5 or 6 files, about 50 lines each. There's very little code that was actually generated. It's about 10 to 1 on the amount of specification and planning versus the amount of code that's generated. I think this is a not-so-subtle point. Software is not just about creating for loops. It's the thinking that has to go in to create this MCP server so that I can actually pull this information in. It's about 10 to 1 on the code that's actually generated. LLMs do amazing things, but just generating code fast is not going to help us in creating production caliber software.
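For a sense of what those few files boil down to, here is a minimal sketch of an MCP server exposing one search tool, using the MCP Python SDK's FastMCP helper. The tool name, endpoint, and environment variables are hypothetical stand-ins for the generated project.

```python
import os

import requests
from mcp.server.fastmcp import FastMCP

# An MCP server that exposes transcript search as a tool the LLM can call
# repeatedly in a conversation (multi-turn), instead of a single one-shot call.
mcp = FastMCP("qcon-transcripts")


@mcp.tool()
def search_qcon_talks(query: str, top_k: int = 5) -> list[dict]:
    """Search QCon talk transcripts and return matching chunks with their talk titles."""
    resp = requests.post(
        os.environ["QCON_RETRIEVER_URL"],  # hypothetical retriever endpoint
        headers={"x-api-key": os.environ["QCON_API_KEY"]},  # hypothetical auth
        json={"query": query, "top_k": top_k},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["results"]


if __name__ == "__main__":
    # stdio transport is what a local client like Cursor expects for this kind of server.
    mcp.run(transport="stdio")
```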
What does this look like if we put it to use? This is Cursor. Up here I've registered the MCP server. Now I'm asking that exact same question that I did before: what are the key takeaways from the platform engineering talks at QCon London 2025? I went ahead and told it to use the MCP server that I just created, just for clarity. It goes into thinking mode and it says, let me get some information. It could have done a better job of dumping that to the screen. It goes through and does the first turn, asks another question. It's thinking using this tool. I only had one endpoint here, but there are others I could have added. It's going through here asking these questions. It got the platform talks first, then it asked for the key takeaways. This one's asking about Lesley Cordero's talk about scaling organizations. This one is Rachael Wonnacott talking about autonomy and fit for platform engineering teams.
Then now it's like, I have enough information. I can put this together. Now it assembles the things and puts this back together. It's an MCP server that's actually implemented with RIPER-5 in here. This example is a toy. It's for a QCon that was there. Depending on where you are in your journey with AI, imagine if this was your architecture decision records. Imagine if this was the PRs that have been sent to the system. What if this had engineering context about what you're trying to build, and you're making this available to developers? I'm doing a migration, upgrading Java, something like that, and I want to see what were the files that were last touched. I can look at those things, validate what's there. Not only can I do that, as I start to move into those fully autonomous workloads, I can use those with my agents to have shared context. These are some of the foundations of memory that you can build with just a simple MCP server like this. That's what a tool like this can give you with this type of approach, all using RIPER-5 across that top line of that two-by-two.
Approaches to AI for Your Team
What did I talk about? I talked about considerations for AIFSD. I talked about engineering discipline, and that AI amplifies the engineering discipline that you're using. Then I put it together with an example MCP server. That's what we went through. What does this mean for your team? If you start at the bottom, engineering discipline has never been more important. The things that we've built over the last 25, 30 years in software, we continue to need; we have to have that. If we don't, the pace at which we're able to implement things today can cause dramatic issues for us if we're not building on solid foundations. Continuous delivery, pair programming, trunk-based development, continuous integration. These are practices that continue to be important, never more important than they are today. I showed you a two-by-two model that used how long the code's going to be around, longevity, against your ability to do verification, specifically around the domain knowledge of what you understand.
Then there are different tools and techniques within AIFSD that fit into those different quadrants. What does your enterprise look at? What are the tools that you have available? What's important for you? I use that model to describe when and how I go into a supervised or unsupervised approach with coding agents with my teams. Then I use something called RIPER-5 to implement AIFSD. This puts a very rigorous process around using the LLM, in a mindset that engineers can relate to, so that we're operating the same way and the LLM is working with us. This is how I would recommend going forward. Choose the right AI mode that you're working in; it's not a one-size-fits-all.
Today, the teams that I'm working with have only been operating together for three months in this enterprise, so we're taking a very supervised approach. We're starting to implement unsupervised agents now to be able to do things like check our specification. The speaker from Qodo talked about validating our rules were actually tested, because the LLM may not always execute our rules even though we put them there. Even though I defined it, make it echo that back out so developers can check it. What if I can take that and actually test that with an agent to be able to validate that those absolutely ran with testing? We're working on autonomous agents now to be able to do these things as part of our pipelines, as part of what we're delivering. What I want to focus on here is that it's ok that you're on that continuum. It doesn't have to be a one-size-fits-all that you're all doing continuous autonomous agents.
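Here is a minimal sketch of what one of those unsupervised rule checks could look like, assuming a hypothetical convention where each rule must echo a marker like [RULE:tdd-first] into the agent's output so a background validator can confirm it actually ran.

```python
import re
import sys
from pathlib import Path

# Hypothetical rule identifiers that the rules file requires the agent to echo.
REQUIRED_MARKERS = {"RULE:riper-modes", "RULE:tdd-first", "RULE:no-secrets-in-code"}


def missing_rule_markers(agent_log: str) -> set[str]:
    """Return the required rule markers that never showed up in the agent's output."""
    seen = set(re.findall(r"\[(RULE:[a-z0-9-]+)\]", agent_log))
    return REQUIRED_MARKERS - seen


if __name__ == "__main__":
    log_text = Path(sys.argv[1]).read_text()
    missing = missing_rule_markers(log_text)
    if missing:
        print(f"Rules not acknowledged by the agent: {sorted(missing)}")
        sys.exit(1)
    print("All required rules were echoed.")
```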
Key Takeaways
Four key takeaways for this talk. AIFSD is not a one-size-fits-all solution. Choose your approach based on code longevity and degree of automated verification; those are the two that I use. Use a structured approach when working with an LLM, like RIPER-5. AI doesn't replace engineering discipline, it amplifies your practices. Focus on what your sensible defaults are to be successful.
Questions and Answers
Participant 1: Are you finding success with this approach in large existing codebases?
Wes Reisz: This particular project, where I'm specifically applying it, is greenfield. I haven't applied it to a large codebase, but that sensing quadrant down there at the bottom is specifically where I have used it on legacy codebases to understand what we're getting into, what we're walking into. I've used it in that particular way. RIPER-5 in particular, no. However, remember what I said: RIPER-5 is how I worked when I worked in software. It's not any different. In that research mode, that stage, you define a specification of what you want to do, then you research it.
You look at the codebase, and you can tell it, as part of your commands, to understand what patterns they're using in the codebase. Then ask me questions on how I might apply this to the codebase to get some feedback on it, and then look at what options I might have. Feed that back into the specification, adjust the specification to make it match, and then keep iterating with it. I see no reason why I wouldn't, but in transparency, I started this three months ago specifically with this particular team, and it's been really successful.
Participant 2: I like how you summarize what many of us are eventually converging on, and give it names. That's great for conversation. One aspect that I struggle with a lot when going through this process is actually boredom, because I have to wait for the agent to think, and then generate the content, and then update the definitions, and then generate the code. Have you been able to fully automate any of these transitions?
Wes Reisz: You're interacting with a thinking LLM, at least in the one that I was just showing. It does have some slowness. One interesting thing is that review stage in there. What we can do is generate code, have it review, and look for drift. Then from that drift, we can make updates back to the specification, if you want to keep that drift, so that the specification picks it up. That shortcuts a bit of the time spent always going back to the specification to regenerate the code. That's a practice that's helped: understand which of these changes we want to keep, and then update the specification.
Other things I don't particularly love, but with the latest version of Cursor with agent mode, you can run things more in the flow, in a chat interface. That seems to allow you to parallelize things a little bit better. I've done multiple. I talked to the Cursor team about that same question, and they specifically recommended, at that time, multiple tabs. I'm not a big fan of that, because the cognitive load on top isn't great. You could see, even in that specification that I had, it had dependent paths, and the reason why I keep that there is because we're moving towards getting more autonomy and being able to develop agents that can do each of these stages. As we do that, we'll use that to be able to parallelize. We're still getting to the stage where I can do it. That will be my hope for how we streamline some of that slowness. Yes, there is a wait stage.
Participant 3: I do know different models work better in different stages for different processes.
Wes Reisz: Yes. Research is thinking. I use Claude Sonnet for code generation. They leapfrog, though. Primarily for code generation right now, I use Claude Sonnet. I've told my Google folks that Gemini 3 is much better. We'll see. I haven't got there yet. Claude Sonnet is what I use for code generation. I use Gemini for some research things. That's probably just more of a personal preference. I don't have a list I can give you, though, other than Claude Sonnet is what I have found to be my preferred code generation model.
Participant 4: I saw that you say you have 10 to 1 for the specs versus the code, which feels like it takes a lot of time. Do you have a sense whether the output you're generating is faster, is higher quality? Is it more feature-filled? Is it based on intuition, or do you have actual numbers? What dimensions is this improving the software process?
Wes Reisz: On this particular project, I don't, because it's three months, and we don't have DORA yet effectively in place to be able to have real metrics. I have intuition from developers that they feel faster. You've heard in some of the talks that that feeling can actually be wrong when you actually look at metrics, so I don't have a lot in there. We are doing developer surveys to get some of that stuff. On another project, they didn't use RIPER-5, but they used a very similar approach. They are fully instrumented with DORA metrics, and they were, in production, doing continuous delivery. They have actual metrics on what their developer productivity was while holding MTTR the same, using not RIPER-5, but a very similar research, plan, execute, and review type vocabulary that they use, so very similar. On my particular project, for the one that we're running right now, I haven't got the DORA stuff in place to give you a good, firm engineering answer.
Participant 5: I'd never heard of RIPER-5, but I find myself following the exact same workflow, like you said, without it. When you split up into tasks, for the one example that you were showing, I saw on the Cursor tab on the left that there were nine different .md files, one for each task. I was wondering if that was AI doing something in the meanwhile while it was working, or if you specifically split up the tasks into one .md file each, and if you find improvements with that.
Wes Reisz: I specifically split it out into each individual task so that we have that context, and I have different pairs that might pick up tasks, because they're atomic tasks that should be accomplished on their own. Generally, we find that we don't do that, but the goal as we ramp velocity is that multiple people can pick up different tasks to be able to implement, so I keep them atomic for that reason.
Participant 6: What recommendations could you give to teams who work on brownfield projects, where these projects are distributed, for example, across several GitHub repositories or microservices? We have many microservices, some of them live in the same repo, others in other repos, and even for humans, it can be hard to reason about.
Wes Reisz: What tools do I recommend for a model?
Participant 6: Which frameworks, not tools, but what recommendations in general can you give to the teams who work with microservices, for example?
Wes Reisz: There was a whole talk from Sepehr that actually went through Claude Code and went through Cursor, and specifically talked about some of those. I'd point you to his, because he broke down different ones really well. I specifically focused tooling on Cursor for what I was doing. I have used it, not specifically with RIPER-5, but against monorepos and individual repos. At times I do mount several projects, for example, to be able to look at infrastructure when I have things separate. Yes, Cursor's what I tend to use. His talk was really good. I'd take a look at that one. He's got a lot of tips and techniques on different tools.
Participant 6: Can we use RIPER-5 for, say, microservices?
Wes Reisz: Absolutely. Yes, RIPER-5 is a logical construct. It has nothing to do with any framework. In the slides, when you get them, there's a repo there. In the README file, it'll show the rules for how to give the LLM the instructions, what is meant by research. Whatever tool you're using, you load it into the settings for that tool. I go through how to do that in Cursor here, but you can do the same thing in any tool that you're using. It's totally up to you how you do it. I've seen some RIPER-5 rules specifically created for Claude Code that you can just clone from a repo, but yes, it has nothing to do with the tool I use. I just happen to show Cursor.
Participant 7: I'm curious if you have an example on supervised processing?
Wes Reisz: Not in this particular one.