Rohit (00:00)
Hello everyone, welcome to another episode of the AI x DevOps podcast. I'm your host Rohit, co-founder at Facets Cloud. Before we dive in, let me quickly apologize for the longer-than-usual break between episodes. Over the past few weeks, we have been building Facets Intelligence, an AI suite
which just launched last week. So stick around till the end of the episode, where we'll share a small sneak peek into what we have built. And to make up for the delay between episodes, we have a special guest with us today: Sanjeev Ganjihal. He's a solutions architect at AWS, a specialist in container technologies, and he brings vast experience in containers, cloud, and more. He's also a very
early adopter of Kubernetes and an expert in that area, and he brings first-hand insight into how the AI, or AIOps, ecosystem is shaping up. We are excited to hear from you, Sanjeev. Welcome to the show. Could you please introduce yourself to our audience?
Sanjeev Ganjihal (01:08)
Yeah, thank you Rohit. Thank you for having me here. I am Sanjeev Ganjihal. I'm a Senior Container Specialist at AWS. I support the financial services industry, helping them run containers at scale, and I also do a lot of work in AI/ML and high performance computing for different sets of customers at AWS. On a day-to-day basis, again, it's a lot of work
leveraging cutting-edge technologies, blending them with containers and Kubernetes, and helping customers run these workloads at scale. That's what I do on a regular basis.
Rohit (01:48)
And I think I've also heard this trivia about you, that you were the 100th person to get Kubernetes certified, back in 2017 or something like that.
Sanjeev Ganjihal (01:57)
Yeah, my journey started in late 2016 when I started migrating traditional apps to Docker containers. That was the initial phase, but there were a lot of challenges. There was no orchestration layer; I needed to manually manage those containers. So when Kubernetes came out, again, out of Google's Borg, I spent a lot of my time
understanding what Kubernetes is. And when the certification finally came out in September 2017, I thought I'd take a stab at it. It was not easy to complete the certification, because until then everything had been multiple choice; there was no hands-on exam back then. This was the first live hands-on exam, which evaluated all your commands and your understanding of Kubernetes. So I'm glad I was able to
get my CKA back in November 2017. And yes, on the certificate it says 100, so certificate number 100. That's basically how it started, and I'm happy to continue my journey in Kubernetes. I'm still a Kubernetes evangelist, helping customers and organizations run Kubernetes at scale.
Rohit (03:09)
So really an early believer in Kubernetes and an early backer. 2017 is really early.
Sanjeev Ganjihal (03:16)
Yeah, early, yes, because you're moving from virtual machines, a lot of virtual machines, to a different orchestration layer, one that manages a whole fleet. Again, it can be bare metal, it can be on-premise too. But yeah, it turned out to be a great investment from a learning perspective and also from a work perspective, right? You started seeing a career trajectory, started seeing growth opportunities.
And also projects, right? So building large scale projects at different organizations. So I think it was definitely a great journey to stick with Kubernetes, understand the complexities of it, and start building things for organizations.
Rohit (03:59)
Yeah, and probably we are at a similar moment in terms of generative AI as well, right? The interest and adoption decisions people make right now might eventually shape their careers, right?
Sanjeev Ganjihal (04:14)
Yeah, I mean, we are 10 years into Kubernetes now and the game is not over, right? There are new sets of challenges, new cutting-edge technologies coming in, and Kubernetes is becoming the de facto operating system. We call it the de facto operating system now because on top of Kubernetes orchestration you can deploy tons of tools to solve different problems. For AI/ML, you have a set of tools.
Rohit (04:18)
Yeah.
Sanjeev Ganjihal (04:41)
For generative AI, you have a set of tools. For traditional microservices, you have a bunch of tools. So I think it's evolving as the de facto operating system for generations to come. You had Linux, you had Solaris, you had Windows. I think Kubernetes should eventually be shipped as an operating system, is what I'm envisioning, where you don't have to manage the control plane or the data plane. It just becomes a commodity and an operating system by default.
Rohit (05:10)
So Sanjeev, I was also curious about your personal generative AI adoption journey. What does that look like so far? What does your generative AI arsenal, if I may, look like?
Sanjeev Ganjihal (05:22)
Yeah.
Yeah, so I'll give you my background. I started my career with traditional web servers: WebLogic, WebSphere, JBoss, IIS. These are all web hosting technologies I was working on in the initial days of my career. And then the transition happened from the web and middleware layers to DevOps,
looking at DevOps implementations, leveraging tools like Chef and Puppet. I'm not sure if your viewers have heard of those technologies, but yes, pretty old: back then it was Chef, Puppet, Ansible. Ansible is still popular now, but Chef and Puppet, I'm not sure.
Rohit (05:59)
Chef runbooks
and things like that used to be popular back in the day.
Sanjeev Ganjihal (06:02)
Correct, those
cookbooks, writing all those. That transition helped a lot in my career: moving from traditional web and middleware to DevOps and eventually to containers, and now adding a lot of other layers, right? Generative AI, machine learning, high performance computing. The world is evolving and choosing Kubernetes as the de facto platform to run various sets of workloads.
So I'm seeing that journey. As for my generative AI journey: as I said, I went from infrastructure and web middleware to a person writing a lot of Python. I'm more of a Python guy. For me it was writing a lot of complex Python, because we moved on from shell scripting; Python was becoming the de facto language in the DevOps world, where you can build a lot of
good automation tools. So that was my journey initially. It was complex again, right? For people not coming from a computer science background, because I did my bachelor's in electronics and communications, there was some computer science, but I'm not a core computer science guy. So my journey with Python has been great:
a lot of time spent understanding how Python works in comparison with Java, .NET, and all that stuff. It was definitely complex back then, but look at it now, right? With the advent of generative AI, models have been trained on massive datasets from the last 20, 25 years. So you're now able to just ask in a natural way: I want to do X task, can you help me with
code in Python, Node, whatever language you want. As for my journey over the last few years: I'm currently at AWS, and when the generative AI landscape picked up, I started looking at GPUs before it all went viral. Within the organization, we started looking at how we can leverage GPUs,
how we can build training and inference use cases. So we launched something called the JARK stack: Jupyter, Argo, Ray, and Kubernetes. I'm not sure if you've heard of Ray for distributed computing; and then there's Karpenter on Kubernetes. So the journey has been pretty strong in the generative AI space, not just building generative AI apps, but also on the infrastructure side.
Rohit (08:37)
So you've also dabbled in hosting your own models and getting them to work on your own infrastructure, things like that.
Sanjeev Ganjihal (08:47)
Yes, because I'm a Kubernetes evangelist, right? So when these large language models came out, it was important to see whether Kubernetes was a fit for these models or not, because you've got to pull these models from Hugging Face and download the weights. It was not an easy task initially
to get the models from Hugging Face, deploy them on Kubernetes, and then serve them eventually. It's not just about getting those models; you need to serve those models as well. So inference was not that easy. Now we have a lot of technologies like vLLM, LMCache, NVIDIA NIM, and Ray Serve for serving your model. So the inference landscape has also changed drastically from where it was in '21 to where we are in '25.
I think that shift, '22, '23, is when AI aggressively started picking up. Before then, there were traditional machine learning algorithms for fraud detection or personalization, chatbots, customer experience. Those are areas where companies were leveraging machine learning, but with the advent of generative AI, things are shaping up much faster.
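As a rough illustration of the model-serving step Sanjeev describes (pulling weights from Hugging Face and serving them), here is a minimal offline-inference sketch using vLLM's Python API. The model id and prompt are illustrative assumptions, and production serving would typically sit behind Ray Serve or vLLM's OpenAI-compatible server rather than this local call:

```python
from vllm import LLM, SamplingParams

# Downloads the weights from Hugging Face on first run (assumes a GPU is available).
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # illustrative model id

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(
    ["Explain what a Kubernetes operator does in two sentences."],
    params,
)

for out in outputs:
    # Each result carries the generated completions for one prompt.
    print(out.outputs[0].text)
```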
Rohit (10:08)
Yeah, so you mentioned your path through Python, and LLMs, you know, also happen to be quite fluent in Python for some reason. So do you use generative AI in your day-to-day development as well, in some capacity?
Sanjeev Ganjihal (10:22)
Yeah, I'm happy to talk about that. So initially, ChatGPT came out, and everyone was going crazy about ChatGPT. I also started leveraging ChatGPT to some extent; I wanted to understand its predictive nature: based on the existing data it has trained on, how can it generate new content?
I wanted to test it out. So my initial journey was with ChatGPT, asking some good questions about Kubernetes. In the initial days of ChatGPT there was a lot of hallucination, right? It was trained on a good amount of data, petabytes of data, but the responses were pretty bad because it was hallucinating. The major challenge was bad data,
so the responses were not that great. But now, I think the accuracy of the models has gone up. Some state-of-the-art models give 95% accuracy, and anything above 80%, in my opinion, is great. So with human intelligence, and then you add AI to that, it's a lethal combo, I would call it.
Rohit (11:32)
So, I mean, I personally have been observing people around me, right? Those who leverage it for development or even debugging and problem solving. Some folks have a style where they have their IDE open, they are in control, they are the ones developing, but they'd like AI to assist in, you know, auto-completing a function; they've written the docs and they just want help implementing it.
Get the grunt work done, but they stay in control. Then there are a few others who use tools like Aider or Claude Code or whatnot, or Amazon Q Developer, and think of them as personal developer assistants: they just discuss the task with them and leave the implementation completely up to the AI tool at hand. So which of those styles do you fall into? I myself
tend to fall into the second one right now. I can't remember the last time I actually spent a long stretch actively looking at an IDE. Rather, I would just be talking to Claude Code, discussing the problem, figuring out an approach, and saying, hey, go ahead and implement it. Then I just look at the test cases and whether they're running fine or not. So which of these brackets do you fall into, and what would you advocate?
Sanjeev Ganjihal (12:46)
I think I'm a mixed guy, because I do not want to rely on one single tool. I believe different models and different tools give different perspectives on different problems. So my approach is basically: if you have a problem statement, evaluate different tools, see which one is good at code generation and which one is good at the logic as well. How do you build a particular app? You need to know the
end-to-end of building that app. Once you have an idea, once you have the logic, you need to have a proper structure. So my approach is always mixed. I have Visual Studio Code always open. I have Q Developer; I use that to a great extent. I have Claude Code; I use that as well. And I'm not a big fan of ChatGPT now, because I think Claude Sonnet
has overtaken the benchmarks and I feel it's much more capable than ChatGPT. Again, this is just my take; do not consider this as coming from my employer, it's just my personal take, because I've been using GPT too, and Claude is doing much better for me. And Q Developer, which I use on a day-to-day basis, is also
leveraging Claude Sonnet under the hood. So a lot of good stuff is happening there. And there are some open source models I also use, which I run locally on my personal laptop; if I want to interact with them, I'll pretty much use those as well. There are some AI assistants you could deploy that will complete a lot of tasks for you.
You just give them a README, a markdown file, and they'll give you a proper flow of what needs to happen, right? So from that, you're taking an idea, structuring that idea, prototyping it, testing it, and then eventually launching it. There are multiple steps involved. It's not like you just go tell Claude; if you blindly take whatever Claude generates, it's going to have tons of
loopholes as well. This is where prompt engineering is also very important. But again, you need to understand what type of app you are building and what the use case of that app is. You need to keep in mind the security aspects of it, whether you're leaking anything which can be compromised down the road. Being at a cloud provider, at a hyperscaler, security is always a top priority.
AI security is always very important, because data is money at the end of the day. If you compromise on data, you're compromising on everything.
Rohit (15:35)
Yeah, I think I concur with you on the Sonnet performance of late. I've also found it to be way better, especially on technical tasks; it really excels.
I've tried tools like Aider, which is also a programming assistant sort of thing. I have also tried Codex, which is the OpenAI equivalent of Claude Code, and have dabbled in writing our own agents, which are more Claude Code style but leverage the OpenAI models. But we too have recently found that Sonnet is outperforming the competition by a margin.
And you also talked about the structured way of doing things. That is also something I loved with Claude Code, the CLI interface, where, I mean, it plans, creates a checklist, and,
you know, proceeds through the checklist automatically. This is something we tried in our own agents as well. We have a Kubernetes agent which also plans ahead first, with some prompt engineering of course, but it has tools to keep track of the to-do items, keep track of what is done, what is not done, what is cancelled, and then proceed. And then you see a lot more, you know, autonomy in these AI agents, because now they are planning and doing things much more
Sanjeev Ganjihal (16:55)
Absolutely.
Rohit (16:58)
like us, like how we as a team operate. And that's really beautiful to see. Claude Code was a big inspiration for that, because it pointed us in this direction where you have a checklist that you go through. So yeah, the structured approach is seriously important, and I think anybody who is leveraging the AI toolchain should definitely employ that and probably bake it into their system one way or the other.
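As a hedged sketch of the checklist-style planning Rohit describes, here are minimal to-do tracking tools an agent could call between steps. The names and the in-memory store are illustrative assumptions, not the actual Facets or Claude Code implementation:

```python
from dataclasses import dataclass, field

@dataclass
class TodoList:
    items: dict[str, str] = field(default_factory=dict)  # task -> status

    def add(self, task: str) -> None:
        # The agent registers tasks while planning, before any execution starts.
        self.items[task] = "pending"

    def mark(self, task: str, status: str) -> None:
        # status is one of: pending, in_progress, done, cancelled
        if task not in self.items:
            raise KeyError(f"unknown task: {task}")
        self.items[task] = status

    def remaining(self) -> list[str]:
        return [t for t, s in self.items.items() if s in ("pending", "in_progress")]

# Example: the plan an agent might register before touching a cluster.
plan = TodoList()
for step in ["inspect failing pods", "check recent deploys", "propose fix", "apply fix in dev"]:
    plan.add(step)
plan.mark("inspect failing pods", "done")
print(plan.remaining())
```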
Sanjeev Ganjihal (17:25)
And the good thing is, if you're doing DevOps or traditional code generation, it's good to see how different state-of-the-art models are doing. For example, if Claude is generating some code, you can take that code to GPT or even Llama and ask, hey, what is the accuracy of this code? It'll tell you: from a security perspective, this code is at, like, 70 out of 100; implementation-wise, logic-wise, it's at
maybe 60 out of 100. So you'll get that logic, and you'll also know the areas of improvement. So instead of just sticking to one particular tool, you can take the code generated by one of the models and use it as input for another model, where you learn what the problems are. You catch
the loopholes in the artifact that a particular model has generated. That's another approach I feel is very interesting. It might take some time, but it is worth the investment, because now you're working with all the state-of-the-art models and taking their input.
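As a rough sketch of the cross-model review Sanjeev describes, here is a minimal Python example that has a Claude model generate code and then asks an OpenAI model to score it. The model ids, prompts, and scoring format are illustrative assumptions, and it presumes both SDKs and API keys are configured:

```python
import anthropic
import openai

def generate_code(task: str) -> str:
    client = anthropic.Anthropic()
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model id
        max_tokens=1024,
        messages=[{"role": "user", "content": f"Write Python code for: {task}"}],
    )
    return msg.content[0].text

def review_code(code: str) -> str:
    client = openai.OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model id
        messages=[{
            "role": "user",
            "content": (
                "Score this code out of 100 for security and for logic, "
                "and list areas of improvement:\n\n" + code
            ),
        }],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    code = generate_code("parse a Kubernetes pod list and report pods not in Running state")
    print(review_code(code))
```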
Rohit (18:34)
Yeah, that's interesting. We have been employing very deterministic checks. For example, if it's writing some Terraform, we always have a checklist set so that it runs Checkov and makes sure the Checkov checks are passing. Now, what you're suggesting also seems like an interesting quality gate to add, where another model reviews the code and gives some suggestions as well, so that if there are blind spots, it catches them early.
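As a minimal sketch of the deterministic Terraform gate Rohit mentions, here is a Python wrapper that runs Checkov over a directory and fails if any check fails. The directory path is an illustrative assumption, and real pipelines would usually run this in CI:

```python
import json
import subprocess
import sys

def run_checkov(terraform_dir: str) -> None:
    # Checkov scans the Terraform in the directory and emits machine-readable JSON.
    result = subprocess.run(
        ["checkov", "-d", terraform_dir, "--output", "json"],
        capture_output=True, text=True,
    )
    report = json.loads(result.stdout)
    # Checkov may return one report or a list of reports (one per framework).
    reports = report if isinstance(report, list) else [report]
    failed = sum(r.get("summary", {}).get("failed", 0) for r in reports)
    if failed:
        sys.exit(f"Checkov found {failed} failing checks in {terraform_dir}")
    print("All Checkov checks passed")

if __name__ == "__main__":
    run_checkov("./generated-terraform")  # illustrative path
```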
Sanjeev Ganjihal (18:57)
Yeah, this is where multi-LLM routing comes into the picture. When you build these agents, you can route certain tasks. For example, if Sonnet is great for code generation, you can leverage Sonnet for that. And let's say there's another model which evaluates the existing code generated by Claude and tells you: these are the areas of improvement, I did XYZ checks, and this is what I found. Then that can go back as a feedback loop to Claude, and
Claude can say, these are the things I want enhanced. Again, these are competing products. I'm not saying go pick an LLM that is not good and then feed its output to Claude; that's not good. You need to have competing benchmarks. Some Chinese models are amazing as well, though again, you don't want to share proprietary data with Chinese models. But
for general apps, it's good to take that route, see how they're performing, what ideas they're generating. There are endless possibilities, in my opinion.
Rohit (20:05)
In your experience, since you work closely with customers, how do you see the industry adopting AI? Are there some forerunners? Are there some who are making mistakes, lagging behind, or going slow on it? What is your general read on how the software industry is embracing AI, whether for development or for building their products?
Sanjeev Ganjihal (20:31)
Yeah. So I'll give you a nice story. Back in '23, when everyone was talking about generative AI, it was all about GPUs. NVIDIA as a company was the pioneer in the GPU industry, and it is still a pioneer building GPUs. What I saw was that workloads running on traditional CPUs and traditional memory started to accelerate
much faster with the usage of GPUs, right? There was a lot of demand for GPUs back then, and slowly the transition I started seeing across customers was a move from traditional CPU-based architectures to GPU-based architectures, on Kubernetes again, by the way, leveraging device plugins, NVIDIA device plugins, and different GPU optimization strategies
like time slicing or multi-instance GPUs. So customers started making the initial shift, which is moving from CPU architectures to leveraging GPUs to accelerate performance. If a particular task took, let's say, an hour, they were able to get it down to 10 minutes. That's a win using GPUs. This is where NVIDIA was saying:
with the usage of GPUs, you'll be able to perform faster, you'll be able to do things faster. So that was the shift I saw in '23. Slowly, in '24, we started seeing generative AI apps being built across the board. It all moved from the initial experimentation phase to the prototyping phase, and we saw some amazing generative AI use cases come out in '24.
In '25, it's all about agentic, right? Now that we have all these generative AI apps, can these agents do things for us? Why do we have a human in the loop? Can they do all these collaborative tasks together? This is where agentic AI is starting to pick up, and I think this year it's all about agentic AI and customers exploring it. I wouldn't say
every customer is taking agentic AI into their production environments. It's still in the experimentation phase for various use cases. The challenge with agentic AI is: what if there is a malicious prompt injected into an agent, or one agent goes rogue? What about the other agents? How are you observing them? How are you managing that kind of situation
in your production environment? There are some security loopholes today on the agentic side. What if there is something one agent needs to do and it doesn't cascade the next task to another agent? What is the mechanism we will employ there as humans? So there are good things we can take from the agentic side; it's all autonomous, but there are challenges we need to keep in mind.
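As a rough illustration of the CPU-to-GPU shift on Kubernetes that Sanjeev describes, here is a minimal sketch of requesting an NVIDIA GPU for a pod with the official kubernetes Python client. The image, namespace, and names are illustrative assumptions, and it presumes the NVIDIA device plugin is installed so `nvidia.com/gpu` is a schedulable resource:

```python
from kubernetes import client, config

def launch_gpu_pod():
    # Load credentials from the local kubeconfig (assumes cluster access).
    config.load_kube_config()

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="inference-worker", namespace="ml"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="worker",
                    image="vllm/vllm-openai:latest",  # illustrative image
                    resources=client.V1ResourceRequirements(
                        # The NVIDIA device plugin exposes GPUs as this extended resource.
                        limits={"nvidia.com/gpu": "1", "cpu": "4", "memory": "16Gi"},
                    ),
                )
            ],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="ml", body=pod)

if __name__ == "__main__":
    launch_gpu_pod()
```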
Rohit (23:37)
And in terms of productivity, you know, just general coding and that sort of work, do you see a pattern in how effectively people are able to leverage it?
I've read through a DORA report from, I think, a year and a half back, where Google claims that people who have adopted AI in their development processes are yet to see any, you know, valuable outcome from it. Maybe that has changed in the last year, because things have been changing left, right, and center in this space.
How, in your observation, have teams been able to unlock extra potential or extra productivity out of these AI development toolchains? Or is there still a bit more of a waiting game to be played before we can actually articulate the results in terms of numbers or effectiveness?
Sanjeev Ganjihal (24:30)
Yeah, I think as of now, again, I'm going to be general; these are consolidated observations, right? One, the biggest problem organizations have today is the cultural landscape, which they have to solve. If you look at large organizations, some have 10,000-plus employees, some 5,000-plus, some 1,000-plus, some maybe 100,000-plus, right? So it's all about the
cultural and structural changes within the organization. And this is where I want to talk about three different personas. There are AI evangelists who will just say, replace everything with AI, which, in my opinion, is too wild, because we are not in a state today where AI does everything 100%, right?
We are in a position where it gives maybe 60, 70% accuracy. There is still a 30, 40% gap that I see in AI, so it's important to keep that in mind, and organizations are worried about that 30, 40%. If they give complete control of their DevOps or their GitOps framework to an agent, how are they going to manage problems? What I hear is,
for DevOps, the complexity is growing because of everything being added. It's all about learning, right? A lot of people need to adapt to the latest technologies, and if they are not adapting, there is a gap. So one option is to fill the gap by learning the cutting edge: learning about generative AI, learning about machine learning,
learning about artificial intelligence, and leveraging them in your day-to-day workflows. The other option is that you only pick low-risk, high-improvement areas. Organizations say, hey, this is the least risk; can we experiment with AI in those areas? It's not impacting production yet, but we will take baby steps.
And this is where I see most of the industry: they are taking steps. There are some, what do I call it, greenfield applications or greenfield customers who can go do anything, right? They don't have to worry about a customer base or end users; they just go build cool stuff and sell the cool things. For them, it's easy. For organizations that are heavy into
DevOps practices, platform engineering practices, or GitOps practices, or that use traditional mechanisms to do them, it's going to be a huge change, because adding AI means they need to rewrite a lot of things. What used to be a Jenkins or Argo or Flux pipeline is now something else, right? It's an agent doing
different checks and balances for your continuous integration or continuous delivery. So I would say the adoption rate is good, but not excellent, and it's all use-case driven. Everyone loves to see that the world is changing because of GPUs, that the world is changing because of generative AI, but what are the use cases I'm building out of it?
In different industries, for example healthcare: can I do drug discovery? Can I do genomic sequencing? Can I do molecular sequencing? In financial services: can I do fraud detection? Can I do personalization? So there are different use-case-driven things that are important, especially when you're using AI in production.
Rohit (28:29)
I think it's time all organizations invest in a small focus group that can go scout for these opportunities where it is safe and meaningful to experiment with AI, and do that rather than blanket-trying to move to AI 100%.
Sanjeev Ganjihal (28:43)
Yes.
This is where I'm seeing, if you look at Meta, for example, they have this superintelligence team, right? They have now built a superintelligence team that will just be focused on cutting-edge models, building cutting-edge models, and those folks will be responsible for that AI focus area. Similarly, I see organizations converging different roles. So what they'll do is,
what is DevOps today will evolve into something else, and what is platform engineering will evolve into something else. It's a convergence of different roles, maybe with different titles, per se. For example, model ops engineer or model operations engineer: what was a DevOps engineer before is now maybe a model operations engineer, because they are dealing with large language models day in, day out, and they need to maintain those models.
It could be a new title for them. But what I see is the roles will kind of start getting a new identity and rebranding. And they may do AI work in conjunction with existing work. It's not like you'll be replacing everything in one year. It takes time. But they'll basically adapt it step by step. That's my take.
That's the observation that I see because if you disrupt immediately, it's going to have a huge impact.
Rohit (30:16)
Since we are on the topic of use cases, and also since you are a Kubernetes expert, I think one of the areas where there is a lot of disruption, I mean a lot of talk of employing AI, is on the SRE side of things, you know, operations of Kubernetes. There are tools like K8sGPT or Google's kubectl-ai that have
made life easier for engineers. Sure, there are MCP servers, like AWS's own MCP server or things like the ClickHouse MCP, which make information retrieval and also configuration way easier. But we are yet to see an autonomous, agent-like application of generative AI in this space of monitoring,
troubleshooting, and probably automatic resolution of incidents. I mean, that is the vision that people, you know, talk a lot about on LinkedIn, but I am yet to see something concrete that, you know, just works, right? So what do you think is the scope there? I personally think that it is not too hard or too far away to have, you know, an agent
help in very specific use cases where a human need not be in the loop and it can be completely autonomous. What is your take on that front, on the SRE front?
Sanjeev Ganjihal (31:41)
Yeah, I think on the SRE front, the job of SREs is kind of fading away, in my opinion. If you look at Google, they're not talking much about site reliability engineering now. There was a phase where site reliability engineering was getting a lot of traction because you're focusing on toil: you need to cut a lot of the repetitive tasks that you do on a day-to-day basis, right? And there were different
SRE terminologies like service level objectives, service level indicators, MTTR; a lot of SRE terminology that Google wrote about in the Site Reliability Engineering book. What I see with AI is that some of that is going to go away. For example, HolmesGPT or K8sGPT, or
look at kubectl-ai. There's going to be a ton of tools, even Amazon Q Developer, right? As long as you have access to your Kubernetes config, basically your kubeconfig, you're good. You can just say: go to that namespace, see if the app is running. Everything is going to evolve into natural language, right? So end users, irrespective of whether they're DevOps engineers, developers, architects, or even managers,
there are even VPs today, right? Eventually what I see is natural language taking over a lot of troubleshooting, visualization, forensics, and security. These aspects will eventually be baked into tools. We are not there yet; we don't have a concrete tool today that replaces a site reliability engineer. We still need humans. This is my take.
Again, I'll talk about the side effects of it as well, the side effects of AI, which are the cognitive aspects. But to your question, SREs will have to work with AI, because their earlier focus was: I have 20 Kubernetes clusters, and my job is to keep them up and running, because
my application teams are responsible for deploying and running their workloads, and my job as an SRE is to provide that platform and keep it running 24/7. That's the goal. But now they need to learn the GPU side of it. They need to learn what inferencing is. How do you solve inference latency problems when you're dealing with large language models and hosting them on Kubernetes?
Do they understand what is catastrophic in terms of downtime? Inference downtime is very expensive. For example, if your applications rely on large language models and you continuously keep making inference calls, and it's down, it's very expensive for the organization. So SREs need to understand those aspects. They also need to train on
whatever they are doing, basically share their knowledge and make the models even better. So there are a lot of things SREs will have to do. They will work with AI for now and make it the best of both worlds: human intelligence and then artificial intelligence.
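As a rough sketch of the natural-language troubleshooting loop Sanjeev describes (in the spirit of tools like K8sGPT or kubectl-ai), here is a minimal Python example that gathers read-only cluster state with kubectl and asks a Claude model to diagnose it. The prompt, model id, and namespace are illustrative assumptions, and a real tool would add far more guardrails:

```python
import subprocess
import anthropic  # assumes the Anthropic SDK is installed and ANTHROPIC_API_KEY is set

def diagnose_namespace(namespace: str) -> str:
    # Collect read-only cluster state; no mutating kubectl verbs are used.
    pods = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "wide"],
        capture_output=True, text=True, check=True,
    ).stdout
    events = subprocess.run(
        ["kubectl", "get", "events", "-n", namespace, "--sort-by=.lastTimestamp"],
        capture_output=True, text=True, check=True,
    ).stdout

    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model id
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "You are helping an SRE. Given this pod list and event log, "
                "identify unhealthy pods and suggest next diagnostic steps.\n\n"
                f"PODS:\n{pods}\n\nEVENTS:\n{events}"
            ),
        }],
    )
    return response.content[0].text

if __name__ == "__main__":
    print(diagnose_namespace("default"))
```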
Rohit (35:06)
So I was also thinking they could invest there. SREs are famous for building a lot of tooling, whether it be for monitoring, log collection, whatever. So I think this could be another set of tooling, agentic tooling, that they could build. Say, for example, deploying and running
Grafana Loki for log collection at scale. That is a hard problem, because there are lots of components and a lot of metrics to keep track of, but in the end, if you can keep it running, it is a cost-effective logging solution and everybody desires it. Yet it probably takes more engineers to maintain than, you know, let's say an Elasticsearch cluster.
So I was wondering if people could invest in writing agentic tooling, maybe very specific to Grafana Loki, that understands Loki inside and out: it has
the documentation made available to it, it knows all the configs, and it has access to Kubernetes, as you mentioned, if you have kubectl and the tools made available to it. It should be able to tweak the configs and adapt to changing situations, because logging is one of those things where you actually have to adapt to the traffic situation or, you know,
business realities or changes in your architecture; you need to keep adapting. These are places where I thought fully autonomous, you know, agentic workflows could replace a lot of the repetitive work that, you know, monitoring engineers or
site reliability engineers would have to do. So I think the tooling we build right now should also have, I mean, we should probably add a dimension of: how can I leverage AI to take the human out of the loop in a lot of this tooling as well?
Sanjeev Ganjihal (36:48)
Yeah, I think you bring up a very good point. What I definitely see is what it takes today to run a lot of stateful applications, be it observability, be it service mesh: site reliability engineers spend a lot of time managing them. Again, coming from a Kubernetes background, initially you were managing Helm charts and deploying them, and we moved from that model to the operator model.
So we have operators which automate a lot of stuff; think of them as agents, to be honest, right? We moved from manually managing those deployments to an operator taking care of end-to-end installation and, to some extent, troubleshooting. But what I definitely see is that the autonomous part needs to work with solutions like MCP today. So you can have an agent,
Rohit (37:21)
Exactly.
Sanjeev Ganjihal (37:44)
and as long as the end product, which can be Grafana, can be Loki, can be Prometheus, has some sort of connectivity, and that is solved by the Model Context Protocol today, you can have the agent rely on or work with the MCP servers that are made available, which have the ability to go check something, run something, talk to the underlying LLMs, do a lot of stuff. So I feel
That is where the industry is heading right now. Again, MCP has its own challenges today in terms of security, in terms of OAuth. There are a lot of security loopholes. We can talk a lot about those loopholes. But what I feel is, as long as you're using it sensibly, not exposing your proprietary data, it's still a good experiment to see how you can bake
in MCP and autonomous agents, and leverage them for end-to-end troubleshooting. And it's not just from a troubleshooting perspective, it's also about scaling. Today we have traditional scaling solutions like HPAs, VPAs, Karpenter. There is no predictive autonomous auto-scaling solution which looks at your workloads and anticipates that
there may be traffic coming in; we don't have anything that does predictive auto-scaling today. Those are some areas where I think SREs, and not just SREs, in general any engineer, would love to focus on problem statements and build these tools using generative AI. I am myself exploring a lot of problem statements and seeing how AI can help me
build some cool stuff, right? And it's still our intelligence doing the majority, maybe 60% of it, I would say. If anyone tells me this is all 100% AI, I would definitely not agree, because without you prompting, without you telling it, and without your insights, it is tough; we haven't achieved 100% AGI, which is human-like behavior.
So it's still you. Think of it like, if you're familiar, and again, I'm going to give a religious aspect to this conversation as well: Arjuna is there, and Krishna is steering the chariot. You are the Krishna who is steering the chariot, right? That's how you're vibe coding with your large language model.
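As a hedged sketch of the predictive auto-scaling gap Sanjeev mentions, here is a minimal Python loop that forecasts load from a recent request-rate series and bumps a Deployment's replica count ahead of time via the kubernetes client. The metric source, capacity constant, and deployment name are illustrative assumptions rather than any existing product:

```python
from statistics import mean
from kubernetes import client, config

REQUESTS_PER_REPLICA = 100.0  # assumed capacity of one replica

def forecast_next_minute(recent_rps: list[float]) -> float:
    # Naive linear extrapolation from recent samples (placeholder for a real forecasting model).
    if len(recent_rps) < 2:
        return recent_rps[-1] if recent_rps else 0.0
    trend = recent_rps[-1] - mean(recent_rps[:-1])
    return max(recent_rps[-1] + trend, 0.0)

def scale_ahead(namespace: str, deployment: str, recent_rps: list[float]) -> int:
    config.load_kube_config()
    apps = client.AppsV1Api()
    predicted = forecast_next_minute(recent_rps)
    desired = max(1, int(predicted / REQUESTS_PER_REPLICA) + 1)
    # Patch only the replica count; in practice an HPA would normally own this field.
    apps.patch_namespaced_deployment_scale(
        name=deployment,
        namespace=namespace,
        body={"spec": {"replicas": desired}},
    )
    return desired

if __name__ == "__main__":
    # Example: request-rate samples from the last five minutes (e.g. scraped from Prometheus).
    print(scale_ahead("default", "web", [80.0, 95.0, 120.0, 150.0, 190.0]))
```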
Rohit (40:27)
So you also touched upon MCPs and the challenges with them. One of the weird, I mean, one of the nitpicks, and probably it's not even a nitpick: especially given that I also dabble a lot on the infrastructure side of things, I somehow feel that some of these MCP offerings, let me just take the example of the AWS one itself, make it easy for customers to indulge in "chatops".
For example, the AWS MCP server includes an EKS subsection, or sub-server. I love that it generates CloudFormation templates for creating a Kubernetes cluster; it doesn't just make API calls and do it. Instead, there is a CloudFormation template, so you can modify the template if you need to and you can tear it down, whatever. It is still following the principles of declarative infrastructure provisioning.
At the same time, the MCP curiously has apply-YAML or patch-Kubernetes-resource sort of functions made available. So I fear that somebody new, somebody from this generative AI era, somebody born into this particular era, would just feel it's okay to let the MCPs apply whatever YAMLs, and then there is no single source of truth, no GitOps per se. They're just sitting in their IDE saying, hey, just update this. Maybe in a dev environment that's fine.
But they definitely do need to follow the principles of having a single source of truth and a progression through environments to production. That sort of discipline might get lost. So for organizations like AWS, while I understand these are reference implementations, and you would expect developers to fork them, customize the templates, customize the tooling that is available, I have a deep fear in my mind that
we might go back to ClickOps, except that now it is ChatOps and we are just asking chatbots to do whatever we want. What is your take there?
Sanjeev Ganjihal (42:22)
No, I think those are great points, because you're using an MCP server and relying on it to do all the apply functions and generative functions. But what I also feel is that you've got to have checks and balances as well. You've got to have static code analysis, you've got to have
linting capabilities, and you've got to do a lot of security analysis. Is there anything sensitive in what you have generated? There need to be all these checks and balances, and also testing, right? You ask it to create those test cases, test it out, and then roll those things into your existing environments. Before that,
let's say you just go and say: I need to deploy a web application with five replicas, maybe 250 millicores of CPU and one gig of RAM. It's going to generate a YAML file for you, and it's even going to create a Docker container image and push it to ECR. It will do all those things for you. But do you have those checks and balances in place today? Are you running it as root?
What is the security context that you're using? Do you have any admission controllers which will validate whether the YAML you're applying is good or not? Those are things that are still evolving, I would say. I'm not saying they're not there today to some extent, but MCP is pretty brand new; Anthropic only just released it. So I think
by this time next year you'll see a lot of advancements, even with MCP. Or you may have something new; I mean, MCP might eventually fade and you'll have some new technology next year. So yeah, there are some gaps, I would say. But your points are spot on, because eventually it's ClickOps, it's just turning into ChatOps. You're chatting in a natural language way, you're just speaking English, and things are getting done.
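As a minimal sketch of the kind of checks and balances Sanjeev describes for generated manifests, here is a Python gate that refuses to apply a YAML file unless a server-side dry run and a simple root-user policy check pass. The file path and policy are illustrative assumptions, and a real pipeline would layer on admission controllers and policy engines:

```python
import subprocess
import sys
import yaml  # PyYAML

def manifest_runs_as_root(path: str) -> bool:
    # Very small policy check: flag Deployments that don't require a non-root user.
    for doc in yaml.safe_load_all(open(path)):
        if not doc or doc.get("kind") != "Deployment":
            continue
        pod_spec = doc["spec"]["template"]["spec"]
        sec = pod_spec.get("securityContext", {})
        if not sec.get("runAsNonRoot", False):
            return True
    return False

def gate_and_apply(path: str) -> None:
    if manifest_runs_as_root(path):
        sys.exit(f"{path}: refusing to apply, runAsNonRoot is not set")

    # Server-side dry run validates the manifest against the live API server first.
    subprocess.run(
        ["kubectl", "apply", "--dry-run=server", "-f", path],
        check=True,
    )
    subprocess.run(["kubectl", "apply", "-f", path], check=True)

if __name__ == "__main__":
    gate_and_apply("deployment.yaml")  # illustrative path
```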
Rohit (44:36)
One of the parts I love about the AWS MCP is that, as much as possible, they, you know, generate CloudFormation templates to do things, so that it is reusable and still in one place. That makes a lot of sense, because we ourselves, in our product,
spit out OpenTofu or Terraform, that sort of thing, from our MCP. You could prompt it to generate these sorts of modules that you can reuse later, and we feel that would be the safest way to do things. To give an analogy on the YAML side, it would be better to have generative AI generate the CI pipeline that actually deploys into Kubernetes, while you still just commit to your GitHub.
You might vibe code the YAMLs, that's fine, but you still have a proper, structured way of delivering things through, you know, dev to staging to production. That would be way, way better.
Sanjeev Ganjihal (45:28)
Yeah, you'd be surprised, I'm using Q Developer for that. My Q Developer will commit a lot of changes, iterate on building net-new YAMLs or Terraform, or even create these Argo-specific manifests, push them to my repositories, and apply those changes in my existing Kubernetes clusters. Again, as a solutions architect, it is important to prototype
whatever you could implement for customers. This is all about leveraging the cutting edge and blending it in to solve customer problems. So this is something I feel is interesting. A lot of developers and DevOps folks will eventually leverage agentic solutions like these, and also generative AI solutions like these, in their day-to-day activities,
with those checks and balances in place, and then they'll be able to use these solutions as their programming assistants, or think of them as pair programmers. It's just a virtual programmer sitting with you, doing good stuff.
Rohit (46:40)
Yeah, so I was also a bit curious about the self-hosted models scene. I have seen a lot of people have problems with data being sent out to publicly hosted LLMs and things like that. So I think that is where, you know, Bedrock, and probably even the
pure raw compute and GPU power that AWS brings to the game, comes in, so that you can deploy and maintain your own models. So when do I do that? Do you expect that in five years' time most enterprises, or most companies, are going to self-host their models, or is it only a niche subset of customers who would need it? And what sort of expertise should one bring in, and what does Amazon offer?
Yeah, lots of questions, but in general, on the self-hosted model scene.
Sanjeev Ganjihal (47:40)
Yeah, it all boils down to the level of knowledge the team members have, the amount of operational complexity they want to take on, and whether they are interested in learning and implementing cutting-edge technologies. These are some traits you need to look for when you want to self-host. So first, maybe you are going completely open source and want to host your own state-of-the-art models. Or you're dealing
with regulated environments where your security teams are telling you that you cannot deploy or leverage managed offerings. Those are some of the areas where you'll focus on building your own AI data centers, as I call them. If you look today at all these large players, I'll again take Meta as an example: they are spending a lot of money on building
AI data centers, and their cash flow is going down because they're spending so much buying those GPUs. Even if you look at Elon Musk: Grok, xAI, is leveraging a lot of GPUs. They're all self-managing or building their own AI data centers. I think there will be a time when some customers will eventually want to do that, so they'll have Kubernetes on their own
on-premise infrastructure, again, if they have data-specific problems. But the good thing about hyperscalers is that along with these models you have different techniques, right? You have something called retrieval-augmented generation; there is cache-augmented generation, CAG; techniques that customers can use. And also, when they deploy in VPCs, those VPCs are pretty much private, right? They're not
shared with other customers. They have their own VPCs, their own subnet CIDRs, their own security groups and network firewalls. So from a cloud hyperscaler perspective, I think it's a myth when you say you'll expose your data to the cloud providers, because not every service is scraping your data. All they're offering is a managed solution, and you control your data.
This is why hyperscalers put out the message that security is a shared responsibility; we are all dealing with the shared responsibility model. So the takeaway is: if you want to self-host, you need to be ready, or have the appetite, to understand the underlying infrastructure that you need. Do you have the expertise? Do you have the time? Do you have the money? And do you also have
a fail-fast approach? Today you may build an AI data center where you want to host your own models, but what if it doesn't go the right way? Can you fail fast? These are key traits one needs to look at before taking the approach of self-hosting their infrastructure. Most organizations would be in a good position looking at hyperscalers because of the elasticity; the compute is available.
Different accelerators are available, because you need GPUs, and if not NVIDIA GPUs, the hyperscalers are building their own chips. For example, AWS has Trainium, right? We have Trainium chips, we have Inferentia chips; these are all custom chips that we develop. And if you look at Google, they have TPUs, Tensor Processing Units. So they have their own kind of accelerators
that they offer to their customers. So there is a lot of good stuff that is available in the hyperscale world. Different hyperscalers, they have different offerings. So it makes it easier to kind of get started. The bigger challenge organizations have is how fast can they start something? So that's where you've got to look at whether you want to do self-hosted versus something that is already made available. Again, not all features will be available like
You cannot expect a particular service to be production ready all the time. Like it may be there with 80 % capabilities. So there will always be gaps in a particular product with some features missing. But hyperscalers eventually will build solutions to address those gaps.
Rohit (52:04)
And would you also expect more customers to adopt something like Bedrock? Right now the OpenAI OSS models are on Bedrock, and probably
the difference between the cutting-edge models and the OSS ones won't be as big. So do you think customers, rather than leveraging the public offering, would tend towards having something on Bedrock and leverage that, just to cover their bases in terms of owning their data and the concerns around PII and sensitive information leaks?
Sanjeev Ganjihal (52:38)
Yeah, they can leverage Bedrock, right? Bedrock is an amazing service, to be honest, because we offer 100-plus foundation models. Customers can come and pick different state-of-the-art models for different use cases: code generation, chatbots, personalization, fraud detection. They can leverage Bedrock as a layer where they use these different foundation models
for their different tasks or activities, all their use cases. So Bedrock is in a very good position to offer that to customers. And again, the good thing is your data is not hosted by Bedrock; you're just interacting with different foundation models. You can bring your own data, and you can have your own guardrails like VPCs and security groups. Those AWS constructs
are still valid for customers. When a particular service is designed, it is important that we follow all the Well-Architected principles, right? Security is always a top priority for our customers, and our services are built in a way that is highly secure. We were one of the hyperscalers
pretty early in this space, and now we have 200-plus services which are widely used by a lot of organizations. That speaks volumes about what AWS has been doing, even in the cutting-edge space: launching Bedrock, launching AgentCore, launching Strands Agents. And if you're familiar with Kiro, we launched that for, what do you call it, framework-driven,
prototype-based development as well. So Kiro is something that customers are liking too.
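As a minimal sketch of calling a Bedrock-hosted foundation model from Python, here is an example using boto3's Converse API. The model id, region, and prompt are illustrative assumptions, and it presumes the AWS credentials in use have Bedrock access enabled:

```python
import boto3

def ask_bedrock(prompt: str) -> str:
    # bedrock-runtime is the data-plane client used for inference calls.
    client = boto3.client("bedrock-runtime", region_name="us-east-1")  # illustrative region
    response = client.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # illustrative model id
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]

if __name__ == "__main__":
    print(ask_bedrock("Summarize the trade-offs of self-hosting LLMs versus a managed service."))
```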
Rohit (54:29)
Yeah.
Yeah, I think Strands Agents is also a really exciting development from AWS. I personally haven't used Q, but I'll definitely give it a try now; I honestly did not know that it uses Sonnet under the hood. It seems exciting. I've used the AWS MCPs and all of that good stuff. Bedrock and Q are next. What else should I try out?
Sanjeev Ganjihal (55:01)
Oh, there are so many things that you could try out. We even launched agentic AI courses, with a lot of AWS brains coming together to help customers, so you can go explore those. We talk about different AI offerings. Again, if you're familiar, we have SageMaker, we have Kubernetes; customers can pick and choose the underlying platform, or they can go with Bedrock.
And Strands also has a runtime layer now, right? AgentCore has a runtime layer, it has memory, it has the gateway capability. So look at it as a new compute offering as well, because you need to host the agent somewhere, right? The AgentCore runtime is where you can run a particular thing for hours; I need to make sure my memory is right, but I think it's eight hours or something that you can still have a particular app running.
Rohit (56:03)
Yeah, so before we close out, what do you think of the future of AIOps specifically? Specifically, how do you expect the roles to change, I mean the different traditional ops roles in an organization? How should somebody adapt and adopt?
And also, what could be some of the pitfalls, like the cognitive decline you already briefly mentioned? What is your personal philosophy around that? That can't be the reason to avoid AI altogether, but I think we all need to find some personal way of making sure we don't lose our skills while also not losing pace by not leveraging AI.
Sanjeev Ganjihal (56:46)
Yeah, I think cognitive abilities, as you rightly called out, are declining. There are a lot of famous universities doing research on this. There is already, I think, a 20% cognitive decline from relying heavily on AI, because you can just go prompt; you're getting answers, but you're not
thinking about whether that answer is right or wrong. So you need to watch for that. The other transformation challenge is, as I mentioned, the roles. You'll see roles evolve from DevOps engineer to maybe AI/ML platform engineer or model operations engineer or AI safety engineer. AI safety is also important, so you'll have some roles around AI safety engineers. There's going to be a
shift in team dynamics as well, what do I call it. You'll have a mix of AI evangelists who say everything needs to be AI. Then there are AI skeptics who feel, hey, AI is just an auto-completion tool, I don't see much value. And then there are AI pragmatists. I'm one of the pragmatists: I believe in using it in the right way,
using it where it is very beneficial. Don't just go and say, do AI everywhere; look for a problem statement, be a pragmatist, and use AI wherever it makes sense. One other thing I want to talk about is the convergence of DevOps, SRE, LLMOps, MLOps, AIOps.
I think these will all eventually rely heavily on AI. You'll see, if not agents, some other framework down the road which will provide autonomous capabilities. Because McKinsey, again, they do thorough research, and what they're saying is that organizational adoption of AI is just 8%, which is
crazy, right? It's just 8%. So there is still 92% of organizations that need to tap into the AI side of it. And I think there was another report predicting agentic AI dying down by '27. So there needs to be caution, because it has its own challenges today when it comes to collaborative tasks, memory sharing, and
handling prompt injection attacks and handling failures. There are a lot of things we need to focus on and evolve in this space, and see whether agents are the right solution. If not, there could be something else that comes down the road. But I definitely see the AIOps space maturing, with more opportunities coming for AIOps engineers down the road, because
you need to know how things work in the AI/ML space, you need to be adaptive, and you need to have an operations background. It's a combination of AI and operations. It's not just about observability; it's about observability, networking, security, machine learning, models. So you need to have a lot of
knowledge in this. One, upskilling is very important, and there are tons of courses. Again, time is running fast, things are moving fast, and for upskilling we have pretty limited time, because we need to focus on family, on career, and on our own well-being as well, which means we need to spend time on things we like to do, playing cricket, playing tennis. So I think in life
you've got to balance that out as well. It doesn't mean you just spend all your time on the cutting edge and sacrifice the precious time which God has given. So I advocate for that as well: dedicate some time to family, dedicate some time to learning new technologies, and also do not forget about living in the moment, enjoying the moments.
These are things that are very important for an individual to do.
Rohit (1:01:13)
I think that's a beautiful note, Sanjeev, to end today's conversation on. There is a lot more to discuss, and we'd love to have you again on the show, maybe in a year's time when the AIOps arena has matured a lot more.
Sanjeev Ganjihal (1:01:29)
Absolutely, I'd love to share whatever happens in the future. I think the possibilities are endless, and we'll definitely see AI shape things for the better. The sad part is there will be some role eliminations for sure, because AI will eventually come for certain roles; software developer jobs are declining because
all these large language models are now writing code efficiently. Similarly, there will be some roles that eventually fade. But that's the natural part of it when you get these cutting-edge technologies. Like how the dot-com era in 2000 changed Web 1.0, then we had 2.0, and we have Web3 now. So disruption will be there, but again, things will continue.
Humans will still have different opportunities to work.
Rohit (1:02:23)
Yeah, so I think humans are good at finding opportunities and building things.
Sanjeev Ganjihal (1:02:27)
I still believe, and don't take this as a religious statement, but again, God's creation is amazing, the human brain is amazing, and the human brain has a lot of analytical capabilities and reactive capabilities in terms of unforeseen circumstances, right? So I think that is something
AI will not be able to give; the human touch, the sense of being human, is definitely not going to be possible with AI.
Rohit (1:03:00)
Thanks, Sanjeev, thanks for all of your insights. We will see you again in some time. It was our pleasure. Thanks, Sanjeev.
Sanjeev Ganjihal (1:03:02)
Absolutely. Absolutely. Thank you very much for hosting. Thank you.