
Platform engineering principles that actually work for teams

Posted by Reuben Dunn . Jun 24.25

Lately, I’ve been thinking about the concept of principles. I coach basketball based on principles rather than set plays and, like basketball, platform engineering is a complex adaptive system.

There are many parallels in the DevOps/PE world – you can’t control what the defence is doing in basketball, just as you can’t always control how a system behaves once the code is running.

Here, we’ve extracted key principles from our platform engineering work – derived from real-world implementations that actually work for teams – so you can coach platform engineering teams just as you would a basketball team.

Align platform and system architecture

One of the most common failures in platform engineering occurs when platforms are built in isolation from the systems they’re meant to support. We regularly encounter organisations where deployment becomes needlessly complex because those responsible for releasing software have no context for how it was built or how it should be deployed and released. This disconnect between development and deployment creates unnecessary friction, technical debt and frustration.

A well-architected platform matches the workloads it supports. For example, there are things that a serverless platform does better than a containerised platform… and vice versa. Development and operational context should inform your platform decisions. Yet too often, we see platform choices made based on comfort zones, existing skills or a local optimisation for a specific group rather than understanding the actual requirements for the broader system.

A good example is configuration management. Instead of writing code that needs to know which environment it’s running in – which may be easier for development teams to get started with – the environment can inject the necessary parameters. This aligns with the 12-factor app principle that configuration comes from the environment, creating a cleaner separation of concerns and reducing deployment variance.
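To make that concrete, here is a minimal Python sketch of the difference – the variable names (DATABASE_URL, LOG_LEVEL) are illustrative, not a prescribed contract:

```python
import os

# Anti-pattern: the code branches on which environment it thinks it is in.
# def db_url():
#     if os.environ["ENVIRONMENT"] == "prod":
#         return "postgres://prod-db.internal/app"
#     return "postgres://test-db.internal/app"

# 12-factor approach: the platform injects the values; the code just reads them.
DB_URL = os.environ["DATABASE_URL"]              # set by the deployment environment
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")  # sensible default, overridable per environment

def connect():
    """Connect using whatever the environment provided - the code never needs to
    know whether it is running in dev, test or production."""
    print(f"connecting to {DB_URL} at log level {LOG_LEVEL}")
```

The same binary can now be promoted unchanged from one environment to the next; only the injected values differ.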

The architecture should be known not only by those who build systems, but also by those who deploy and manage them to get the best economic outcomes for the organisation.

Eliminate variance, but cater for preference

They say DevOps is about eliminating variance and therefore toil, but there are situations where variance can be treated as a preference. As a platform engineering team, winning hearts and minds matters – and you won’t win many if you don’t acknowledge the preferences that help people be productive and stay in the zone.

Harmful variance includes deploying to environments differently – for example, having manual steps in one environment while using infrastructure as code in another. Other examples include inconsistent naming conventions between services, variance in configuration approaches (one service using environment variables while another uses a different mechanism) and inconsistent CI/CD tooling.

However, development preferences like IDE choice or, within reason, programming language selection can be accommodated without compromising platform integrity. From a recruiting and retention perspective, mandating specific tools can damage your hiring brand. The cost of supporting multiple IDEs is relatively low, especially with containerised development environments.

Choose your battles to strike the right balance. Eliminate variance where it creates operational risk or complexity, but cater for preference where the cost is minimal and the benefit to developer experience is high.

Understand trade-offs

Platform decisions often appear straightforward from a single perspective, but reveal hidden complexities when viewed holistically. What seems cost-effective from a procurement perspective, secure from a security perspective or elegant from an architectural one can carry significant hidden costs that emerge over time.

For example, consolidating dev, test and UAT in a single Kubernetes cluster might look economical initially, but can create ongoing complexity in namespace management, configuration variance and a deployment process that differs from production. This can incur hidden costs when teams spend hours managing environment-specific naming conventions and deployment logic that wouldn’t exist with dedicated clusters for each environment. Does accepting this toil and operational risk make sense given the operational cost savings of consolidating downstream environments to a single cluster?
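As an illustration (the names and numbers are made up), this is the kind of environment-aware glue code that only exists because the downstream environments share a cluster:

```python
# Hypothetical helpers that exist purely because dev, test and UAT share one cluster:
# every deployment has to derive environment-specific names and settings.
def namespace_for(service: str, env: str) -> str:
    # prod has its own cluster, so it gets plain names; everything else is prefixed.
    return service if env == "prod" else f"{env}-{service}"

def replicas_for(env: str) -> int:
    # Per-environment tuning that dedicated clusters would express as plain defaults.
    return {"dev": 1, "test": 1, "uat": 2, "prod": 6}[env]

print(namespace_for("orders", "uat"))  # "uat-orders" - a convention every team must remember
```

None of this logic is hard on its own; the cost is that every team must know it, follow it and debug it when it drifts from production.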

Trade-offs become particularly complex in larger distributed systems and when more people are involved. Security teams might mandate specific tooling that appears cost-neutral but creates licensing bottlenecks when the organisation scales. Architecture teams might choose elegant solutions that require specialised knowledge, creating operational overhead for platform teams. Procurement teams might negotiate volume discounts that lock the organisation into tools that don’t integrate well with the broader platform ecosystem.

Understanding trade-offs means trying to uncover these hidden costs early and ensuring decisions account for all perspectives. It means asking not just ‘does this solve our immediate problem?’ but ‘what are the second and third-order effects over time?’. The goal isn’t perfect foresight but systematic consideration of how decisions ripple through the organisation’s socio-technical system.

Have clear boundaries for infrastructure

Domain-driven design principles shouldn’t stop at application code – they should extend to infrastructure. Just as we create bounded contexts for services, we need clear boundaries for infrastructure ownership and lifecycle management.

Infrastructure should be bound to the service that depends on it. This principle simplifies lifecycle management and reduces coordination overhead. When infrastructure lives with the service code, teams can manage versions, updates and configurations as a cohesive unit.

This doesn’t mean every team manages every piece of infrastructure. Shared resources like API gateways or monitoring systems require different ownership patterns. But the principle provides clarity. If a Lambda function needs a DynamoDB table, that table’s lifecycle should be managed alongside the Lambda, not separately by a central team.
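As a minimal sketch of that principle – using AWS CDK in Python as one possible tool, with illustrative names rather than a prescribed implementation – the table and the function that owns it are defined and deployed together:

```python
# Sketch only: the DynamoDB table lives in the same stack as the Lambda that depends
# on it, so both share one lifecycle. Names and asset paths are illustrative.
from aws_cdk import Stack, RemovalPolicy, aws_lambda as _lambda, aws_dynamodb as dynamodb
from constructs import Construct

class OrdersServiceStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Table owned by this service - created, versioned and retired with it.
        table = dynamodb.Table(
            self, "OrdersTable",
            partition_key=dynamodb.Attribute(name="pk", type=dynamodb.AttributeType.STRING),
            removal_policy=RemovalPolicy.RETAIN,  # stateful: keep data even if the stack goes
        )

        handler = _lambda.Function(
            self, "OrdersHandler",
            runtime=_lambda.Runtime.PYTHON_3_11,
            handler="app.handler",
            code=_lambda.Code.from_asset("services/orders"),
        )

        # The environment injects the table name; only this service gets access.
        handler.add_environment("TABLE_NAME", table.table_name)
        table.grant_read_write_data(handler)
```

Changing the table schema, the function code or the permissions between them is now one change set, reviewed and released as a unit.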

Clear boundaries also mean applying good service design principles to infrastructure. An S3 bucket used as a data store shouldn’t be directly accessible to multiple services any more than a database should be. These boundaries might seem obvious for traditional databases but are often ignored for cloud-native resources.
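Continuing the hypothetical stack above, the same boundary applies to a bucket: it stays private, and only the owning service’s function is granted access – other services go through that service’s API rather than reaching into its storage:

```python
        # Continuing OrdersServiceStack.__init__ from the earlier sketch
        # (requires: from aws_cdk import aws_s3 as s3)
        reports = s3.Bucket(
            self, "ReportsBucket",
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,  # never publicly accessible
            enforce_ssl=True,
        )
        reports.grant_read_write(handler)  # only this service's function can touch the bucket
```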

Use meaningful abstractions

Effective platform engineering requires multiple layers of abstraction that serve different constituencies. Resource abstractions handle the technical configuration of individual cloud resources. Organisational abstractions encode how your organisation deploys those resources. Implementation abstractions handle environment-specific variations.

These abstractions stop concerns leaking from one layer into another. For example, organisational concerns should not leak into resource definitions. We’ve seen Terraform modules that hardcode environment names like “test”, “UAT” and “prod”, making them impossible to reuse in organisations with different naming conventions.

Good abstractions also reduce cognitive load. A database module might accept hundreds of parameters, but your organisational abstraction might expose only five, with the rest set to comply with security and operational requirements. This approach provides the flexibility teams need while maintaining consistency and compliance.
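As a hedged sketch of such an organisational abstraction in Python – the parameter names, sizes and baked-in defaults below are assumptions for illustration, not our actual module:

```python
# Illustrative only: teams see five choices; everything the organisation has already
# decided (encryption, backups, tagging, deletion protection) is applied for them.
from dataclasses import dataclass

@dataclass
class TeamDatabase:
    service: str            # which service owns it
    size: str = "small"     # small | medium | large - mapped to instance classes internally
    engine: str = "postgres"
    multi_az: bool = False
    team: str = "unassigned"

def provision(db: TeamDatabase) -> dict:
    """Expand the five team-facing parameters into the full resource configuration."""
    return {
        "identifier": f"{db.service}-db",
        "engine": db.engine,
        "instance_class": {"small": "db.t4g.medium", "medium": "db.r6g.large",
                           "large": "db.r6g.2xlarge"}[db.size],
        "multi_az": db.multi_az,
        # Non-negotiable organisational defaults - deliberately not exposed to teams.
        "storage_encrypted": True,
        "backup_retention_days": 35,
        "deletion_protection": True,
        "tags": {"owner": db.team, "service": db.service},
    }
```

The team asks for “a small postgres database for the orders service”; the abstraction answers with a configuration that is already compliant.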

Add semantic meaning to data and infrastructure

Semantic meaning transforms infrastructure from opaque configurations into communicative systems. This principle extends beyond code into every aspect of platform engineering – commit messages, logging, resource naming and pipeline artifacts. Semantic versioning for IaC module changes provides the same benefits as semantic versioning for application code.

Teams can understand the impact of changes and make informed decisions about when to adopt updates. Combined with immutable releases, semantic meaning eliminates the variance and confusion created by cherry-picking and long-lived branches.
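As a small illustration of the decision this enables, a pipeline (or a human) can classify a module release and decide how to adopt it – the version numbers here are made up:

```python
# Sketch: classify a module release relative to the version a team is currently on.
# Assumes simple MAJOR.MINOR.PATCH strings and that the release is not a downgrade.
def bump_kind(current: str, released: str) -> str:
    cur = [int(p) for p in current.split(".")]
    new = [int(p) for p in released.split(".")]
    if new == cur:
        return "none"   # nothing to do
    if new[0] > cur[0]:
        return "major"  # breaking change - plan a migration
    if new[1] > cur[1]:
        return "minor"  # new capability - adopt when useful
    return "patch"      # fix - safe to pick up automatically

print(bump_kind("2.3.1", "3.0.0"))  # "major"
print(bump_kind("2.3.1", "2.3.4"))  # "patch"
```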

This principle also applies to data flowing through systems. Events should carry sufficient context to be joined and analysed in downstream systems. Rather than dumping data into lakes and hoping for future insight, semantic meaning enables immediate understanding and correlation.
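For example, an event might look like the hypothetical payload below – the field names are assumptions, but each one exists so a downstream system can join, order or trace the event without guesswork:

```python
# Illustrative event payload: schema version, correlation id, stable entity identifiers
# and timestamps give downstream consumers enough context to analyse it immediately.
order_shipped = {
    "event_type": "order.shipped",
    "schema_version": "1.2.0",
    "event_id": "f2a9c1d4-5b7e-4c7a-9e01-3d2f8a6b1c90",  # unique per event
    "correlation_id": "c77e0b12-9a44-4f5d-8a3e-1d6f2b9c0e51",  # ties one customer action together
    "occurred_at": "2025-06-24T02:15:09Z",
    "order_id": "ORD-10482",       # stable key that downstream systems can join on
    "customer_id": "CUST-2291",
    "carrier": "NZ Post",
}
```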

Treat your platform as a product and technologists as your customers

Platform teams that build in isolation and expect adoption through mandate consistently fail. Successful platforms emerge through product thinking – understanding customer needs, gathering feedback and iterating based on real usage patterns. Your technologists are your customers and, like any customers, they have alternatives. They can build shadow solutions, adopt external tools or simply work around your platform. Winning their adoption requires understanding their constraints, pain points and preferences.

This product mindset extends beyond initial development. Platforms require ongoing investment in documentation, support and evolution. Teams that declare victory after initial deployment typically find their platforms abandoned as requirements evolve and alternatives emerge.

Product thinking also means understanding context. Platform requirements differ significantly between organisations. What works for a startup won’t necessarily work for an enterprise. What works for a financial services company won’t necessarily work for a media company. Effective platforms emerge from understanding specific organisational context rather than copying generic solutions.

Person first

Drawing from a basketball philosophy, we apply the ‘person first’ principle to platform engineering. Individual contributors are more than their technical roles – they’re individuals with diverse experiences, perspectives and working styles. This principle recognises that technical solutions must work for real people in real contexts.

The best platform architecture means nothing if it doesn’t account for how people actually work, learn and collaborate. Person-first thinking means building relationships beyond technical interactions. Understanding someone’s broader context – their experience, challenges and goals – enables more effective collaboration and better platform design. It also creates psychological safety that enables honest feedback about platform effectiveness.

Design for safety

Platform safety encompasses both technical safety and operational safety. Technical safety means preventing catastrophic failures through appropriate guardrails. Operational safety means making it difficult for well-intentioned people to make mistakes. We’ve encountered numerous near-misses where an incomplete infrastructure-as-code commit could have impacted production systems. Defining what pipelines should do is crucial, but equally important is understanding what they should not do.

Pipeline policies can prevent such scenarios by recognising when a change would destroy stateful resources and requiring explicit approval or manual intervention. Safety also means making correct usage obvious and incorrect usage difficult. If your platform requires specific folder structures for environment separation, implement mechanisms that enforce those structures rather than hoping teams will remember to set the right parameter.

Effective safety measures are robust enough to prevent catastrophe but flexible enough to accommodate legitimate use cases. This balance requires understanding the actual failure modes in your environment rather than implementing generic safety measures.
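A minimal sketch of the kind of pipeline policy described above, assuming a Terraform-based pipeline (the list of ‘stateful’ resource types is an assumption you would tune for your own platform):

```python
# Scan a Terraform plan (produced with `terraform show -json plan`) and fail the
# pipeline - forcing explicit human approval - if any stateful resource would be destroyed.
import json
import sys

STATEFUL_TYPES = {"aws_dynamodb_table", "aws_rds_cluster", "aws_db_instance", "aws_s3_bucket"}

def destructive_changes(plan_path: str) -> list[str]:
    with open(plan_path) as f:
        plan = json.load(f)
    flagged = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        if change.get("type") in STATEFUL_TYPES and "delete" in actions:
            flagged.append(change.get("address", "<unknown>"))
    return flagged

if __name__ == "__main__":
    hits = destructive_changes(sys.argv[1])
    if hits:
        print("Refusing to continue - stateful resources would be destroyed:", *hits, sep="\n  ")
        sys.exit(1)
```

Running a check like this before `terraform apply` turns a silent catastrophe into a deliberate, approved decision.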

Embrace AI with guardrails

AI tools can solve many of the ‘thousand paper cuts’ that platform engineering addresses. Documentation becomes easier to maintain, scaffolding becomes automated and complex configurations become more accessible. However, AI adoption requires careful consideration of where it adds value versus where it introduces risk. AI excels at reducing toil in development processes but shouldn’t be used for mission-critical or governance tasks where deterministic outcomes are required.

We use AI to generate infrastructure scaffolding, reducing the overhead of our abstraction approach. Previously, creating the multiple layers of abstraction required significant manual effort. AI eliminates this friction while maintaining the human oversight necessary for quality and compliance. The key is keeping humans in the loop for critical decisions while leveraging AI to eliminate repetitive, error-prone tasks.

In short, this approach amplifies human capability rather than replacing human judgment.

In summary…

These principles work in concert to create platforms that genuinely accelerate organisational capability. They’re not independent guidelines but interconnected concepts that reinforce each other. Understanding trade-offs improves decisions about variance elimination. Good abstractions enable semantic meaning. Product thinking ensures safety measures enhance rather than hinder developer experience. Platform engineering represents an evolution in how we think about enabling development teams. By applying these principles thoughtfully, organisations can build platforms that truly serve their mission – accelerating value delivery while maintaining operational excellence.

The goal isn’t perfect platforms but adaptive ones that evolve with organisational needs, while maintaining their core purpose – eliminating friction in the path from idea to customer value.

Reuben Dunn

As HYPR's Principal Cloud and Platform Architect, Reuben helps teams and organisations apply better DevOps stories to build new software products and cloud capabilities. He's also founder and lead organiser of the DevOpsDaysNZ conference.
