The Lean and Antifragile Data Centre

By Martin Geddes, Founder, Martin Geddes Consulting Ltd
September 12, 2016

Cloud is a new technology domain, and data centre engineering is still a developing discipline. I have interviewed a top expert in cloud infrastructure, senior leader and consultant, Pete Cladingbowl of Skonzo. He has a vision of the ‘lean’ data centre and a better kind of Internet for users to reach it. He also has a roadmap for how these can be practically realised. The key is to apply established theories of value flow from more mature industries.

* * *

MG: How did you become an expert in systems of value flow?

Peter Cladingbowl, Senior Leader and Consultant, Digital Infrastructure and Supply Chain, Skonzo Ltd

PC: As a young student engineer I worked in a coal mine in South Africa. There were several of us doing projects, all of which had the goal of improving the productivity of the mine. My task was to produce a maintenance training plan for the conveyor belts that moved the coal from the rock face to the railway truck, which delivered coal to the customer. To justify the additional cost resulting from improved maintenance, I studied the impact of the breakdown of that flow on the mine’s overall performance.

At the end of the course, other student engineers presented their findings on how to improve the cutting at the rock face: doing it faster by looking afresh at the geology, and designing new machines to cut coal. Naturally, the managers were all very interested in how much more coal they could cut by spending money on new machines and geological studies.

My team was last. I put up an overhead projector slide, which calculated the amount of money the company was losing from breakdowns of the conveyor belts. I merely had to look outside the window, watch a breakdown, and see how no coal went into a truck. I then worked out how much throughput was being lost in terms of revenue. When the conveyor stopped, everything behind it did too, so the amount the machines could cut at the rock face was constrained by the rate at which the conveyor belts loaded coal onto the trucks.

Our recommendation was that an improvement to the maintenance regime would have a big reward for only a small cost. The Mine Captain stopped the meeting, called in his managers, and put this into action right away! The capacity of your machines doesn’t matter if you can’t pull the resulting value through the complete system.

This is not the only example. At another mine there was a bonus on blasting gold ore, but the processing plant couldn’t process it fast enough. As a result, the company was paying over-the-odds merely to grow a mountain of rock.

What I brought to the cloud industry, and what helped me to make a valued contribution, were such concepts of flow. In particular, I have learned how to balance flow through systems, and understand the relationship between supply and demand.

MG: In what capacity have you applied this learning in cloud and telecoms?

PC: In the ICT sector I have occupied a number of senior executive roles, including SVP Engineering & Operations at Interxion, and Global Crossing (now Level3) for EMEA. I have been responsible for a breadth of functions: engineering, operations, customer service, and IT. More recently I have consulted to IXcellerate in Russia on data centre design as CEO and founder of Skonzo Ltd.

My route to these senior positions was through being a project leader and advisor for the design, construction and operation of multiple digital supply chains and cloud infrastructures. I have been responsible for the operation of infrastructure all the way from global subsea networks, through 1000 site wide-area networks (WANs), to 100,000 seat hosted VoIP platforms.

This has exposed me to a wide range of problems at the design, build, and operate stages. These occur at every layer of the cloud stack, from pure colocation IaaS (like generators and cooling systems) to hosted software platforms.

MG: In your experience, how have other industries adopted flow-based principles?

PC: Primary industries, like oil and gas, have long known that flow is important, all the way through the system. In the secondary manufacturing sector we have also seen a transformation towards managing flow.

For instance, early in my career while doing precision manufacturing, I found that materials requisitioning software assumed an infinite capacity of machines to process it. This resulted in flow problems that demanded a different way of thinking. So we learned from the Toyota Production System (TPS) about flow and the importance of reducing work in progress (WIP).

I also learnt valuable lessons from studying Kaizen (continuous improvement). One Kaizen expert told me a story of when he was in Japan and took a tour of a factory.

He saw one of the machines and remarked to his host: “I remember the old X51: they produce 50 widgets a minute. We have the new X500 and it does 100 a minute!” The Japanese host quietly mentioned that their “obsolete” X51 had been continually improved and now produced 120 a minute. This underlined to me the importance of continuous improvement. You must never stop improving and continually reducing waste!

Knowing what to improve is something that I learnt from the Theory of Constraints (TOC). Every system has a flow constraint, and improving that improves the entire system. Improving anything else is a waste of time and money.
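The constraint idea can be made concrete with a minimal sketch. It assumes a simple serial chain in which each stage has a fixed rate; the stage names and tonnes-per-hour figures are hypothetical, chosen only to echo the coal mine story above.

```python
def system_throughput(stage_rates):
    """Throughput of a serial chain equals the rate of its slowest stage."""
    return min(stage_rates.values())


def constraint(stage_rates):
    """Name of the stage that limits the whole chain."""
    return min(stage_rates, key=stage_rates.get)


# Hypothetical rates in tonnes per hour, for illustration only.
mine = {"cutting": 500, "conveyor": 300, "rail_loading": 450}

baseline = system_throughput(mine)
# Spending money on faster cutting leaves the system output unchanged.
faster_cutting = system_throughput({**mine, "cutting": 800})
# Improving the constraint itself lifts the output of the whole system.
better_conveyor = system_throughput({**mine, "conveyor": 400})

print(constraint(mine), baseline, faster_cutting, better_conveyor)
```

In this toy model, raising the cutting rate changes nothing, while the same effort spent on the conveyor raises the output of the whole mine.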

Managing the entire system as a set of flows is the key to improving customer satisfaction, increasing throughput and reducing costs. Achieving all three at once is what makes managing flow (rather than capacity) so powerful. This is done by controlling the flows through scheduling and buffer management.

In these other industries product engineering was fairly mature, especially reliability engineering and how to make things safe. What was still developing was the application of these new flow management methods. They give a holistic understanding of systems and the flow through them, by understanding the relationship between supply and demand.

MG: How did you begin to apply these ideas to telecoms?

PC: In the oil industry I had been running wireline networks, both analogue and digital. When based on the oil rig, we collected vast amounts of data from the oil well, processed it, and transmitted it to the customer. The first computer I used was a PDP11 with 2Mbit of memory that was as big as a whole blade server today.

My first data transmission from the North Sea to Houston was at 9600 baud. The importance of computing power and quality network throughput to efficiently delivering product to the customer was ground into me during many long nights slowly transmitting and retransmitting information.

From the oil industry I moved to manufacturing and supply chain management. Here Manufacturing Resource Planning (MRP) computers were predominantly used to decide the production schedule, i.e. what task should be done by whom and when.

The flaw in MRP schedules was that they assumed infinite capacity, so that there was no constraint. We tore those plans up, and gave them Kanban boards instead to visualise workflow and limit WIP for production lines making standard product. We applied TOC’s Drum-Buffer-Rope in the job shops where the product changed frequently. Whilst the MRP system was good for a high level plan of what to order, the software was only as good as the rules of the method, which didn’t manage the flow.

I then worked at Racal Telecom on IP networks and X25. I was tasked with taking a new product to market that did voice and data over one link (thus being ahead of its time). I worked my way through network operations, IT, operational support systems, workflow management and inventory systems. I found that what really mattered was network performance management, i.e. managing the flows.

Then Racal was bought by Global Crossing, and I joined the team managing their new global subsea networks. When Global Crossing entered Chapter 11 bankruptcy we had no capex, and the headcount went down from 14,000 to just a few thousand.

At the time the network had only just been completed and reliability was poor. I set a target of 99.999% availability on DWDM/SDH networks. People thought we were crazy, but we achieved it. It required a continuous improvement methodology: we measured, isolated and managed the faults.
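To put that target in perspective, the arithmetic can be sketched as follows. The only assumption is a year of 365.25 days; under it, "five nines" leaves roughly five minutes of downtime per year.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumes an average year of 365.25 days


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted by an availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} minutes per year")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is why measuring and isolating individual faults becomes necessary at this level.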

That target and approach was then applied to the IP network which is now (via the acquisition by Level3 of Global Crossing) the biggest IP network in the world.

MG: How good is the telecoms industry at managing information flows?

PC: When I came to telecoms I was struck by the lack of established engineering principles. There was little idea of how to measure flow, or how to make a safety case that the flow would be sufficient to meet demand. The telecoms industry hasn’t yet had its lean or total quality management (TQM) revolutions. That means its services are often disconnected from customer outcomes and (billable) value.

The specific problem I found was when we came to IP WANs and managed corporate networks (VPNs). I really struggled to understand how to get the right application outcomes. There were constant battles between the customer and sales, or between IT and the customer. Every time the solution was more capacity, as it was the only tool you had, since you couldn’t see or understand the flows.

There were various attempts to manage flow better—via MPLS, RSVP, diffserv—which helped a bit. Many customers wondered why they didn’t stay with old-fashioned (and expensive) ATM. Others just bought a fat (and expensive) Ethernet pipe plugged into a fat (and expensive) MPLS core. When it came to new applications like VoIP, they built a new (and expensive) overlay network.

So I was always puzzled, and when I started reading your work in the early 2000s, I realized I was not the only one wondering what was wrong. Why was this industry all about (expensive) “speed”, and yet was not making much money, despite its central economic role?

From other industries we knew that the underlying issue was knowing the constraint and using buffer management to manage the flows according to the constraint. When you make capacity the constraint you can only add more of it, not manage it. There were also plenty of clever people in telecoms, many of whom wanted better tools.

The problem then was always the business case, as quality didn’t matter enough. The attitude was to just pay the customer some credits if they complained and sell them more capacity. Until now, there was no real market for quality of service beyond “quantity of service”.

MG: How did you make the leap to applying flow concepts to data centres?

PC: I had the opportunity to build some data centres with Global Crossing and then Interxion, whose customers had high-end demands: their data centre was “mission critical” to them. There were capital projects building new infrastructure. I was working on how the data centre and connectivity related, and the connections and cross-connects at Internet Exchanges (IXs).

Interxion’s customers understood the importance of high reliability and availability. Supplying these needs exposed the relative immaturity of telecoms data centres compared to other industries.

QoS in data centres largely comes down to “is it up or down?”, with the focus on the machines, not the flow of value they enable. Robustness is achieved through (expensive) redundant machines, and improvements in availability are achieved through adding yet more redundancy. This overreliance on capex to solve flow problems should sound familiar by now.

My job at Interxion was to help improve customer service and reduce costs, just as in the other industries I had been in. This means fulfilling value-added demand, and not creating failure demand (like rework in a factory, or retransmits in a network). Being able to differentiate these meant understanding the demands for information flow, and how the supply chain responded to these demands.

Both telecoms and data centres are relatively new industries, deploying complex machines at scale. These machines aim for robustness through redundancy. The standard definition of “success” is that the customer service is not interrupted if you lose an element in the mechanical infrastructure.

Robustness alone is not enough. We also need to be able to anticipate failure, so as to be able to prevent it. We also need to be able to proactively monitor, respond and restore service. If we do these things, we will have good infrastructure that is not just robust, but also resilient.

MG: Why do you say the traditional engineering approach to resilience is insufficient?

PC: Many people will have heard of Nassim Taleb’s “black swan” events, which are outliers. Taleb is in the business of risk assessment, which involves judging the probability of something (bad) happening. Any risk assessment takes the likelihood of an event and its impact, and multiplies them.
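As a rough illustration of the likelihood-times-impact calculation, here is a minimal sketch. The risk names, probabilities and costs are all hypothetical; the point is only that a rare event with a very large impact can rank above frequent, cheap ones.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    annual_probability: float  # chance of occurring in a given year (hypothetical)
    impact: float              # cost if it occurs, arbitrary currency units (hypothetical)

    @property
    def expected_loss(self):
        """Likelihood multiplied by impact."""
        return self.annual_probability * self.impact


risks = [
    Risk("disk failure", 0.5, 2_000),
    Risk("cooling outage", 0.05, 100_000),
    Risk("blocked generator exhaust during grid failure", 0.002, 5_000_000),
]

# Rank by expected loss, largest first.
ranked = sorted(risks, key=lambda r: r.expected_loss, reverse=True)
for risk in ranked:
    print(f"{risk.name}: {risk.expected_loss:,.0f}")
```

With these made-up numbers the least likely event tops the list, so setting it aside because it is "improbable" would ignore the largest expected loss.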

The standard data centre engineering approach is to take things that are high impact but low probability, put them to one side, and not worry about them. These “improbable” things that don’t get addressed then all happen on Friday at 4pm, Sunday around 2am, or when you are on holiday and uncontactable.

These failures can be prevented, and the way you do that is to use the perspective of flow, which is a different perspective to the individual machines. Don’t get me wrong, we still care about the technology, people, and processes. These silos are good and necessary, but they define static forms of quality baked into machines, skills, and methodologies.

The purpose of the system as a whole is to enable flow, so we need to see how these factors interact with their context to satisfy a dynamic quality requirement. From a flow perspective, we want a system that automatically deteriorates in an acceptable way in overload. We know there is variability in demand, and our job is to be able to absorb it (up to a point).
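One way to picture "deteriorating in an acceptable way" is a priority-based admission policy. The sketch below is one possible policy under stated assumptions, not something prescribed here: flow names, priorities, loads and capacities are hypothetical, and the rule is a simple greedy pass in priority order.

```python
def admit(demands, capacity):
    """Admit demands in priority order until capacity is used up.

    demands: list of (name, priority, load); a lower priority number
    means a more important flow. Returns (admitted, shed) name lists.
    """
    admitted, shed = [], []
    remaining = capacity
    for name, _priority, load in sorted(demands, key=lambda d: d[1]):
        if load <= remaining:
            admitted.append(name)
            remaining -= load
        else:
            shed.append(name)
    return admitted, shed


# Hypothetical flows: (name, priority, load in arbitrary units).
demands = [("backup", 3, 40), ("voice", 1, 10), ("video", 2, 30), ("web", 2, 25)]

normal = admit(demands, capacity=120)   # enough supply: everything is carried
overload = admit(demands, capacity=60)  # constrained: least important flows go first

print(normal)
print(overload)
```

When supply falls short, the most important flows keep working and the loss lands on the flows that can best tolerate it, rather than everything degrading at once.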

This ties into resilience, but is different in a critical way. How you increase resilience is a process of continual improvement at human timescales. In contrast, we are managing dynamic quality, often at sub-second machine operation timescales. This is a result of how people, process, and technology all interact.

Taleb puts it nicely when he talks about “black swan” events that have low probability and high impact. He describes “antifragile” systems that get stressed by these outlier events, and learn how to respond appropriately. How did a demand or supply change impact the flow? Does it make any difference, and what’s the right way to adapt?

My opinion is that we must adopt lean and antifragile engineering principles in data centres and networks if we are to meet future customer demands.

MG: Can you give me an example to bring this to life?

PC: These concepts of flow need to be seen holistically, as it’s not just about packet flows. For instance, if you run a data centre too hot, then the cooling equipment can’t cope. If you can’t flow the heat out, you have to turn machines off.

It’s a system which has an environmental context; energy in as electricity; electronics to compute and connect; and then mechanical and heat energy outputs. There are energy flows that connect these machines together. It is a single system with dependent events and variability, just like many other engineering systems.

Let me give you an example of how one seemingly small thing can make a huge set of expensive infrastructure very fragile. Energy flow through a data centre often depends on generators when there is a problem with grid power. During one “flow audit” I did, I was being shown the big shiny generators and fuel tanks, and told how much capacity they had.

But I focus on the flow, not the capacity. Generators need air flowing in and exhaust flowing out, and I followed these flows from the generator room to where they came in and out of the building. There were several hazards to these flows that could very easily block the flow, and could bring a halt to the whole system. These hazards were removed, or monitoring put in place to detect if they became a problem, so the entire system was made less fragile.

Other people would have focused on the robustness of the machines, and their redundancy level. My “lean” eye was on the flow of the system as a whole. Its constraint was these air and exhaust ducts, and dealing with them is what prevented a black swan event from occurring.
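The generator story can be sketched numerically. This is a simplified model that assumes independent failures, and the availability figures are hypothetical: redundant generators are combined in parallel, but all of them depend on one shared air and exhaust path in series.

```python
def parallel(availability, n):
    """Availability of n redundant units where any one is enough."""
    return 1 - (1 - availability) ** n


def series(*availabilities):
    """Availability of a chain where every element must work."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


generator = 0.98     # hypothetical availability of one generator
shared_duct = 0.995  # hypothetical availability of the shared air/exhaust path

two_gens = parallel(generator, 2)
three_gens = parallel(generator, 3)

# Every generator breathes through the same duct, so the duct sits in series.
with_duct_2 = series(shared_duct, two_gens)
with_duct_3 = series(shared_duct, three_gens)

print(f"{two_gens:.6f} {three_gens:.6f} {with_duct_2:.6f} {with_duct_3:.6f}")
```

In this model, a third generator barely moves the result, because the whole arrangement can never be more available than the shared duct. Removing the hazard in the flow path does more than buying another machine.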

MG: Many data centres service Internet users. When you look at today’s Internet, what do you see?

PC: When we look at the Internet as infrastructure, we see significant achievements in what we can do for individual networks and data centres. The problem is the system as a whole has fundamental flaws when it comes to isolating users (and their flows) from each other.

The Internet is a hotbed of innovation with a constant stream of new businesses and services being created. Yet it has some serious limitations in terms of security, scalability and performance. These problems come from its core engineering: a single address space, a lack of sufficient IPv4 addresses, and its “best endeavours” offer of quality.

The Internet we have today is a first generation technology. Why would you expect it to perform better? Basic building blocks like Border Gateway Protocol (BGP) were “designed” on a napkin over lunch. That was great for the need at the time and it has served us very well, but is not good enough for the future. It’s time to move on.

The Internet needs a rebuild. We don’t need to throw it away, but rather need to overlay it with new structures. For instance, a basic structural problem is that when we connect to it, we are forced to be connected to everything. Every device is out there ready to be trolled and attacked. We need new methods so that we can be connected to just one place.

MG: What might a future Internet look like?

PC: We see with Recursive InterNetwork Architecture (RINA) a new kind of internetwork, one that naturally aligns to social units. There is the subnet of the individual’s body and personal area network (PAN), their home and family, then the school or workplace. These software-defined subnets are protected in RINA: you can’t get in without being invited.

You can connect to more subnets at wider scopes as needed, more like how people always have connected in the physical world. New architectures like RINA mean we can move back to protecting ourselves, and our friends and families. These are the true “social networks”, not from an application perspective, but reflecting social and geographic connections between people.

A real Internet is an internetwork of independent subnetworks, at all scales. It should not be lots of networks glued edge-to-edge into one “flat” ubernetwork, and all forced into a protocol monoculture. The result with the Internet is an atrocious security situation: a child on WiFi is exposed to everything in the world, all of the time, with no choice or control.

Soon you will have a PAN over you and inside you, to monitor your health and wellbeing. You don’t want that to connect to any and every LAN as you walk down the street. There has to be some security around which access points it associates to and can route through. That means we need to have a new fundamental organising unit that has built-in security.

I believe that the Internet of the future will have strong perimeter security aligned to our physical and social reality, with layers of security to protect us from attackers. This is a proactive security, different from a reactive firewall. Today you are connected to everything until you disconnect. That model has to change.

In future you will decide whether to connect your PAN to your family network. Going wider, you can choose to link your family network to your sports club one, to your school, to your company. You can even connect it to Google’s security domain, no problem. But then you (or your avatar) decide whether to establish an association with the Web page server that Google has proposed. It needs to be an opt-in for a packet flow to even be possible, so that control is with the end user, not the service or content provider.

This new paradigm means we have to start thinking of network security as being inseparable from physical security. Today we are happy to connect to the Internet, but that’s like handing the keys to your digital front door to everyone in the world. We can’t fix things by adding layers of protection to the “rooms” inside the virtual building.

MG: Security needs to be designed-in. What else needs to change?

PC: There will be other parallel changes we see. For instance, we are running out of IPv4 addresses, and IPv6 routing isn’t going to scale due to fundamental design problems. So the core of the current Internet is going to crash due to router table size growth, and we’re seeing it already happening.

The way we manage quality will also change. At any one instant there are a gazillion “black cygnets” of buffers filling inside the Internet, and you can’t see them. The result is an embarrassment from an engineering perspective: we can’t do basic things like reliable voice and video. Whilst we can send a probe to Mars, we can’t flow a reliable video stream to a house. Who doesn’t suffer from the “mother buffer”, that circle counting down the time needed to fill the buffer?

The places from which we manage these flows will also shift from being purely “edge” based. We will see content delivery network (CDN) caches, home gateways and media servers take a greater role, for instance, in orchestrating flows by integrating the compute, content and control.

The idea of one monolithic Internet is also disappearing fast. With new architectures like RINA we can attach to any number of overlay (inter)networks simultaneously. There will be content delivery networks which define their own subnets. Communities will build private ad-hoc networks. There will be different types of transmission capacity used as building blocks, not just fixed broadband lines and cellular connections.

In the future we will connect to parts of the world for a period of time for a purpose, not all the time to everyone. Integrity, availability, security, performance, privacy and confidentiality will become engineered-in, rather than afterthoughts.

MG: This all sounds like a radical change to cyberspace! Where might it lead us?

PC: Let’s take an example. In RINA there is the idea of a “virtualised communications container” called a Distributed Inter-Process Communication Facility (DIF). As a rough approximation, what a hypervisor does for cloud computing, this does for cloud communications. The control over these DIFs (a kind of software-defined subnetwork) will become much more important.

These virtual subnets define a new kind of cloud geopolitics, with its associated power games, shifting alliances and borders. You can think of it as being the digital version of the emergence of nation states, where we had kings who conquered countries, so they could rule more territory. This “virtual” terrain, defined by DIFs, likewise offers an extremely powerful organising mechanism that will allow individuals and communities to self-organise rather than be connected by corporations.

MG: If we don’t adopt new “lean” and TQM techniques for the Internet, what will happen?

PC: We are already seeing lots of problems with the Internet being used to access cloud applications. Users can’t tell which broadband or cloud service is right for their needs. When things aren’t right, it is very hard to isolate faults. Service providers can’t determine operational tolerance levels: they don’t know if they are under-delivering quality, or over-delivering.

There are several underlying causes. From a technology perspective, we have problems of measuring and managing flow at short timescales. These “high-frequency trading” issues particularly affect high-value transactions and applications.

We are also riding on the back of networks that connect clouds together. The way that the Internet is designed, with a single address space, means these cannot function well enough. This is especially true with video sucking more resources up. We also have a lot of computing power in the wrong place: if you use Facebook in Mongolia, it is served from Dublin.

From a commercial viewpoint, service providers don’t have enough skin in the game, so don’t feel the pain when their customers’ applications aren’t secure or don’t perform. They lack the necessary understanding of the relationship between the technology and customer value.

Because of this, telcos use contract terms to hide their engineering failure. Breakage of the service level agreement is treated as a cost of sale.

MG: What do we need to do to improve the situation?

PC: When we move our applications to the cloud, we need to share whatever resources we have in the network and data centre. At the moment we can’t multiplex it all as we’re so poor at scheduling. To improve the situation, we have little choice but to schedule resources better. The core idea of “lean” is to manage flow to meet customer needs in a sustainable manner.

We simply can’t keep throwing more and more capacity at scheduling problems, or build a CDN or overlay network for every application. Our current path results in unsustainable economics for the cloud. Data centres are already overtaking aircraft as one of our biggest energy consumers. The universe we live in doesn’t scale the way people assume and hope.

We leave the customer experience to chance, and then get bitten by frequent “black cygnets” and beaten up by occasional “black swan” events. The result of our engineering failure is a whole industry sector dedicated to cleaning up the mess.

For example, WAN optimisation is a $20bn industry that should not exist. They are doing the traffic scheduling and shaping that operators themselves failed to do. The cost of enterprise application performance monitoring runs into billions, much of which should not be needed.

Twenty years ago we saw the “rise of the stupid network”, and that now needs to go into reverse. We need a new kind of network where there is AI inside driving resource scheduling. This is focused on delivering user outcomes, and making the best use of scarce resources.

MG: If we do adopt new approaches, what might they look like?

PC: What needs to be different is clear: we need a demand-led approach that applies the quality management principles established in other industries. We need to think about flows, resources, supply/demand balancing, bottlenecks and trade-offs.

We also have much to learn from the military, who have Information Exchange Requirements (IERs). There need to be tighter flow “contracts” between the supply and demand side of both telecoms and data centre services. You might think of this as the “Intercloud”, where IXs move from peering connectivity to providing managed performance along supply chains.

For service providers to move forward they must define the user experience they aspire to support. This means being fit for purpose: doing the failure modes analysis, managing and mitigating the failures, and having the right service level agreements to restore service. They need to better understand what is genuinely under their control, versus being third party, and whether the latter can be trusted.

As enterprises and consumers move to the cloud there has to be a more robust due diligence process for application deployment. Buyers need new and better cloud comparison services to help them select the right computing and communications offers, and to configure them to deliver the desired outcome.

The “smart network” (or data centre) of the future will also reflect broader trends towards machine learning. We have clever machines playing chess, or getting close to passing the Turing Test to emulate human conversation. What tends to grab the headlines are these examples of application-level machine learning. Your phone might know if an email is worth vibrating for, based on context, location, calendar.

However, AI is a whole host of things, often mundane. Rather than “supercomputer AI” pretending to be a person, we will have different types of AI for different circumstances. It’s not like Skynet, even if that is a worry; it’s more like the “fuzzy logic” in your washing machine, or a smart fridge that knows usage patterns, and reminds you to buy milk on the way home.

The network of the future will also look at patterns, and those will drive different choices. There will be network-driven UX intelligence—say “smart hunt groups” in the context of voice, or detecting suspicious behaviour around your home. A lot of the application of machine learning will be for ordinary things like packet processing. What autocorrect does for your typing, the network will do for class-of-service assignments for data flows.

MG: What are the potential benefits of a new “lean” approach to networks and data centres?

PC: There is a tremendous opportunity for organisations to collaborate and coordinate better. We can radically improve value flow internally, as well as between customers and suppliers. The result is greatly reduced waste throughout the whole economy.

The future of digital supply chains will reflect the way we have built more sustainable physical supply chains. Principles of “just in time” and “right first time” were enablers of new “pull” systems; these have yet to be applied in telecoms and data centre design.

These concepts came out of the experience of the 1950s. There was a huge surplus production capacity in the US from wartime, and industry shifted to making consumer goods. Back then, enterprises had a pre-stocked product to sell, so had to create ads to stimulate demand to sell the goods they had (over-)produced.

Telecoms today is very much a “push” industry, with similar supply-side economics. The core belief is selling more capacity is the route to progress and growth. What should be driving growth is a balance of flow between supply and demand. When we balance flows, the supply chain becomes more sustainable.

Ultimately the limit we face is energy flow management, and a sustainable resource model for a finite world. Ideas like “net neutrality” work against optimal flow, demanding unbounded generation of input resources. The constraints of the world force us to face up to issues of sharing and scheduling resources better.

If we solve these flow problems, we can begin to re-imagine what kind of society we would like our children and grandchildren to inhabit. There is a real possibility of using technology to achieve fuller employment and a better life. By reducing the impact of networks and data centres on the planet and its resources, more people can specialise in whatever brings joy to other people, be that growing bonsai trees, or surfing at the beach.

Martin Geddes, Founder of Martin Geddes Consulting Ltd, provides consulting, training and innovation services to telcos, equipment vendors, cloud services providers and industry bodies.
