      Author's Profile: SRE expert at Huawei Cloud SRE Lab, working from the Switzerland Research Center. More than 10 years of experience in building a wide range of reliable systems, including embedded, cloud-native and base cloud services. Specialized in stabilizing unstable systems and demystifying heisenbugs in large systems.


      The cloud is “hardware” for its customers. Just as they expect an HDD to work, they rely on the cloud to do the right thing all the time. But nothing is perfect and everything breaks all the time; we just live with it. How to deal with a risk depends on the risk itself and on the cost. A hard drive has about a 1% chance of dying during a year. That’s just a fact, same as a 50 MiB/s read speed or a 10ms seek latency. During the design phase, a system architect uses these facts and the system requirements to build a system, balancing various costs (such as complexity, build time, maintenance costs and potential risks).


      That means that knowledge of the critical characteristics of HUAWEI Cloud services is essential to users. These characteristics are the core of a cloud design: users build their systems with these promises in mind. That leads to two consequences: the set of promises is the reason users choose you, and how well the promises are kept is what retains them. A promise of 99.9999% uptime may attract a lot of users, but they will then expect that the system won’t be down for even one minute in a year (99.9999% allows roughly 32 seconds of downtime per year). Longer downtime would make customers unhappy, and they would lose trust not just in this one product but in the whole Cloud and in Huawei. User trust is the most valuable currency in the world.


      By declaring realistic promises, we empower our users to build the systems they want, with the characteristics they need. We publish our promises as the HUAWEI Cloud SLA. Our users are free to build their products on top of our Cloud and expect that, within the given parameters, systems will behave as promised. It is our responsibility to ensure that, no matter what we do, the system works as expected.


      The SLA is the top of the observability pyramid. It is the subset of system guarantees that we are ready to promise to external users and back financially. But how do we know that the system is working as expected? How do we give another service the information it needs to build on top of our system? That’s where the universal communication language of SLOs comes in.


      The observability pyramid is based on measurable metrics (events, time series) of system behaviour. Every system should have them defined, and if no metric measures what matters to users, the monitoring strategy needs refinement.



      The metrics that express what matters most for the system get promoted to SLIs: Service Level Indicators. These metrics should be central to the system’s monitoring; they should be published and available to support engineers. They could also be published to users. Often users can collect these metrics themselves (e.g. virtual machine uptime is clearly available from the machine itself, and availability can be measured by an external service), so these metrics are externally verifiable and thus should be open and transparent. The validity of verifiable metrics is also important for increasing overall trust in the system and in the cloud as a whole.


      Once you have SLI metrics, you should define the level of each metric that you aim to keep. The chosen levels are your SLOs: Service Level Objectives. E.g. while measured availability may be 100%, your SLO may be only 99.9% over a month, so if the service becomes unavailable for 5 minutes, your SLO is still met (0.1% of a 30-day month is about 43 minutes). Or, if you want request latency to stay under 40ms for the vast majority of requests, some slowness during peak load still keeps you within the budget.



      Whenever you measure your indicator, you get just a single point. An SLO should not only set the level, but also fix the time period for the computation. “Uptime is 99.99%” is underspecified and not measurable. A nearly complete SLO definition is “uptime is 99.99% over 1 month”. To remove the remaining ambiguity, say “uptime is 99.99% over any 30 days”.
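
The difference between a fixed period and “any 30 days” can be shown with a small sketch. It assumes downtime is recorded as seconds per day, which is a simplification; the function name and the sample history are illustrative and not taken from any Huawei Cloud tooling.

```python
from typing import Sequence


def worst_window_downtime(daily_downtime_s: Sequence[int], window_days: int = 30) -> int:
    """Return the largest total downtime in any run of `window_days` consecutive days."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    if len(daily_downtime_s) <= window_days:
        return sum(daily_downtime_s)
    # Slide the window one day at a time, adding the new day and dropping the oldest.
    window = sum(daily_downtime_s[:window_days])
    worst = window
    for i in range(window_days, len(daily_downtime_s)):
        window += daily_downtime_s[i] - daily_downtime_s[i - window_days]
        worst = max(worst, window)
    return worst


# Hypothetical 60-day history: two 200-second outages on adjacent days
# that fall on either side of a fixed 30-day block boundary.
history = [0] * 60
history[29] = 200
history[30] = 200

fixed_blocks = [sum(history[0:30]), sum(history[30:60])]
print(fixed_blocks)                    # downtime per fixed 30-day block
print(worst_window_downtime(history))  # downtime in the worst rolling 30-day window
```

In this made-up history each fixed block sees 200 seconds of downtime, while a rolling window that contains both outages sees 400 seconds. A 99.99% objective allows about 259 seconds per 30 days, so the same history passes one reading of the SLO and fails the other, which is why the period needs to be spelled out.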


      Once you set the SLO level and the measurement period, you have an SLO budget. E.g. 0.01% of an average month (about 30.44 days) is 4m23s; over exactly 30 days it is about 4m19s. As long as your observed downtime over the last 30 days stays below the matching budget, your service is fine and meets the SLO.
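
The budget arithmetic is simple enough to sketch in a few lines. The helper names below are illustrative assumptions, and the "average month" is taken as one twelfth of a 365.25-day year.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta) -> timedelta:
    """Time the service may be 'bad' within the window while still meeting the SLO."""
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("slo_percent must be between 0 and 100")
    allowed_fraction = 1.0 - slo_percent / 100.0
    return window * allowed_fraction


def slo_met(observed_downtime: timedelta, slo_percent: float, window: timedelta) -> bool:
    """True while observed downtime fits within the budget for the window."""
    return observed_downtime <= error_budget(slo_percent, window)


AVERAGE_MONTH = timedelta(days=365.25 / 12)  # assumption: about 30.44 days
THIRTY_DAYS = timedelta(days=30)

print(error_budget(99.99, AVERAGE_MONTH))  # roughly 4m23s
print(error_budget(99.99, THIRTY_DAYS))    # roughly 4m19s
print(slo_met(timedelta(minutes=3), 99.99, THIRTY_DAYS))
```

The same function covers other objectives: 99.9% over 30 days gives a budget of about 43 minutes.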


      The top of the pyramid, the SLA, brings everything together and also defines the compensation in case of violation. For example, the Huawei Cloud Service Level Agreement for Elastic Volume Service first defines the SLI, “Service Unavailability”, as “Service Unavailability means the Elastic Volume Service (EVS) system log shows that EVS was inaccessible for one (1) or more consecutive minutes because of Huawei Cloud”, then sets the SLO: “Huawei Cloud will use commercially and technically reasonable efforts to make the Service Availability Rate of ELB per Service Cycle not less than 99.95%”, and finally gives the guarantee: “If we fail to meet the commitment above on ELB Service Availability Rate, Huawei Cloud will provide you with the below service credit”.


      Ultimately, the SLO budget left is the single number that describes the quality of a service. It doesn’t matter what your service is (encryption key storage, a database or a VPN service) or what the objective is (99% availability over a month or 99.999% durability over a year): the percentage of SLO budget left lets you choose between prioritizing new business features and stability fixes for the next sprint. You probably don’t want to do a non-critical update to your system if you only have 10% of the budget left, but it’s fine to take some risk and try to deliver an additional feature if you still have 99% of the budget left.
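
As a rough sketch of how the remaining budget can drive sprint planning: the 10% threshold below comes from the example in the text, while the 50% "caution" threshold, the function names and the returned labels are assumptions made up for illustration.

```python
def budget_left_percent(budget_seconds: float, consumed_seconds: float) -> float:
    """Share of the SLO budget that is still unspent, as a percentage (never below 0)."""
    if budget_seconds <= 0:
        raise ValueError("budget_seconds must be positive")
    left = max(budget_seconds - consumed_seconds, 0.0)
    return 100.0 * left / budget_seconds


def sprint_focus(left_percent: float,
                 freeze_at_or_below: float = 10.0,   # from the text: avoid non-critical updates
                 caution_below: float = 50.0) -> str:  # hypothetical middle ground
    """Map the remaining budget to a coarse planning hint."""
    if left_percent <= freeze_at_or_below:
        return "stability fixes only"
    if left_percent < caution_below:
        return "low-risk changes"
    return "new features"


print(sprint_focus(budget_left_percent(260.0, 234.0)))  # 10% of the budget left
print(sprint_focus(budget_left_percent(260.0, 2.6)))    # 99% of the budget left
```

Because the input is a percentage of budget rather than a raw metric, the same rule can be applied to an availability SLO and a durability SLO alike.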


      When a system has several APIs with different SLOs, each of them has its own budget. It is possible and feasible to make separate decisions based on each individual SLO budget left, but it is advisable to also define a combined metric that shows overall service health. The combination takes weighted inputs, so that instability in less critical or known beta features does not dominate the overall health view (e.g. exhausting 80% of the VM creation budget is more critical than exhausting 80% of the VM snapshotting budget).
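
One simple way to build such a combined metric is a weighted average of the per-API budget left. This is only a sketch: the weights (5 for VM creation, 1 for snapshotting) are invented for the example, and a real service might prefer a different combination rule.

```python
from typing import Mapping, Tuple


def combined_budget_left(per_api: Mapping[str, Tuple[float, float]]) -> float:
    """Weighted average of budget left.

    `per_api` maps an API name to (budget_left_percent, weight).
    """
    total_weight = sum(weight for _, weight in per_api.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(left * weight for left, weight in per_api.values())
    return weighted / total_weight


# Hypothetical weights: VM creation matters five times more than snapshotting.
creation_exhausted = {"vm_create": (20.0, 5.0), "vm_snapshot": (100.0, 1.0)}
snapshot_exhausted = {"vm_create": (100.0, 5.0), "vm_snapshot": (20.0, 1.0)}

print(combined_budget_left(creation_exhausted))
print(combined_budget_left(snapshot_exhausted))
```

With these weights, burning 80% of the VM creation budget pulls the overall figure far lower than burning 80% of the snapshotting budget, which matches the priority described above.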



      The set of SLOs for a service should be designed to highlight the service’s strengths and limitations. While a service may have unique requirements, there are a few common SLOs:


      1.    Availability: uptime, reachability, you name it. It shows how much users can rely on the fact that, whenever they want to connect or send a request to the system, it will respond. The better the system’s availability, the less users need to worry about handling connection errors, timeouts etc. when accessing your system, but the more you need to worry about potentially disrupting customers with any change on your side.


      2.    API success rate: defines system reliability, setting expectations on how much users need to worry about graceful degradation of their product when the system misbehaves. It’s important to keep in mind that an error returned for invalid input is a valid result, but an error returned for valid input is a system error. Success rate should be defined as the rate of expected answers, not just the rate of HTTP 200 responses over all input.
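
The distinction between "expected answers" and "HTTP 200 responses" can be sketched as follows. The `Response` record and its `input_valid` flag are hypothetical; in practice the validity of the input has to come from the service's own request validation. Treating an accepted invalid request as unexpected is an additional assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Response:
    input_valid: bool  # hypothetical flag: was the request well-formed and permitted?
    status: int        # HTTP status code that was returned


def is_expected(resp: Response) -> bool:
    """Valid input should succeed; invalid input should be rejected with a client error."""
    if resp.input_valid:
        return 200 <= resp.status < 300
    return 400 <= resp.status < 500


def success_rate(responses: Sequence[Response]) -> float:
    """Fraction of responses that were the expected answer."""
    if not responses:
        return 1.0  # assumption: no traffic means nothing went wrong
    return sum(is_expected(r) for r in responses) / len(responses)


def http_2xx_rate(responses: Sequence[Response]) -> float:
    """Naive rate that only counts successful status codes."""
    if not responses:
        return 1.0
    return sum(200 <= r.status < 300 for r in responses) / len(responses)


traffic = (
    [Response(True, 200)] * 8     # valid requests answered correctly
    + [Response(False, 400)]      # invalid request rejected: expected
    + [Response(True, 500)]       # valid request failed: system error
)
print(success_rate(traffic), http_2xx_rate(traffic))
```

In this sample the naive 2xx rate penalises the service for a user's malformed request, while the expected-answer rate counts only the genuine system error.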


      Every data storage system (database, blob storage, filesystem and so on) should also define:


      3.    Durability: the guaranteed percentage of data that will still be correctly available after a significant period of time (for a cache system the percentage could be low or the period short, but for permanent storage the percentage should be high and the period long).

      4.    Access speed: the minimal read/write speed for a single operation or user.


      The trickiest metrics to define SLOs for are latency metrics. As the complexity of a system and its dependencies grows, it becomes harder to predict and guarantee its behaviour. When a system doesn’t set performance or latency expectations, users just run their own tests and use the measured performance as a baseline, which leads to wrong decisions and makes users unhappy every time your system’s performance deviates for their use case. While it’s easy to provide an average response time, it usually doesn’t say much about how the system will behave under load: knowing that the system will answer in 100ms or less for 99% of requests is better than knowing that the system has a 50ms average response time. A typical way to work around this is to set latency expectations in a fine-grained way:


      5.    API latency: sets latency for trivial or size-bounded requests;

      6.    Latency buckets: as an alternative, latency can be defined per bucket of request or response data size;

      7.    Stage latency: defines performance in terms of a sequence:

      a.    Operation start latency (parse the request, enqueue it, time until the first byte is read etc.);

      b.    Processing bandwidth;

      c.    Operation end latency (commit the last byte, close the file etc.).
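
To illustrate why a percentile says more than an average, and how latency buckets work, here is a minimal sketch. It uses the nearest-rank percentile definition; the bucket boundaries, names and sample data are all made up for the example.

```python
import bisect
import math
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Sequence, Tuple

# Hypothetical size buckets: upper bounds in bytes and their labels.
BUCKET_UPPER_BOUNDS = [64 * 1024, 1024 * 1024]
BUCKET_NAMES = ["<=64KiB", "<=1MiB", ">1MiB"]


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def bucket_for(size_bytes: int) -> str:
    """Pick the first bucket whose upper bound is not below the request size."""
    return BUCKET_NAMES[bisect.bisect_left(BUCKET_UPPER_BOUNDS, size_bytes)]


def p99_by_bucket(samples: Iterable[Tuple[int, float]]) -> Dict[str, float]:
    """Group (size_bytes, latency_ms) samples by size bucket and report p99 for each."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for size_bytes, latency_ms in samples:
        grouped[bucket_for(size_bytes)].append(latency_ms)
    return {name: percentile(latencies, 99) for name, latencies in grouped.items()}


# Made-up samples: mostly fast small requests, a couple of very slow ones,
# and a handful of large requests.
small = [(1024, 20.0)] * 98 + [(1024, 1500.0)] * 2
large = [(2 * 1024 * 1024, 900.0)] * 10

print(mean([latency for _, latency in small]))  # the average hides the slow tail
print(p99_by_bucket(small + large))             # p99 per size bucket exposes it
```

In this sample the small requests average just under 50ms, yet the slowest 2% take 1.5 seconds, which only the percentile reveals. Bucketing by size keeps the large requests from distorting the expectation set for the small ones.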

      Systems that have or provide long-running operations should be very careful when setting latency or processing bandwidth SLOs, as it is usually very hard to find relevant and strongly correlated characteristics of requests on which to base latency boundaries. But users will still try to rely on observed system behaviour for their queries and requests. In practice that means the system should at least define a worst-case expectation (e.g. “1GB of data upload should take less than 5 minutes”) and otherwise focus on providing stable performance. E.g. you may not be able to define processing performance for inputs of varying complexity, but you can maintain similar performance across the various CPU platforms you use.
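
A worst-case expectation like the upload example can be related to the staged view of latency (start latency, processing bandwidth, end latency) with a simple additive model. This is a sketch under stated assumptions: the model itself, the function names, and the stage values (0.5s start, 4 MB/s, 1s end, 1GB taken as 10^9 bytes) are illustrative and not measured from any real service.

```python
def worst_case_seconds(size_bytes: int,
                       start_latency_s: float,
                       min_bandwidth_bytes_per_s: float,
                       end_latency_s: float) -> float:
    """Upper bound on operation time: start stage + transfer at minimal bandwidth + end stage."""
    if min_bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return start_latency_s + size_bytes / min_bandwidth_bytes_per_s + end_latency_s


def required_bandwidth(size_bytes: int,
                       deadline_s: float,
                       start_latency_s: float,
                       end_latency_s: float) -> float:
    """Minimal sustained bandwidth (bytes/s) needed to finish within the deadline."""
    transfer_time_s = deadline_s - start_latency_s - end_latency_s
    if transfer_time_s <= 0:
        raise ValueError("start and end latency already exceed the deadline")
    return size_bytes / transfer_time_s


ONE_GB = 10**9          # assumption: decimal gigabyte
DEADLINE_S = 5 * 60     # "less than 5 minutes" from the example above

estimate = worst_case_seconds(ONE_GB, 0.5, 4_000_000, 1.0)
print(estimate, estimate < DEADLINE_S)
print(required_bandwidth(ONE_GB, DEADLINE_S, 0.5, 1.0))
```

Working backwards from the deadline, as `required_bandwidth` does, gives the processing bandwidth the service has to sustain for the worst-case promise to hold.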


      By finding the right metrics and setting the right goals for them, you tell your customers what your system can and can’t do, you shape their expectations, and you enable them to build anything they can imagine on your platform without hurting your system operations or undermining users’ trust. Thus, SLOs are a universal language in which system architects communicate with each other, empowering and promoting the best possible designs and meeting everyone’s goals as simply as possible, but not simpler.


