Shaping reliability via SLO, by Anton Fedorov, expert at the Huawei Cloud SRE overseas research lab
Source: Huawei Cloud SRE Deterministic Operations Special Issue, No. 3
Author’s profile: Anton Fedorov is an SRE expert at the Huawei Cloud SRE Lab, working from the Switzerland Research Center. He has more than 10 years of experience building a wide range of reliable systems, including embedded, cloud-native and base cloud services, and specializes in stabilizing unstable systems and demystifying heisenbugs in large systems.
Cloud is “hardware” for its customers. Just as they expect an HDD to work, they rely on the cloud to do the right thing all the time. But nothing is perfect and everything breaks all the time; we just live with it. How to deal with a risk depends on its likelihood and its cost. A hard drive has about a 1% chance of dying within a year. That’s just a fact, the same as a 50 MiB/s read speed or a 10 ms seek latency. During the design phase, a system architect uses these facts and the system requirements to build a system, balancing out various costs (such as complexity, build time, maintenance costs and potential risks).
That means that knowledge of the critical characteristics of HUAWEI Cloud services is essential to users. These characteristics are the core of the cloud design: users build their systems with these promises in mind. That leads to two consequences: the set of promises is the reason users choose you, and how well the promises are kept is what retains them. A promise of 99.9999% uptime may attract a lot of users, but they will expect the system not to be down for even one minute in a year. Longer downtime will make customers unhappy, and they will lose trust not just in this one product but in the whole Cloud and in Huawei. Users’ trust is the most valuable currency in the world.
By declaring realistic promises, we empower our users to build the systems they want with the characteristics they need. We publish our promises as the HUAWEI Cloud SLA. Our users are free to build their products on top of our Cloud and expect that, within the given parameters, the systems will behave as promised. It is our responsibility to ensure that, no matter what we do, the system works as expected.
The SLA is the top of the observability pyramid. It is the subset of system guarantees that we are ready to promise to external users and back financially.
But how do we know that the system is working as expected? How do we give another service the information it needs to build on top of our system? That’s where the universal communication language of SLO comes in.
The observability pyramid is based on measurable metrics (events, time series) of system behaviour. Every system should have them defined, and if no metric measures what matters to users, the monitoring strategy needs refinement.
The metrics that express what matters most for the system get promoted to SLIs: Service Level Indicators. These metrics should be central to the system’s monitoring, and they should be published and available to the support engineers. They could also be published to users. Often users can collect these metrics themselves (e.g. virtual machine uptime is clearly available from the machine itself, and availability can be measured by an external service), so these metrics are externally verifiable and thus should be open and transparent. The validity of verifiable metrics also matters for increasing overall trust in the system and in the cloud as a whole.
Once you have SLI metrics, you should define the level of each metric that you aim to keep. The chosen levels are your SLOs: Service Level Objectives. E.g. while measured availability may be 100%, your SLO may be only 99.9%, so if the service becomes unavailable for 5 minutes in a month, your SLO is still met. Or, if your SLO is for request latency to be under 40 ms, some slowness during peak load can still keep you within the budget.
Whenever you measure your indicator, you get just a single point. The SLO should not only set the level but also fix the time period for the computation. “Uptime is 99.99%” is underspecified and not measurable. An almost complete SLO definition is “uptime is 99.99% over 1 month”. To remove the remaining ambiguity (months differ in length, and it is unclear where a month starts), use “uptime is 99.99% over any 30 days”.
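The difference between a calendar month and “any 30 days” can be illustrated with a small sketch. It assumes downtime has already been aggregated into per-day totals in seconds, which is a simplification; the function names `worst_window_downtime` and `meets_slo` are illustrative and not part of any Huawei Cloud tooling.

```python
from collections import deque


def worst_window_downtime(daily_downtime_s, window_days=30):
    """Largest total downtime (seconds) in any run of `window_days` consecutive days."""
    window = deque()
    running = 0.0
    worst = 0.0
    for day_total in daily_downtime_s:
        window.append(day_total)
        running += day_total
        if len(window) > window_days:
            running -= window.popleft()  # slide the window forward by one day
        worst = max(worst, running)
    return worst


def meets_slo(daily_downtime_s, target=0.9999, window_days=30):
    """True if every rolling window stays within the downtime budget."""
    budget_s = (1.0 - target) * window_days * 86_400
    return worst_window_downtime(daily_downtime_s, window_days) <= budget_s


# Hypothetical data: two 200-second outages, five days apart.
days = [0] * 60
days[27] = 200
days[32] = 200
print(worst_window_downtime(days), meets_slo(days))
```

If the two outages fall on either side of a month boundary, each 30-day calendar month sees only 200 seconds of downtime and looks healthy, while the rolling 30-day window sees 400 seconds and exceeds the 99.99% budget.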
Once you set the SLO level and measurement period, you have an SLO budget. E.g. 0.01% of an average month (about 30.44 days) is 4m23s, and 0.01% of a 30-day window is 4m19s. As long as your observed downtime over the last 30 days is less than 4m19s, your service is fine and meets the SLO.
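The budget arithmetic is easy to check in a few lines of Python. This is a minimal sketch; the helper names are illustrative. It shows that a 0.01% budget comes to about 4m19s over a 30-day window and about 4m23s over an average calendar month of 365.25/12 days.

```python
def error_budget_seconds(target, window_days):
    """Allowed 'bad' time, in seconds, for an SLO target over a window."""
    return (1.0 - target) * window_days * 86_400


def format_mmss(seconds):
    """Render a duration as e.g. '4m19s', rounded to the nearest second."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m{secs:02d}s"


print(format_mmss(error_budget_seconds(0.9999, 30)))           # 30-day window
print(format_mmss(error_budget_seconds(0.9999, 365.25 / 12)))  # average month
```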
And the top of the pyramid, the SLA, describes everything together and also defines the compensation in case of violation. For example, the Huawei Cloud Service Level Agreement for Elastic Volume Service first defines the SLI, “Service Unavailability”, as “Service Unavailability means the Elastic Volume Service (EVS) system log shows that EVS was inaccessible for one (1) or more consecutive minutes because of Huawei Cloud”, then sets the SLO: “Huawei Cloud will use commercially and technically reasonable efforts to make the Service Availability Rate of ELB per Service Cycle not less than 99.95%”, and finally the guarantee: “If we fail to meet the commitment above on ELB Service Availability Rate, Huawei Cloud will provide you with the below service credit”.
Ultimately, the SLO budget left is the single number that describes the quality of a service. It doesn’t matter what your service is (encryption key storage, a database or a VPN service) or whether the SLO is 99% availability over a month or 99.999% durability over a year: the percentage of SLO budget left lets you choose between prioritizing new business features and stability fixes for the next sprint. You probably don’t want to make a non-critical update to your system if you only have 10% of the budget left, but it’s fine to take some risk and try to deliver an additional feature if you still have 99% of the budget left.
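As a rough sketch of how the remaining budget could drive that decision: the code below computes the fraction of budget left and maps it to a policy. The 10% threshold mirrors the example in the text; the policy labels and function names are hypothetical, and a real team would choose its own.

```python
def budget_left(target, window_s, observed_bad_s):
    """Fraction of the SLO budget still unspent (negative once the SLO is violated)."""
    budget_s = (1.0 - target) * window_s
    return 1.0 - observed_bad_s / budget_s


def release_policy(left, freeze_below=0.10):
    """Map the remaining budget to a (hypothetical) sprint policy."""
    if left <= 0:
        return "slo violated: stability work only"
    if left < freeze_below:
        return "critical changes only"
    return "feature work allowed"


# 99.9% availability over 30 days with 5 minutes of observed downtime.
left = budget_left(0.999, 30 * 86_400, 5 * 60)
print(f"{left:.1%} of budget left -> {release_policy(left)}")
```

The same two functions apply whether the SLI is availability, durability or a latency threshold, because they only need the target, the window and the amount of “bad” time or events observed.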
When the system has several APIs with different SLOs, each of them has its own budget. It is possible and feasible to make separate decisions based on each individual SLO budget left, but it is advisable to also define a combined metric that shows the overall service health. The combination takes weighted inputs so that instability in less critical or known beta features does not distort the overall view of system health (e.g. exhausting 80% of the budget for VM creation is more critical than exhausting 80% of the VM snapshotting budget).
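One simple way to build such a combined metric is a weighted average of the per-API budget left. The sketch below assumes that approach; the API names and the weights (5 for VM creation, 1 for snapshotting) are made up for illustration, and other combinations (for example, taking the weighted minimum) are equally possible.

```python
def combined_budget_left(budgets_left, weights):
    """Weighted average of per-API budget-left fractions."""
    total_weight = sum(weights[name] for name in budgets_left)
    weighted_sum = sum(budgets_left[name] * weights[name] for name in budgets_left)
    return weighted_sum / total_weight


# Hypothetical weights: VM creation matters five times more than snapshotting.
weights = {"vm_create": 5, "vm_snapshot": 1}

create_exhausted = combined_budget_left({"vm_create": 0.2, "vm_snapshot": 1.0}, weights)
snapshot_exhausted = combined_budget_left({"vm_create": 1.0, "vm_snapshot": 0.2}, weights)
print(f"{create_exhausted:.2f} vs {snapshot_exhausted:.2f}")
```

With these weights, burning 80% of the VM creation budget pulls the combined health down to about a third, while burning 80% of the snapshotting budget leaves it near 87%.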
The set of SLOs for a service should be designed to highlight the service’s strengths and limitations. While a service may have unique requirements, there are a few common SLOs:
1. Availability – uptime, reachability, you name it. It shows how much users can rely on the fact that if they want to connect or send a request to the system, it will respond. The better the system availability, the less users need to worry about handling connection errors, timeouts etc. when accessing your system, but the more you need to worry about potentially disrupting customers with any change on your side.
2. API Success rate – defines system reliability, setting expectations for how much users need to worry about graceful degradation of their product when the system misbehaves. It’s important to keep in mind that an error returned for invalid input is a valid result, but an error returned for valid input is a system error. Success rate should be defined as the rate of expected answers, not just the rate of HTTP 200 responses over all input.
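The distinction between “expected answers” and “HTTP 200 responses” can be sketched as follows. The sketch assumes the service can tell whether a request was valid and that 2xx and 4xx are the expected status classes for valid and invalid input respectively; both are simplifying assumptions, and the `Request` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    valid_input: bool  # was the request well-formed and permitted?
    status: int        # HTTP status code the service returned


def is_expected(req):
    """Valid input should succeed; invalid input should get a client error."""
    if req.valid_input:
        return 200 <= req.status < 300
    return 400 <= req.status < 500


def success_rate(requests):
    """Share of requests that received the expected kind of answer."""
    if not requests:
        return 1.0
    return sum(is_expected(r) for r in requests) / len(requests)


sample = [
    Request(True, 200),   # valid input, served: expected
    Request(False, 400),  # invalid input, rejected: expected
    Request(False, 404),  # invalid input, rejected: expected
    Request(True, 500),   # valid input, server error: system failure
]
print(success_rate(sample))
```

Counting only HTTP 200 responses would report 25% for this sample, while the rate of expected answers is 75%; only the single 500 response is a real failure of the system.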
Every data storage system (database, blob storage, filesystem and so on) should also define:
3. Durability – the guaranteed percentage of data that will still be correctly available after a significant period of time (for a cache system the percentage could be low or the period short, but for permanent storage the percentage should be high and the period long).
4. Access speed – minimal read/write speed for a single operation or user.
The trickiest metrics to define SLOs for are latency metrics. As the complexity of a system’s dependencies grows, it becomes harder to predict and guarantee its behaviour. When a system doesn’t set performance or latency expectations, users just run their own tests and use the measured performance as a baseline, which leads to wrong decisions and makes users unhappy every time your system’s performance deviates for their use case. While it’s easy to provide an average response time, it usually says little about how the system will behave under load: knowing that the system will answer 99% of requests in 100 ms or less is better than knowing that it has a 50 ms average response time. The typical workaround is to set latency expectations in a fine-grained way:
5. API Latency – sets latency for trivial or size-bounded requests;
6. Latency buckets – as an alternative, latency could be defined for various buckets of request or response data size;
7. Stages latency – defines performance in terms of a sequence:
a. Operation start latency (parse request, enqueue, time until the first byte is read etc.);
b. Processing bandwidth;
c. Operation end latency (commit last byte, close file etc.).
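To see why a percentile target says more than an average, consider the sketch below. It uses a nearest-rank percentile and a made-up sample of 100 requests; the numbers are chosen only to reproduce the 50 ms average mentioned above.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_slo(samples, threshold_ms=100, p=99):
    """True if at least p% of requests finished within threshold_ms."""
    return percentile(samples, p) <= threshold_ms


# Hypothetical sample: 95 fast requests and 5 very slow ones.
latencies_ms = [20] * 95 + [620] * 5

print(statistics.mean(latencies_ms))    # average response time
print(percentile(latencies_ms, 99))     # 99th percentile
print(meets_latency_slo(latencies_ms))  # "100 ms or less on 99% of requests"
```

This sample has a 50 ms average, yet its 99th percentile is 620 ms, so it fails a “100 ms or less on 99% of requests” objective. The average alone would hide the fact that one request in twenty is very slow.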
Systems that have or provide long-running operations should be very careful when setting a latency or processing-bandwidth SLO, as it is usually very hard to find relevant, strongly correlated characteristics of requests on which to base the latency boundaries. But users will still try to rely on the observed system behaviour for their queries and requests. Practically, that means the system should at least define a worst-case expectation (e.g. “1GB of data upload should take less than 5 minutes”) while focusing on providing stable performance. E.g. you can’t define processing performance for every complexity of input, but you can maintain similar performance across the various CPU platforms you use.
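A worst-case expectation like the upload example can be derived from the three stage latencies listed earlier. The sketch below assumes a simple additive model (start latency, then transfer at a guaranteed minimum bandwidth, then end latency); the specific figures of 2 s, 3 s and 4 MB/s are hypothetical.

```python
def worst_case_seconds(size_bytes, start_latency_s, min_bandwidth_bps, end_latency_s):
    """Upper bound on operation time under the additive stage model."""
    return start_latency_s + size_bytes / min_bandwidth_bps + end_latency_s


def min_bandwidth_for(size_bytes, limit_s, start_latency_s, end_latency_s):
    """Bandwidth (bytes/s) needed to finish within limit_s under the same model."""
    return size_bytes / (limit_s - start_latency_s - end_latency_s)


one_gb = 10**9
print(worst_case_seconds(one_gb, 2.0, 4_000_000, 3.0))  # seconds for a 1 GB upload
print(min_bandwidth_for(one_gb, 300, 2.0, 3.0))         # bytes/s needed for the 5-minute bound
```

Under these assumed figures, a 1 GB upload takes at most 255 seconds, which fits the 5-minute bound, and the bound would still hold down to roughly 3.4 MB/s of guaranteed bandwidth.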
By finding the right metrics and setting the right goals for them, you tell your customers what your system can and can’t do, you shape their expectations, and you enable them to build anything they can imagine on your platform, without hurting your system operations or undermining users’ trust. Thus, SLO is a universal language in which system architects communicate with each other, empowering and promoting the best possible designs and meeting everyone’s goals as simply as possible, but not simpler.