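How ChaosBlade Works

Posted by zuozewei on 2021/09/28 12:31:45

[Abstract] In English, "chaos" simply means disorder, which is quite different from the Chinese concept of hundun (混沌). Using that word as the translation sells short the meaning hundun is supposed to carry. So what is chaos engineering, and how should it be implemented at each layer? The difficulty is not in building the tools. The real questions are what the implementation logic actually is and whether it faithfully reproduces production scenarios. The core of the work is therefore scenario design.

Definition of Chaos Engineering

The Principles of Chaos Engineering define it as follows: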

Chaos Engineering is the discipline of experimenting on a system in
order to build confidence in the system’s capability to withstand
turbulent conditions in production.

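The Chinese translation reads:

"Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production."
The English text quoted above does not seem to mention distributed systems, so the Chinese translation appears to narrow the scope.

It comes with the following principles:

Build a hypothesis around steady-state behavior
Vary real-world events
Run experiments in production
Automate experiments to run continuously
Minimize blast radius

Some of the newer terms here are quite interesting. Some people also take care to distinguish chaos engineering from exception testing, fault testing, and the like. Concepts do need to be worked out first: they run ahead of the technology and give it a direction, while putting them into practice always takes some time.

ChaosBlade, an open-source tool from Alibaba, is said to have the characteristics of a chaos engineering tool. Let's look at what it can do.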

Download and Extract

The tool is very simple: download it, extract it, and it is ready to use.

[gaolou@7dgroup2 ~]$ wget -c https://github.com/chaosblade-io/chaosblade/releases/download/v0.2.0/chaosblade-0.2.0.linux-amd64.tar.gz
[gaolou@7dgroup2 ~]$ tar zxvf chaosblade-0.2.0.linux-amd64.tar.gz

Usage and Implementation

Simulating CPU Load

[gaolou@7dgroup2 chaosblade-0.2.0]$ ./blade  create cpu fullload
{"code":200,"success":true,"result":"cb6300fd4899c537"}
[gaolou@7dgroup2 chaosblade-0.2.0]$

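Checking the effect:

The monitoring output (figure not shown) confirms that user-mode (us) CPU usage is indeed consumed.

Now let's look at how this is implemented.

The logic lives in the burnCpu code. The key source, the runBurnCpu function, is as follows: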

func runBurnCpu(ctx context.Context, cpuCount int, cpuPercent int, pidNeeded bool, processor string) int {
  args := fmt.Sprintf(`%s --nohup --cpu-count %d --cpu-percent %d`,
    path.Join(util.GetProgramPath(), burnCpuBin), cpuCount, cpuPercent)
  if pidNeeded {
    args = fmt.Sprintf("%s --cpu-processor %s", args, processor)
  }
  args = fmt.Sprintf(`%s > /dev/null 2>&1 &`, args)
  response := channel.Run(ctx, "nohup", args)
  if !response.Success {
    stopBurnCpuFunc()
    bin.PrintErrAndExit(response.Err)
  }
  if pidNeeded {
    // parse pid
    newCtx := context.WithValue(context.Background(), exec.ProcessKey, fmt.Sprintf("cpu-processor %s", processor))
    pids, err := exec.GetPidsByProcessName(burnCpuBin, newCtx)
    if err != nil {
      stopBurnCpuFunc()
      bin.PrintErrAndExit(fmt.Sprintf("bind cpu core failed, cannot get the burning program pid, %v", err))
    }
    if len(pids) > 0 {
      // return the first one
      pid, err := strconv.Atoi(pids[0])
      if err != nil {
        stopBurnCpuFunc()
        bin.PrintErrAndExit(fmt.Sprintf("bind cpu core failed, get pid failed, pids: %v, err: %v", pids, err))
      }
      return pid
    }
  }
  return -1
}

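I won't paste the rest of the related code. In short, ChaosBlade ships a small program that burns CPU, which is something a simple busy loop (a do-while) can do.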

Simulating High IO

[root@7dgroup2 chaosblade-0.2.0]# ./blade create disk burn --write --read  --size 10 --count 1024  --timeout 300
{"code":200,"success":true,"result":"f026b3510722685d"}

Checking the effect:

[root@7dgroup2 chaosblade-0.2.0]#

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00    91.00  250.00  815.00 84892.00 92588.00   333.30    43.92   39.27   41.60   38.56   0.93  99.50
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               1.00   105.00  496.00  865.00 98012.00 92692.00   280.24    43.72   34.02   33.40   34.37   0.73  99.40
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.99   106.93  259.41  675.25 99853.47 91750.50   410.00    36.22   38.53   47.09   35.24   1.06  98.81
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00    80.00  241.00 1103.00 116340.00 82296.00   295.59    44.06   33.03   47.92   29.78   0.74  99.90
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

The output above shows that the disk IO is indeed saturated (%util close to 100). Next, let's see how the load is generated.

TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
24036 be/4 root      104.55 M/s    0.00 B/s  0.00 % 99.99 % dd if=/dev/vda1 of=/dev/null~ iflag=dsync,direct,fullblock
24034 be/4 root        0.00 B/s  104.55 M/s  0.00 % 68.17 % dd if=/dev/zero of=/tmp/chao~bs=10M count=1024 oflag=dsync

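Listing the processes with the highest IO reveals these two dd processes. In other words, ChaosBlade simulates high IO by calling dd. The key implementation code is as follows: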

// write burn
func burnWrite(size, count string) {
  for {
    args := fmt.Sprintf(`if=/dev/zero of=%s bs=%sM count=%s oflag=dsync`, tmpDataFile, size, count)
    response := channel.Run(context.Background(), "dd", args)
    channel.Run(context.Background(), "rm", fmt.Sprintf(`-rf %s`, tmpDataFile))
    if !response.Success {
      bin.PrintAndExitWithErrPrefix(response.Err)
      return
    }
  }
}
// read burn
func burnRead(fileSystem, size, count string) {
  for {
    // "if" arg in dd command is file system value, but "of" arg value is related to mount point
    args := fmt.Sprintf(`if=%s of=/dev/null bs=%sM count=%s iflag=dsync,direct,fullblock`, fileSystem, size, count)
    response := channel.Run(context.Background(), "dd", args)
    if !response.Success {
      bin.PrintAndExitWithErrPrefix(fmt.Sprintf("The file system named %s is not supported or %s", fileSystem, response.Err))
    }
  }
}

One function reads and the other writes.

Simulating an Unreachable Port

Before the simulation

(base) GaoLouMac:~ Zee$ telnet 101.201.210.163 9100
Trying 101.201.210.163...

Connected to 101.201.210.163.
Escape character is '^]'.

As shown, the port is reachable.

Blocking the port

[root@7dgroup2 chaosblade-0.2.0]# ./blade create network drop --local-port 9100
{"code":200,"success":true,"result":"55321ca383ef272c"}
[root@7dgroup2 chaosblade-0.2.0]#

After the simulation

The port can no longer be reached:

(base) GaoLouMac:~ Zee$  telnet 101.201.210.163 9100
Trying 101.201.210.163...
telnet: connect to address 101.201.210.163: Operation timed out
telnet: Unable to connect to remote host
(base) GaoLouMac:~ Zee$

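But how is the port actually made unreachable?

Implementation code

As the code below shows, ChaosBlade blocks the port by using the iptables command to add DROP rules.

The following excerpt is from dropnetwork.go. Note that the -D flag deletes a rule, so this excerpt is the cleanup side; the rules themselves are presumably added with the same arguments under -A: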

if localPort != "" {
  channel.Run(ctx, "iptables", fmt.Sprintf(`-D INPUT -p tcp --dport %s -j DROP`, localPort))
  channel.Run(ctx, "iptables", fmt.Sprintf(`-D INPUT -p udp --dport %s -j DROP`, localPort))
}

iptables configuration:

[root@7dgroup2 chaosblade-0.2.0]# iptables -L -n|grep 9100
DROP       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9100
DROP       udp  --  0.0.0.0/0            0.0.0.0/0            udp dpt:9100
[root@7dgroup2 chaosblade-0.2.0]#

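Querying the iptables rules shows that ChaosBlade added two entries that drop all TCP and UDP packets destined for port 9100. Note that the change is only temporary: the rules take effect in the running system but are not recorded in the iptables configuration file.

What does this simulation actually look like on the wire?

Analysis of the Simulated Effect

Packet capture before the simulation: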

[root@7dgroup2 ~]# tcpdump -i eth0 port 9000
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
18:40:19.162485 IP 61.148.243.67.9485 > 7dgroup2.cslistener: Flags [S], seq 4090540787, win 65535, options [mss 1400,nop,wscale 5,nop,nop,TS val 1187658956 ecr 0,sackOK,eol], length 0
18:40:19.162592 IP 7dgroup2.cslistener > 61.148.243.67.9485: Flags [S.], seq 3080683668, ack 4090540788, win 28960, options [mss 1460,sackOK,TS val 871980746 ecr 1187658956,nop,wscale 7], length 0
18:40:19.202395 IP 61.148.243.67.9485 > 7dgroup2.cslistener: Flags [.], ack 1, win 4120, options [nop,nop,TS val 1187658998 ecr 871980746], length 0

// The packets above are the connection setup
// The packets below are the connection teardown

18:40:51.771422 IP 61.148.243.67.9485 > 7dgroup2.cslistener: Flags [P.], seq 1:7, ack 1, win 4120, options [nop,nop,TS val 1187690315 ecr 871980746], length 6
18:40:51.771534 IP 7dgroup2.cslistener > 61.148.243.67.9485: Flags [.], ack 7, win 227, options [nop,nop,TS val 872013355 ecr 1187690315], length 0
18:40:51.772024 IP 7dgroup2.cslistener > 61.148.243.67.9485: Flags [P.], seq 1:99, ack 7, win 227, options [nop,nop,TS val 872013355 ecr 1187690315], length 98
18:40:51.772062 IP 7dgroup2.cslistener > 61.148.243.67.9485: Flags [F.], seq 99, ack 7, win 227, options [nop,nop,TS val 872013355 ecr 1187690315], length 0
18:40:51.821279 IP 61.148.243.67.9485 > 7dgroup2.cslistener: Flags [.], ack 99, win 4117, options [nop,nop,TS val 1187690362 ecr 872013355], length 0
18:40:51.821336 IP 61.148.243.67.9485 > 7dgroup2.cslistener: Flags [.], ack 100, win 4117, options [nop,nop,TS val 1187690362 ecr 872013355], length 0
18:40:51.821355 IP 61.148.243.67.9485 > 7dgroup2.cslistener: Flags [F.], seq 7, ack 100, win 4117, options [nop,nop,TS val 1187690364 ecr 872013355], length 0
18:40:51.821380 IP 7dgroup2.cslistener > 61.148.243.67.9485: Flags [.], ack 8, win 227, options [nop,nop,TS val 872013404 ecr 1187690364], length 0

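As the capture shows, before the iptables rules are created, communication is completely normal: a standard TCP three-way handshake and a standard connection teardown.

Packet capture after the simulation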

18:43:12.531311 IP 61.148.243.67.9486 > 7dgroup2.cslistener: Flags [S], seq 899103396, win 65535, options [mss 1400,nop,wscale 5,nop,nop,TS val 1187826295 ecr 0,sackOK,eol], length 0
18:43:13.551168 IP 61.148.243.67.9486 > 7dgroup2.cslistener: Flags [S], seq 899103396, win 65535, options [mss 1400,nop,wscale 5,nop,nop,TS val 1187827296 ecr 0,sackOK,eol], length 0
18:43:14.611149 IP 61.148.243.67.9486 > 7dgroup2.cslistener: Flags [S], seq 899103396, win 65535, options [mss 1400,nop,wscale 5,nop,nop,TS val 1187828296 ecr 0,sackOK,eol], length 0
18:43:15.582777 IP 61.148.243.67.9486 > 7dgroup2.cslistener: Flags [S], seq 899103396, win 65535, options [mss 1400,nop,wscale 5,nop,nop,TS val 1187829296 ecr 0,sackOK,eol], length 0
18:43:16.622832 IP 61.148.243.67.9486 > 7dgroup2.cslistener: Flags [S], seq 899103396, win 65535, options [mss 1400,nop,wscale 5,nop,nop,TS val 1187830296 ecr 0,sackOK,eol], length 0
18:43:17.654309 IP 61.148.243.67.9486 > 7dgroup2.cslistener: Flags [S], seq 899103396, win 65535, options [mss 1400,nop,wscale 5,nop,nop,TS val 1187831296 ecr 0,sackOK,eol], length 0
18:43:19.691527 IP 61.148.243.67.9486 > 7dgroup2.cslistener: Flags [S], seq 899103396, win 65535, options [mss 1400,nop,wscale 5,nop,nop,TS val 1187833296 ecr 0,sackOK,eol], length 0
18:43:23.741290 IP 61.148.243.67.9486 > 7dgroup2.cslistener: Flags [S], seq 899103396, win 65535, options [mss 1400,nop,wscale 5,nop,nop,TS val 1187837296 ecr 0,sackOK,eol], length 0
18:43:31.761123 IP 61.148.243.67.9486 > 7dgroup2.cslistener: Flags [S], seq 899103396, win 65535, options [mss 1400,nop,wscale 5,nop,nop,TS val 1187845296 ecr 0,sackOK,eol], length 0
18:43:48.062869 IP 61.148.243.67.9487 > 7dgroup2.cslistener: Flags [S], seq 899103396, win 65535, options [mss 1400,nop,wscale 5,nop,nop,TS val 1187861296 ecr 0,sackOK,eol], length 0
18:44:20.852129 IP 61.148.243.67.9705 > 7dgroup2.cslistener: Flags [S], seq 899103396, win 65535, options [mss 1400,sackOK,eol], length 0

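After the iptables rules are created, we try to connect again in the same way. The server still captures the SYN packets, but it never answers them, so the client keeps retransmitting the same SYN.

(At this point, anyone with a bit of security awareness can probably see where the risk lies; the attack scenario comes to mind immediately.)

In production, it is very rare for a real application problem to cut things off at the TCP handshake like this. It tends to show up only when something is wrong with half-open TCP connections.

If you want to simulate connection problems at the application layer, ChaosBlade cannot do that.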

Simulating Packet Loss

Command

[root@7dgroup2 chaosblade-0.2.0]# ./blade create network loss --interface eth0 --percent 50
{"code":200,"success":true,"result":"c29053229c16c839"}
[root@7dgroup2 chaosblade-0.2.0]#

Effect

(base) GaoLouMac:~ Zee$ ping 101.201.210.163
PING 101.201.210.163 (101.201.210.163): 56 data bytes
64 bytes from 101.201.210.163: icmp_seq=0 ttl=50 time=95.615 ms
64 bytes from 101.201.210.163: icmp_seq=1 ttl=50 time=78.823 ms
Request timeout for icmp_seq 2
Request timeout for icmp_seq 3
64 bytes from 101.201.210.163: icmp_seq=4 ttl=50 time=127.879 ms
64 bytes from 101.201.210.163: icmp_seq=5 ttl=50 time=123.282 ms
64 bytes from 101.201.210.163: icmp_seq=6 ttl=50 time=129.193 ms
Request timeout for icmp_seq 7
Request timeout for icmp_seq 8
64 bytes from 101.201.210.163: icmp_seq=9 ttl=50 time=123.712 ms
Request timeout for icmp_seq 10
64 bytes from 101.201.210.163: icmp_seq=11 ttl=50 time=36.746 ms
64 bytes from 101.201.210.163: icmp_seq=12 ttl=50 time=114.155 ms
Request timeout for icmp_seq 13
Request timeout for icmp_seq 14
64 bytes from 101.201.210.163: icmp_seq=15 ttl=50 time=91.469 ms
Request timeout for icmp_seq 16
64 bytes from 101.201.210.163: icmp_seq=17 ttl=50 time=56.911 ms
64 bytes from 101.201.210.163: icmp_seq=18 ttl=50 time=113.380 ms
Request timeout for icmp_seq 19
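
In the capture above, 9 of the 20 probes (icmp_seq 0 to 19) time out, an observed loss of 45%, which is close to the configured 50%. The Go sketch below shows one way to compute that figure from ping output. The function name lossRate is hypothetical, and the parsing assumes the macOS-style lines shown above.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// lossRate (hypothetical name) counts reply and timeout lines in
// macOS-style ping output and returns probes sent, probes lost, and the
// loss percentage.
func lossRate(output string) (sent, lost int, percent float64) {
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case strings.HasPrefix(line, "Request timeout for icmp_seq"):
			sent++
			lost++
		case strings.Contains(line, "bytes from") && strings.Contains(line, "icmp_seq="):
			sent++
		}
	}
	if sent == 0 {
		return 0, 0, 0
	}
	return sent, lost, 100 * float64(lost) / float64(sent)
}

func main() {
	// A four-line sample in the same format as the capture above.
	sample := `64 bytes from 101.201.210.163: icmp_seq=0 ttl=50 time=95.615 ms
Request timeout for icmp_seq 2
Request timeout for icmp_seq 3
64 bytes from 101.201.210.163: icmp_seq=4 ttl=50 time=127.879 ms`
	sent, lost, pct := lossRate(sample)
	fmt.Printf("sent=%d lost=%d loss=%.0f%%\n", sent, lost, pct)
}
```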

Code implementation

// addQdiscForLoss
func addQdiscForLoss(channel exec.Channel, ctx context.Context, netInterface string, percent string) *transport.Response {
  // invoke tc qdisc add dev ${networkPort} root handle 1: prio bands 4
  response := channel.Run(ctx, "tc", fmt.Sprintf(`qdisc add dev %s root handle 1: prio bands 4`, netInterface))
  if !response.Success {
    // invoke stop
    stopLossNetFunc(netInterface)
    bin.PrintErrAndExit(response.Err)
    return response
  }
  response = channel.Run(ctx, "tc", fmt.Sprintf(`qdisc add dev %s parent 1:4 handle 40: netem loss %s%%`, netInterface, percent))
  if !response.Success {
    // invoke stop
    stopLossNetFunc(netInterface)
    bin.PrintErrAndExit(response.Err)
    return response
  }
  return response
}

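As the code above shows, ChaosBlade implements packet loss through traffic control (tc), adding queuing disciplines, classes, and filters. In other words, it uses tc's netem loss.

Simulating Network Delay

Command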

[root@7dgroup2 chaosblade-0.2.0]# ./blade create network delay --interface eth0 --time 3000
{"code":200,"success":true,"result":"b9e568d93dcbb5cb"}
[root@7dgroup2 chaosblade-0.2.0]#

Effect

(base) GaoLouMac:~ Zee$ telnet 101.201.210.163 9100
Trying 101.201.210.163...

// There is a three-second delay here

Connected to 101.201.210.163.
Escape character is '^]'.

Code implementation

func startDelayNet(netInterface, time, offset, localPort, remotePort, excludePort string) {
  ctx := context.Background()
  // assert localPort and remotePort
  if localPort == "" && remotePort == "" && excludePort == "" {
    response := channel.Run(ctx, "tc", fmt.Sprintf(`qdisc add dev %s root netem delay %sms %sms`, netInterface, time, offset))
    if !response.Success {
      bin.PrintErrAndExit(response.Err)
    }
    bin.PrintOutputAndExit(response.Result.(string))
    return
  }
  response := addQdiscForDelay(channel, ctx, netInterface, time, offset)
  if localPort == "" && remotePort == "" && excludePort != "" {
    response = addExcludePortFilterForDelay(excludePort, netInterface, response, channel, ctx)
    bin.PrintOutputAndExit(response.Result.(string))
    return
  }
  response = addLocalOrRemotePortForDelay(localPort, response, channel, ctx, netInterface, remotePort)
  bin.PrintOutputAndExit(response.Result.(string))
}

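As the code above shows, ChaosBlade implements network delay the same way, through traffic control queuing disciplines, classes, and filters. In other words, it uses tc's netem delay.

In short, ChaosBlade relies on tc to simulate both packet loss and delay.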
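
The underlying tc invocations are short enough to reproduce by hand. The Go sketch below only formats and prints them. The delay form mirrors the root-level command in startDelayNet above; the root-level loss form is a simplification of the prio-based setup in addQdiscForLoss and is an assumption, as are the helper names.

```go
package main

import "fmt"

// delayCommand mirrors the root-level form used in startDelayNet:
// tc qdisc add dev <dev> root netem delay <time>ms <offset>ms
func delayCommand(dev string, timeMs, offsetMs int) string {
	return fmt.Sprintf("tc qdisc add dev %s root netem delay %dms %dms", dev, timeMs, offsetMs)
}

// lossCommand is the analogous root-level form for packet loss
// (an assumed simplification; the excerpt above attaches netem under a prio qdisc).
func lossCommand(dev string, percent int) string {
	return fmt.Sprintf("tc qdisc add dev %s root netem loss %d%%", dev, percent)
}

// cleanupCommand removes the root qdisc and so ends the experiment.
func cleanupCommand(dev string) string {
	return fmt.Sprintf("tc qdisc del dev %s root", dev)
}

func main() {
	// Dry run: print the commands instead of executing them.
	fmt.Println(delayCommand("eth0", 3000, 0))
	fmt.Println(lossCommand("eth0", 50))
	fmt.Println(cleanupCommand("eth0"))
}
```

The printed commands need root to run. A device holds only one root qdisc at a time, so the cleanup command has to run before switching from a delay experiment to a loss experiment.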

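Summary

ChaosBlade can really be seen as a toolkit that bundles a variety of small tools.

The "chaos" label is still a size too big for this tool. To use it for simulations across thousands or tens of thousands of nodes, it has to be combined with other tooling for integration, configuration, remote execution, and so on.

Look back at the principles of chaos engineering listed above. Do these simulations satisfy them? Anyone with experience handling production incidents will know that simulations like these behave differently from high CPU or high IO in a real environment.

When we ask whether an application stays robust under high CPU, we can mean one of two things:

Whether the application under test stays robust while other programs are consuming a lot of CPU.
Whether the application under test stays robust while its own code is consuming a lot of CPU.

Those who have dealt with such problems in production will know that the first case almost never occurs, except as a result of poor deployment. This is the case ChaosBlade simulates. The second case is something ChaosBlade cannot do yet.

Yet the second case is the one that matters most in testing.

In English, "chaos" simply means disorder, which is quite different from the Chinese concept of hundun (混沌). Using that word as the translation sells short the meaning hundun is supposed to carry.

So what is chaos engineering, and how should it be implemented at each layer? The difficulty is not in building the tools. The real questions are what the implementation logic actually is and whether it faithfully reproduces production scenarios.

So the core of it all is scenario design.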

