Java zero-copy benchmark
I will walk through this with some Java code samples that merge several large files into a single destination file. For the merging code I will use two different approaches:
- the NIO API (zero copy)
- the IO API
To dig into why zero copy performs better, I will benchmark both approaches with jmh. Looking at the results, I will point out some numbers that show why the zero-copy approach comes out ahead.
The code
NIO
It relies on FileChannel#transferTo from the NIO API, which uses the sendfile() syscall. With this call the kernel takes care of reading the data from the source and writing it to the destination without ever leaving kernel space, hence zero copy.
@Benchmark
public void mergeNIO() throws Exception {
    var outFile = Files.createTempFile("benchmark", ".file");
    // CREATE, WRITE, DELETE_ON_CLOSE and READ are static imports from StandardOpenOption
    try (var out = FileChannel.open(outFile, CREATE, WRITE, DELETE_ON_CLOSE)) {
        for (var f : files) {
            try (FileChannel in = FileChannel.open(f.toPath(), READ)) {
                // transferTo may copy fewer bytes than requested, so loop until the whole file is transferred
                for (long p = 0, l = in.size(); p < l; ) {
                    p += in.transferTo(p, l - p, out);
                }
            }
        }
    }
}
IO
It reads the data chunk by chunk from an InputStream into a buffer and feeds it into an OutputStream. Under the hood this performs a read() syscall and a write() syscall, plus buffer copies from kernel space to user space and back again.
@Benchmark
public void mergeIO() throws Exception {
    var outFile = Files.createTempFile("benchmark", ".file");
    try (var out = Files.newOutputStream(outFile, CREATE, WRITE, DELETE_ON_CLOSE)) {
        for (var f : files) {
            try (var is = Files.newInputStream(f.toPath())) {
                // copy each source file through a 16 KiB user-space buffer
                byte[] buffer = new byte[16 * 1024];
                int read;
                while ((read = is.read(buffer, 0, buffer.length)) >= 0) {
                    out.write(buffer, 0, read);
                }
            }
        }
    }
}
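Both benchmark methods refer to a files field that is not shown in the article. As a rough sketch only, assuming files is a list of pre-created temporary files held in a JMH state object (the file count and sizes below are made up for illustration, not the ones behind the results reported later), the surrounding class could look like this:

package org.example;

import org.openjdk.jmh.annotations.*;

import java.io.File;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

import static java.nio.file.StandardOpenOption.*;

// Hypothetical scaffolding; the original class layout is not shown in the article.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.SECONDS)
public class FileBenchmark {

    List<File> files;

    @Setup(Level.Trial)
    public void setup() throws Exception {
        files = new ArrayList<>();
        byte[] chunk = new byte[1024 * 1024];  // 1 MiB of zeroes per write
        for (int i = 0; i < 4; i++) {          // 4 source files of 256 MiB each (illustrative)
            var path = Files.createTempFile("benchmark-src", ".file");
            try (var out = Files.newOutputStream(path)) {
                for (int mb = 0; mb < 256; mb++) {
                    out.write(chunk);
                }
            }
            path.toFile().deleteOnExit();
            files.add(path.toFile());
        }
    }

    // mergeNIO() and mergeIO() shown above go here
}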
Benchmark results
For the jmh execution I used the gradle plugin me.champeau.jmh with the following profilers configuration:
jmh {
    profilers = ["gc", "perf", "perfasm"]
}
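For completeness, the plugin can be applied in build.gradle roughly like this (the version number is an assumption, not taken from the article):

plugins {
    id "java"
    id "me.champeau.jmh" version "0.7.2"   // version is an assumption
}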
NIO
Summary
Benchmark Mode Cnt Score Error Units
FileBenchmark.mergeNIO avgt 25 3.733 ± 0.145 s/op
FileBenchmark.mergeNIO:·cpi avgt 0.872 clks/insn
FileBenchmark.mergeNIO:·gc.alloc.rate avgt 25 0.002 ± 0.001 MB/sec
FileBenchmark.mergeNIO:·gc.alloc.rate.norm avgt 25 7559.760 ± 44.878 B/op
FileBenchmark.mergeNIO:·gc.count avgt 25 ≈ 0 counts
FileBenchmark.mergeNIO:·ipc avgt 1.147 insns/clk
Perf
Secondary result "org.example.FileBenchmark.mergeNIO:·perf":
Perf stats:
--------------------------------------------------
56527,762660 task-clock (msec) # 0,720 CPUs utilized
4.268 context-switches # 0,076 K/sec
378 cpu-migrations # 0,007 K/sec
385 page-faults # 0,007 K/sec
44.583.533.420 cycles # 0,789 GHz (30,73%)
51.560.575.148 instructions # 1,16 insn per cycle (38,44%)
9.135.306.805 branches # 161,607 M/sec (38,46%)
110.707.654 branch-misses # 1,21% of all branches (38,45%)
15.420.453.382 L1-dcache-loads # 272,794 M/sec (24,72%)
1.013.435.761 L1-dcache-load-misses # 6,57% of all L1-dcache hits (15,43%)
283.399.957 LLC-loads # 5,013 M/sec (15,45%)
98.132.015 LLC-load-misses # 34,63% of all LL-cache hits (23,11%)
<not supported> L1-icache-loads
303.670.304 L1-icache-load-misses (30,80%)
15.177.454.388 dTLB-loads # 268,496 M/sec (29,93%)
3.568.314 dTLB-load-misses # 0,02% of all dTLB cache hits (16,71%)
453.513 iTLB-loads # 0,008 M/sec (15,37%)
195.070 iTLB-load-misses # 43,01% of all iTLB cache hits (23,03%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
78,491809508 seconds time elapsed
PerfASM
....[Hottest Methods (after inlining)]..........................................................
98.59% [kernel.kallsyms] [unknown]
0.19% libjvm.so ElfSymbolTable::lookup
0.14% <unknown>
0.12% libc-2.27.so vfprintf
0.07% libc-2.27.so _IO_fwrite
0.04% libc-2.27.so _IO_default_xsputn
0.03% libjvm.so outputStream::do_vsnprintf_and_write_with_automatic_buffer
0.03% libpthread-2.27.so __libc_write
0.03% libjvm.so xmlStream::write_text
0.03% libjvm.so defaultStream::hold
0.02% libjvm.so stringStream::write
0.02% libjvm.so outputStream::update_position
0.02% libc-2.27.so syscall
0.02% libjvm.so defaultStream::write
0.02% ld-2.27.so __tls_get_addr
0.02% libjvm.so fileStream::write
0.02% libc-2.27.so vsnprintf
0.02% libjvm.so RelocIterator::initialize
0.02% libc-2.27.so [unknown]
0.02% libjvm.so outputStream::print
0.55% <...other 207 warm methods...>
................................................................................................
100.00% <totals>
IO
Summary
Benchmark Mode Cnt Score Error Units
FileBenchmark.mergeIO avgt 25 5.028 ± 0.183 s/op
FileBenchmark.mergeIO:·cpi avgt 0.966 clks/insn
FileBenchmark.mergeIO:·gc.alloc.rate avgt 25 0.029 ± 0.001 MB/sec
FileBenchmark.mergeIO:·gc.alloc.rate.norm avgt 25 154191.947 ± 51.517 B/op
FileBenchmark.mergeIO:·gc.count avgt 25 ≈ 0 counts
FileBenchmark.mergeIO:·ipc avgt 1.035 insns/clk
Perf
Secondary result "org.example.FileBenchmark.mergeIO:·perf":
Perf stats:
--------------------------------------------------
72972,795879 task-clock (msec) # 0,772 CPUs utilized
4.971 context-switches # 0,068 K/sec
546 cpu-migrations # 0,007 K/sec
911 page-faults # 0,012 K/sec
57.575.878.331 cycles # 0,789 GHz (30,73%)
60.626.232.521 instructions # 1,05 insn per cycle (38,43%)
10.920.368.865 branches # 149,650 M/sec (38,40%)
129.224.774 branch-misses # 1,18% of all branches (38,44%)
18.526.168.088 L1-dcache-loads # 253,878 M/sec (25,30%)
2.885.163.789 L1-dcache-load-misses # 15,57% of all L1-dcache hits (17,75%)
306.897.902 LLC-loads # 4,206 M/sec (17,32%)
103.565.396 LLC-load-misses # 33,75% of all LL-cache hits (23,08%)
<not supported> L1-icache-loads
438.074.938 L1-icache-load-misses (30,77%)
18.537.122.526 dTLB-loads # 254,028 M/sec (26,65%)
8.112.288 dTLB-load-misses # 0,04% of all dTLB cache hits (18,35%)
2.404.472 iTLB-loads # 0,033 M/sec (15,39%)
8.051.716 iTLB-load-misses # 334,86% of all iTLB cache hits (23,05%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
94,524422154 seconds time elapsed
PerfASM
....[Hottest Methods (after inlining)]..........................................................
88.57% [kernel.kallsyms] [unknown]
6.70% runtime stub StubRoutines::jlong_disjoint_arraycopy
0.97% c2, level 4 java.nio.channels.Channels::writeFully, version 2, compile id 697
0.92% c2, level 4 sun.nio.ch.ChannelInputStream::read, version 2, compile id 706
0.24% Unknown, level 0 sun.nio.ch.NativeThread::current, version 1, compile id 617
0.23% Unknown, level 0 sun.nio.ch.FileDispatcherImpl::write0, version 1, compile id 656
0.22% c2, level 4 org.example.FileBenchmark::mergeIO, version 4, compile id 725
0.19% <unknown>
0.18% libpthread-2.27.so __libc_write
0.18% libpthread-2.27.so __pthread_disable_asynccancel
0.15% libpthread-2.27.so __pthread_enable_asynccancel
0.13% libjvm.so ElfSymbolTable::lookup
0.13% libpthread-2.27.so __libc_read
0.11% Unknown, level 0 sun.nio.ch.FileDispatcherImpl::read0, version 1, compile id 654
0.09% libc-2.27.so vfprintf
0.08% libnio.so fdval
0.06% libnio.so Java_sun_nio_ch_FileDispatcherImpl_write0
0.04% libc-2.27.so _IO_fwrite
0.03% libc-2.27.so _IO_default_xsputn
0.03% libjvm.so xmlStream::write_text
0.76% <...other 201 warm methods...>
................................................................................................
100.00% <totals>
Conclusion
Looking at the two summary reports, we can see that the NIO approach is faster, 3.733 s/op vs 5.028 s/op, and that it needs almost no heap allocation compared to the IO approach (0.002 MB/sec vs 0.029 MB/sec).
Looking at the perf and perfASM reports, we can see why NIO is faster:
- It causes fewer page faults, 385 vs 911, thanks to the absence of read() syscalls
- It spends no time on array-copy operations (StubRoutines::jlong_disjoint_arraycopy), living up to the zero-copy name 😄