Java Zero-Copy Benchmark

Posted by 千锋教育 on 2023/07/19 13:39:32

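I will illustrate zero-copy with some Java code examples that merge large files into a single destination file. For the merge code I use two different approaches:

  • Using the NIO API (zero-copy)
  • Using the IO API

To dig into why zero-copy performs better, I benchmark both approaches with jmh. Looking at the results, I will point out a few numbers that show why the zero-copy approach does better.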
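The benchmark methods iterate over a `files` collection (a list of `java.io.File`, judging by the `f.toPath()` calls) whose setup the post does not show. The following is a minimal sketch of how such inputs could be generated. The class name `TestFiles`, the file count, the file size and the seed are all hypothetical and are not taken from the original benchmark.

```java
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical helper: creates input files for a merge benchmark. */
public final class TestFiles {

    /** Creates {@code count} temp files of {@code sizeBytes} each, filled with pseudo-random bytes. */
    public static List<File> create(int count, long sizeBytes, long seed) throws IOException {
        Random random = new Random(seed);
        byte[] chunk = new byte[64 * 1024];
        List<File> files = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            Path path = Files.createTempFile("benchmark-input-" + i + "-", ".bin");
            path.toFile().deleteOnExit(); // clean up when the JVM exits
            try (OutputStream out = Files.newOutputStream(path)) {
                long remaining = sizeBytes;
                while (remaining > 0) {
                    random.nextBytes(chunk);
                    int n = (int) Math.min(chunk.length, remaining);
                    out.write(chunk, 0, n);
                    remaining -= n;
                }
            }
            files.add(path.toFile());
        }
        return files;
    }
}
```

In a jmh benchmark a helper like this would typically be called once from a setup method, so that file creation is not part of the measured time.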

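Code

NIO

This version relies on the NIO API FileChannel#transferTo, which on Linux can be backed by the sendfile() syscall. With this call the kernel reads the data from the source and writes it to the destination without the data ever leaving kernel space, which is what makes it zero-copy.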

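// Assumes static imports of java.nio.file.StandardOpenOption.* (CREATE, WRITE, READ, DELETE_ON_CLOSE).
@Benchmark
public void mergeNIO() throws Exception {
    var outFile = Files.createTempFile("benchmark", ".file");
    // DELETE_ON_CLOSE removes the merged file once the channel is closed.
    try (var out = FileChannel.open(outFile, CREATE, WRITE, DELETE_ON_CLOSE)) {
        for (var f : files) {
            try (FileChannel in = FileChannel.open(f.toPath(), READ)) {
                // transferTo may move fewer bytes than requested, so loop until the whole file is sent.
                for (long p = 0, l = in.size(); p < l; ) {
                    p += in.transferTo(p, l - p, out);
                }
            }
        }
    }
}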

IO

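This version reads the data chunk by chunk from an InputStream into a buffer and feeds each chunk to an OutputStream. Under the hood, every chunk costs one read() syscall and one write() syscall, plus a buffer copy from kernel space to user space and another one back.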

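// Assumes static imports of java.nio.file.StandardOpenOption.* (CREATE, WRITE, DELETE_ON_CLOSE).
@Benchmark
public void mergeIO() throws Exception {
    var outFile = Files.createTempFile("benchmark", ".file");
    try (var out = Files.newOutputStream(outFile, CREATE, WRITE, DELETE_ON_CLOSE)) {
        for (var f : files) {
            try (var is = Files.newInputStream(f.toPath())) {
                // 16 KiB user-space buffer that every chunk passes through.
                byte[] buffer = new byte[16 * 1024];
                int read;
                // read() returns -1 at end of stream.
                while ((read = is.read(buffer, 0, buffer.length)) >= 0) {
                    out.write(buffer, 0, read);
                }
            }
        }
    }
}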
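Before comparing timings, it is worth checking that both merge strategies produce the same bytes. The following is a minimal sketch outside of jmh. The class `MergeCheck` and its method names are hypothetical, the targets are ordinary files instead of DELETE_ON_CLOSE temp files so they can be compared, and the comparison loads both outputs into memory, so it is only meant for small test inputs.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

import static java.nio.file.StandardOpenOption.CREATE;
import static java.nio.file.StandardOpenOption.READ;
import static java.nio.file.StandardOpenOption.TRUNCATE_EXISTING;
import static java.nio.file.StandardOpenOption.WRITE;

/** Hypothetical sanity check: both merge strategies should yield identical output. */
public final class MergeCheck {

    /** Merges the sources into target using FileChannel#transferTo. */
    public static void mergeNio(List<Path> sources, Path target) throws IOException {
        try (FileChannel out = FileChannel.open(target, CREATE, WRITE, TRUNCATE_EXISTING)) {
            for (Path source : sources) {
                try (FileChannel in = FileChannel.open(source, READ)) {
                    // transferTo may transfer fewer bytes than requested, hence the loop.
                    for (long pos = 0, size = in.size(); pos < size; ) {
                        pos += in.transferTo(pos, size - pos, out);
                    }
                }
            }
        }
    }

    /** Merges the sources into target through a 16 KiB user-space buffer. */
    public static void mergeIo(List<Path> sources, Path target) throws IOException {
        byte[] buffer = new byte[16 * 1024];
        try (OutputStream out = Files.newOutputStream(target, CREATE, WRITE, TRUNCATE_EXISTING)) {
            for (Path source : sources) {
                try (InputStream in = Files.newInputStream(source)) {
                    int read;
                    while ((read = in.read(buffer)) >= 0) {
                        out.write(buffer, 0, read);
                    }
                }
            }
        }
    }

    /** Returns true when both strategies produce byte-identical files. */
    public static boolean sameOutput(List<Path> sources) throws IOException {
        Path nioTarget = Files.createTempFile("merge-nio", ".bin");
        Path ioTarget = Files.createTempFile("merge-io", ".bin");
        try {
            mergeNio(sources, nioTarget);
            mergeIo(sources, ioTarget);
            // Reads both files fully into memory: acceptable for small inputs only.
            return Arrays.equals(Files.readAllBytes(nioTarget), Files.readAllBytes(ioTarget));
        } finally {
            Files.deleteIfExists(nioTarget);
            Files.deleteIfExists(ioTarget);
        }
    }
}
```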

Benchmark results

For the jmh runs I used the Gradle plugin me.champeau.jmh with the following profilers configuration:

jmh {
    profilers = ["gc", "perf", "perfasm"]
}

NIO

Summary


Benchmark                                   Mode  Cnt       Score      Error   Units
FileBenchmark.mergeNIO                      avgt   25       3.733 ±  0.145       s/op
FileBenchmark.mergeNIO:·cpi                 avgt            0.872           clks/insn
FileBenchmark.mergeNIO:·gc.alloc.rate       avgt   25       0.002 ±  0.001     MB/sec
FileBenchmark.mergeNIO:·gc.alloc.rate.norm  avgt   25    7559.760 ± 44.878       B/op
FileBenchmark.mergeNIO:·gc.count            avgt   25         ≈ 0              counts
FileBenchmark.mergeNIO:·ipc                 avgt            1.147           insns/clk

Perf

Secondary result "org.example.FileBenchmark.mergeNIO:·perf":
Perf stats:
--------------------------------------------------

      56527,762660      task-clock (msec)         #    0,720 CPUs utilized          
             4.268      context-switches          #    0,076 K/sec                  
               378      cpu-migrations            #    0,007 K/sec                  
               385      page-faults               #    0,007 K/sec                  
    44.583.533.420      cycles                    #    0,789 GHz                      (30,73%)
    51.560.575.148      instructions              #    1,16  insn per cycle           (38,44%)
     9.135.306.805      branches                  #  161,607 M/sec                    (38,46%)
       110.707.654      branch-misses             #    1,21% of all branches          (38,45%)
    15.420.453.382      L1-dcache-loads           #  272,794 M/sec                    (24,72%)
     1.013.435.761      L1-dcache-load-misses     #    6,57% of all L1-dcache hits    (15,43%)
       283.399.957      LLC-loads                 #    5,013 M/sec                    (15,45%)
        98.132.015      LLC-load-misses           #   34,63% of all LL-cache hits     (23,11%)
   <not supported>      L1-icache-loads                                             
       303.670.304      L1-icache-load-misses                                         (30,80%)
    15.177.454.388      dTLB-loads                #  268,496 M/sec                    (29,93%)
         3.568.314      dTLB-load-misses          #    0,02% of all dTLB cache hits   (16,71%)
           453.513      iTLB-loads                #    0,008 M/sec                    (15,37%)
           195.070      iTLB-load-misses          #   43,01% of all iTLB cache hits   (23,03%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

      78,491809508 seconds time elapsed

PerfASM


....[Hottest Methods (after inlining)]..........................................................
  98.59%   [kernel.kallsyms]  [unknown] 
   0.19%           libjvm.so  ElfSymbolTable::lookup 
   0.14%                      <unknown> 
   0.12%        libc-2.27.so  vfprintf 
   0.07%        libc-2.27.so  _IO_fwrite 
   0.04%        libc-2.27.so  _IO_default_xsputn 
   0.03%           libjvm.so  outputStream::do_vsnprintf_and_write_with_automatic_buffer 
   0.03%  libpthread-2.27.so  __libc_write 
   0.03%           libjvm.so  xmlStream::write_text 
   0.03%           libjvm.so  defaultStream::hold 
   0.02%           libjvm.so  stringStream::write 
   0.02%           libjvm.so  outputStream::update_position 
   0.02%        libc-2.27.so  syscall 
   0.02%           libjvm.so  defaultStream::write 
   0.02%          ld-2.27.so  __tls_get_addr 
   0.02%           libjvm.so  fileStream::write 
   0.02%        libc-2.27.so  vsnprintf 
   0.02%           libjvm.so  RelocIterator::initialize 
   0.02%        libc-2.27.so  [unknown] 
   0.02%           libjvm.so  outputStream::print 
   0.55%  <...other 207 warm methods...>
................................................................................................
 100.00%  <totals>

IO

Summary

Benchmark                                   Mode  Cnt       Score      Error   Units
FileBenchmark.mergeIO                       avgt   25       5.028 ±  0.183       s/op
FileBenchmark.mergeIO:·cpi                  avgt            0.966           clks/insn
FileBenchmark.mergeIO:·gc.alloc.rate        avgt   25       0.029 ±  0.001     MB/sec
FileBenchmark.mergeIO:·gc.alloc.rate.norm   avgt   25  154191.947 ± 51.517       B/op
FileBenchmark.mergeIO:·gc.count             avgt   25         ≈ 0              counts
FileBenchmark.mergeIO:·ipc                  avgt            1.035           insns/clk

Perf

Secondary result "org.example.FileBenchmark.mergeIO:·perf":
Perf stats:
--------------------------------------------------

      72972,795879      task-clock (msec)         #    0,772 CPUs utilized          
             4.971      context-switches          #    0,068 K/sec                  
               546      cpu-migrations            #    0,007 K/sec                  
               911      page-faults               #    0,012 K/sec                  
    57.575.878.331      cycles                    #    0,789 GHz                      (30,73%)
    60.626.232.521      instructions              #    1,05  insn per cycle           (38,43%)
    10.920.368.865      branches                  #  149,650 M/sec                    (38,40%)
       129.224.774      branch-misses             #    1,18% of all branches          (38,44%)
    18.526.168.088      L1-dcache-loads           #  253,878 M/sec                    (25,30%)
     2.885.163.789      L1-dcache-load-misses     #   15,57% of all L1-dcache hits    (17,75%)
       306.897.902      LLC-loads                 #    4,206 M/sec                    (17,32%)
       103.565.396      LLC-load-misses           #   33,75% of all LL-cache hits     (23,08%)
   <not supported>      L1-icache-loads                                             
       438.074.938      L1-icache-load-misses                                         (30,77%)
    18.537.122.526      dTLB-loads                #  254,028 M/sec                    (26,65%)
         8.112.288      dTLB-load-misses          #    0,04% of all dTLB cache hits   (18,35%)
         2.404.472      iTLB-loads                #    0,033 M/sec                    (15,39%)
         8.051.716      iTLB-load-misses          #  334,86% of all iTLB cache hits   (23,05%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

      94,524422154 seconds time elapsed

PerfASM

....[Hottest Methods (after inlining)]..........................................................
  88.57%   [kernel.kallsyms]  [unknown] 
   6.70%        runtime stub  StubRoutines::jlong_disjoint_arraycopy 
   0.97%         c2, level 4  java.nio.channels.Channels::writeFully, version 2, compile id 697 
   0.92%         c2, level 4  sun.nio.ch.ChannelInputStream::read, version 2, compile id 706 
   0.24%    Unknown, level 0  sun.nio.ch.NativeThread::current, version 1, compile id 617 
   0.23%    Unknown, level 0  sun.nio.ch.FileDispatcherImpl::write0, version 1, compile id 656 
   0.22%         c2, level 4  org.example.FileBenchmark::mergeIO, version 4, compile id 725 
   0.19%                      <unknown> 
   0.18%  libpthread-2.27.so  __libc_write 
   0.18%  libpthread-2.27.so  __pthread_disable_asynccancel 
   0.15%  libpthread-2.27.so  __pthread_enable_asynccancel 
   0.13%           libjvm.so  ElfSymbolTable::lookup 
   0.13%  libpthread-2.27.so  __libc_read 
   0.11%    Unknown, level 0  sun.nio.ch.FileDispatcherImpl::read0, version 1, compile id 654 
   0.09%        libc-2.27.so  vfprintf 
   0.08%           libnio.so  fdval 
   0.06%           libnio.so  Java_sun_nio_ch_FileDispatcherImpl_write0 
   0.04%        libc-2.27.so  _IO_fwrite 
   0.03%        libc-2.27.so  _IO_default_xsputn 
   0.03%           libjvm.so  xmlStream::write_text 
   0.76%  <...other 201 warm methods...>
................................................................................................
 100.00%  <totals>

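Conclusion

Looking at the two summary reports, we can see that the NIO approach is faster (3.733 s/op vs 5.028 s/op) and that it needs almost no heap allocation compared with the IO approach (0.002 MB/sec vs 0.029 MB/sec).

The perf and perfASM reports help explain why NIO is faster: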
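To put the summary numbers in relative terms, here is a small arithmetic sketch. The input values are copied from the two summary tables; the class and method names (`ResultRatios`, `ratio`, `reductionPercent`) are hypothetical helpers introduced only for this illustration.

```java
import java.util.Locale;

/** Hypothetical helper: relative comparison of the reported jmh scores. */
public final class ResultRatios {

    /** How many times larger {@code a} is than {@code b}. */
    public static double ratio(double a, double b) {
        return a / b;
    }

    /** Relative reduction of {@code after} versus {@code before}, in percent. */
    public static double reductionPercent(double before, double after) {
        return (1.0 - after / before) * 100.0;
    }

    public static void main(String[] args) {
        double nioSecondsPerOp = 3.733;     // FileBenchmark.mergeNIO, s/op
        double ioSecondsPerOp = 5.028;      // FileBenchmark.mergeIO, s/op
        double nioBytesPerOp = 7559.760;    // mergeNIO gc.alloc.rate.norm, B/op
        double ioBytesPerOp = 154191.947;   // mergeIO gc.alloc.rate.norm, B/op

        System.out.printf(Locale.ROOT, "IO time / NIO time:    %.2fx%n",
                ratio(ioSecondsPerOp, nioSecondsPerOp));
        System.out.printf(Locale.ROOT, "Time saved by NIO:     %.1f%%%n",
                reductionPercent(ioSecondsPerOp, nioSecondsPerOp));
        System.out.printf(Locale.ROOT, "IO alloc / NIO alloc:  %.1fx%n",
                ratio(ioBytesPerOp, nioBytesPerOp));
    }
}
```

With the reported scores this works out to the IO version taking about 1.35 times as long per operation (NIO saves roughly 26% of the time) and allocating about 20 times as many bytes per operation.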

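  • It incurs fewer page faults (385 vs 911), since it issues no read() syscalls
  • It spends no time on array copy operations (StubRoutines::jlong_disjoint_arraycopy shows up only in the IO profile), which lives up to the zero-copy name 😄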
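A rough way to see how much work the buffered loop adds is to count its syscalls. The sketch below assumes that every read() fills the 16 KiB buffer except possibly the last one, which is typical for regular files but not guaranteed. The 1 GiB file size is a hypothetical input, since the post does not state the size of the benchmark files, and `SyscallEstimate` is a name made up for this illustration.

```java
/** Hypothetical estimate of the syscalls issued by a buffered read/write copy loop. */
public final class SyscallEstimate {

    /** Number of chunks needed to move fileSize bytes through a buffer of bufferSize bytes. */
    public static long chunks(long fileSize, int bufferSize) {
        return (fileSize + bufferSize - 1) / bufferSize; // ceiling division
    }

    /** read() calls: one per chunk plus the final call that reports end of file. */
    public static long readCalls(long fileSize, int bufferSize) {
        return chunks(fileSize, bufferSize) + 1;
    }

    /** write() calls: one per chunk. */
    public static long writeCalls(long fileSize, int bufferSize) {
        return chunks(fileSize, bufferSize);
    }

    /** Bytes copied across the user/kernel boundary: once on read, once on write. */
    public static long bytesCrossingBoundary(long fileSize) {
        return 2 * fileSize;
    }

    public static void main(String[] args) {
        long fileSize = 1L << 30;      // hypothetical 1 GiB input file
        int bufferSize = 16 * 1024;    // buffer size used by mergeIO
        System.out.println("read() calls:  " + readCalls(fileSize, bufferSize));
        System.out.println("write() calls: " + writeCalls(fileSize, bufferSize));
        System.out.println("bytes copied across the boundary: " + bytesCrossingBoundary(fileSize));
    }
}
```

Under these assumptions a single 1 GiB file costs 65537 read() calls and 65536 write() calls, and 2 GiB of data crosses the user/kernel boundary. The transferTo loop needs at least one call per file; the exact number depends on the JDK and the kernel.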