Forced disk writes and JDK support for them, mmap, zero-copy, and more

System level

System calls

fsync, fdatasync, sendfile, mmap

fsync, fdatasync

fsync, fdatasync - synchronize a file’s in-core state with storage device

fsync() transfers (“flushes”) all modified in-core data of (i.e., modified buffer cache pages
for) the file referred to by the file descriptor fd to the disk device (or other permanent
storage device) so that all changed information can be retrieved even if the system crashes or
is rebooted. This includes writing through or flushing a disk cache if present. The call
blocks until the device reports that the transfer has completed.

As well as flushing the file data, fsync() also flushes the metadata information associated
with the file (see inode(7)).

Calling fsync() does not necessarily ensure that the entry in the directory containing the file
has also reached disk. For that an explicit fsync() on a file descriptor for the directory is
also needed.
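On the JDK side, the directory fsync described above can be sketched as follows. This is a minimal sketch, not an official recipe: the class and method names (DirFsyncSketch, writeDurably) are made up for illustration, and it assumes that opening a directory read-only through FileChannel.open and calling force(true) on it works, which holds on Linux but may fail on other platforms (hence the best-effort catch).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class DirFsyncSketch {
    /** Writes a file, fsyncs it, then tries to fsync the parent directory entry. */
    static void writeDurably(Path file, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(true); // file data plus metadata
        }
        Path parent = file.toAbsolutePath().getParent();
        // Assumption: a directory can be opened read-only and forced (true on Linux).
        try (FileChannel dir = FileChannel.open(parent, StandardOpenOption.READ)) {
            dir.force(true);
        } catch (IOException e) {
            // Best effort in this sketch; real code must decide how to treat this failure.
            System.err.println("directory fsync not supported here: " + e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("dir-fsync");
        Path file = dir.resolve("data.txt");
        writeDurably(file, "hello\n".getBytes(StandardCharsets.UTF_8));
        System.out.println("wrote " + Files.size(file) + " bytes to " + file);
    }
}
```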

fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is
needed in order to allow a subsequent data retrieval to be correctly handled. For example,
changes to st_atime or st_mtime (respectively, time of last access and time of last modification;
see inode(7)) do not require flushing because they are not necessary for a subsequent
data read to be handled correctly. On the other hand, a change to the file size (st_size, as
made by say ftruncate(2)), would require a metadata flush.

sendfile

sendfile - transfer data between file descriptors

sendfile() copies data between one file descriptor and another. Because this copying is done
within the kernel, sendfile() is more efficient than the combination of read(2) and write(2),
which would require transferring data to and from user space.

mmap

mmap, munmap - map or unmap files or devices into memory

mmap() creates a new mapping in the virtual address space of the calling process. The starting address for the new mapping is specified in addr. The length argument specifies the length of the
mapping (which must be greater than 0).

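Page cache parameters

The page cache parameters belong to the vm (virtual memory) class, which also contains the swap, oom, and hugepages parameters, so here is the documentation for the vm parameters:

  • The official documentation

  • A Chinese translation with annotations; this author's blog is quite well written

The more important ones: dirty_background_ratio (default 10), dirty_background_bytes, dirty_bytes, and dirty_ratio (default 20). If your storage system relies on the page cache and monitoring shows high I/O, consider tuning these parameters (for example, lowering dirty_background_ratio) so that dirty pages are flushed more often.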
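To check what these parameters are currently set to on a given machine, their values can be read from /proc/sys/vm. The following is a minimal sketch; the class and method names (DirtyParams, readVm) are made up for illustration, and it falls back to "n/a" when a file cannot be read, for example on a non-Linux system.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class DirtyParams {
    static final String[] NAMES = {
        "dirty_background_ratio", "dirty_background_bytes", "dirty_ratio", "dirty_bytes"
    };

    /** Returns the current value of /proc/sys/vm/<name>, or "n/a" if it cannot be read. */
    static String readVm(String name) {
        Path p = Paths.get("/proc/sys/vm", name);
        try {
            return new String(Files.readAllBytes(p), StandardCharsets.US_ASCII).trim();
        } catch (IOException e) {
            return "n/a";
        }
    }

    public static void main(String[] args) {
        for (String name : NAMES) {
            System.out.println(name + " = " + readVm(name));
        }
    }
}
```

On Linux, the *_bytes variant of a pair reads as 0 while the corresponding *_ratio is the one in effect, and vice versa.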

pdflush

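Only old kernel versions (e.g. 2.6) have pdflush, flush, and related processes; newer ones (e.g. 5.4) do not.

Why the pdflush and flush processes can no longer be found; see also this post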

There are no more dedicated flush threads anymore.

The Linux kernel has moved on to a worker thread model, where several generic worker threads can handle a variety of different tasks. You will see these in the process list as [kworker/#.##]

Unfortunately this new design makes it a bit difficult to tell exactly what any given kworker thread is doing at any given time. But you can rest assured that dirty pages are still being written to disk by one or more of the kworker threads.

From version 3.10.0 onward, the bdi-default thread and the flush threads are all gone; everything became kworker, and work that needs flushing is simply put on the kworker queue.

static int __init default_bdi_init(void)
{
	int err;

	bdi_wq = alloc_workqueue("writeback", WQ_MEM_RECLAIM | WQ_UNBOUND |
				 WQ_SYSFS, 0);
	if (!bdi_wq)
		return -ENOMEM;

	err = bdi_init(&noop_backing_dev_info);

	return err;
}

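linux-5.4/mm/backing-dev.c, line 242

About buffer and cache

Buffers sit in front of block devices and bypass the file system.

Cache sits in front of the file system.

Running free with the -w option shows buffers and cache in separate columns.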

♨ free -w -m -h
              total        used        free      shared     buffers       cache   available
Mem:            15G        5.0G        5.4G        2.4G        334M        4.7G        7.7G
Swap:          2.0G          0B        2.0G

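FileChannel, FileDescriptor, MappedByteBuffer, etc. in the JDK

Creating instances

A MappedByteBuffer is produced by FileChannel's map method, which is backed by the mmap64 system call.

A FileChannel can be obtained from RandomAccessFile's getChannel method.

A FileDescriptor can be obtained from FileOutputStream's getFD method; FileOutputStream also has a getChannel method that returns a FileChannel instance.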
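The three ways of obtaining instances can be sketched together as follows. This is only an illustration: the class and method names (CreateInstances, mapAndTouch) and the temporary file are made up here.

```java
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

class CreateInstances {
    /** Maps the first `size` bytes of `path` read-write, writes one byte and reads it back. */
    static byte mapAndTouch(Path path, int size) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "rw");
             FileChannel channel = raf.getChannel()) {
            // Mapping READ_WRITE past the current end of file grows the file to `size` bytes.
            MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_WRITE, 0, size);
            mapped.put(0, (byte) 42);
            return mapped.get(0);
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("create-instances", ".bin");
        System.out.println("byte read through the mapping: " + mapAndTouch(path, 4096));

        try (FileOutputStream out = new FileOutputStream(path.toFile(), true)) {
            FileDescriptor fd = out.getFD();         // FileDescriptor from the stream
            FileChannel channel = out.getChannel();  // FileChannel from the same stream
            System.out.println("fd valid: " + fd.valid() + ", channel open: " + channel.isOpen());
        }
    }
}
```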

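How to force a flush to disk

FileChannel's force method can force data to disk. It is backed by fsync when called as force(true) and by fdatasync when called as force(false), which is what the strace output below shows. PS: the transferTo method corresponds to the sendfile system call.

FileDescriptor's sync method is backed by the fsync system call and can force data to disk.

MappedByteBuffer's force method (backed by msync) can force data to disk.

PS: FileOutputStream's flush method does nothing…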
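The three flush paths can be put side by side in a short sketch. The class and method names (ForceFlush, viaChannel, viaDescriptor, viaMapping) are made up for illustration; each method appends some bytes to a file and then forces them out through a different API.

```java
import static java.nio.file.StandardOpenOption.APPEND;
import static java.nio.file.StandardOpenOption.CREATE;
import static java.nio.file.StandardOpenOption.READ;
import static java.nio.file.StandardOpenOption.WRITE;

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ForceFlush {
    /** FileChannel.force: metaData=true also flushes file metadata, false flushes data only. */
    static void viaChannel(Path path, byte[] data, boolean metaData) throws IOException {
        try (FileChannel ch = FileChannel.open(path, CREATE, WRITE, APPEND)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(metaData);
        }
    }

    /** FileDescriptor.sync on the descriptor behind a FileOutputStream. */
    static void viaDescriptor(Path path, byte[] data) throws IOException {
        try (FileOutputStream out = new FileOutputStream(path.toFile(), true)) {
            out.write(data);
            out.getFD().sync();
        }
    }

    /** MappedByteBuffer.force on a mapping that covers the newly appended region. */
    static void viaMapping(Path path, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(path, CREATE, READ, WRITE)) {
            MappedByteBuffer mapped =
                    ch.map(FileChannel.MapMode.READ_WRITE, ch.size(), data.length);
            mapped.put(data);
            mapped.force();
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("force-flush", ".txt");
        viaChannel(path, "a\n".getBytes(StandardCharsets.UTF_8), false);
        viaDescriptor(path, "b\n".getBytes(StandardCharsets.UTF_8));
        viaMapping(path, "c\n".getBytes(StandardCharsets.UTF_8));
        System.out.print(new String(Files.readAllBytes(path), StandardCharsets.UTF_8));
    }
}
```

Running this under strace should make the difference between the three calls visible, in the same way as the traces in the next section.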

strace of force and transferTo

FileChannel's force method

strace -tt -p 16491
12:42:51.653360 lseek(4, 0, SEEK_CUR)   = 46480
12:42:51.653737 lseek(4, 0, SEEK_END)   = 46480
12:42:51.654074 lseek(4, 46480, SEEK_SET) = 46480
12:42:51.654366 lseek(4, 46480, SEEK_SET) = 46480
12:42:51.654583 write(4, "hello-Mon May 24 12:42:51 CST 20"..., 35) = 35
12:42:51.654981 futex(0x7f327000c578, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=12456, tv_nsec=317312815}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
12:42:52.655296 futex(0x7f327000c528, FUTEX_WAKE_PRIVATE, 1) = 0
12:42:52.655699 fdatasync(4)

Note: the ID passed to -p here is a thread ID, not a process ID.

Test code

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.util.Date;
import java.util.concurrent.TimeUnit;

public static void testSync() {
    // try-with-resources closes the file (and its channel) on exit
    try (RandomAccessFile file = new RandomAccessFile("/home/xx/temp-test.txt", "rw")) {
        FileChannel fileChannel = file.getChannel();
        while (true) {
            String text = "hello-" + new Date() + "\n";
            file.seek(file.length()); // move to the end of the file so the write appends
            ByteBuffer buffer = ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
            fileChannel.write(buffer);
            TimeUnit.SECONDS.sleep(1L); // separates write and flush in the strace output
            fileChannel.force(false);   // data only: shows up as fdatasync in strace
        }
    } catch (IOException e) {
        e.printStackTrace();
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // restore the interrupt flag and stop
    }
}

FileChannel's transferTo method

#... part of the output omitted
12:55:00.534128 lseek(3, 59662759, SEEK_SET) = 59662759
12:55:00.534258 read(3, "\312\376\272\276\0\0\0004\0\32\n\0\5\0\22\n\0\23\0\24\n\0\4\0\25\7\0\26\7\0\27\1"..., 386) = 386
12:55:00.534770 rt_sigaction(SIGRT_30, {sa_handler=0x7f7c5855fcc0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f7c7e3eb3c0}, {sa_handler=0x7f7c58776a20, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f7c7e3eb3c0}, 8) = 0
12:55:00.535202 fstat(4, {st_mode=S_IFREG|0664, st_size=53900, ...}) = 0
12:55:00.535460 lseek(3, 59661792, SEEK_SET) = 59661792
12:55:00.535827 read(3, "PK\3\4\n\0\0\10\0\0\317\275\321P\332^j*U\3\0\0U\3\0\0\31\0\0\0", 30) = 30
12:55:00.536186 lseek(3, 59661847, SEEK_SET) = 59661847
12:55:00.536464 read(3, "\312\376\272\276\0\0\0004\0)\n\0\t\0&\7\0'\5\377\377\377\377\377\377\377\376\5\377\377\377\377"..., 853) = 853
12:55:00.537106 sendfile(5, 4, [0] => [53900], 53900) = 53900

Test code

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

public static void testTransferto() {
    // try-with-resources closes both files (and their channels) on exit
    try (RandomAccessFile srcFile = new RandomAccessFile("/home/xx/temp-test.txt", "r");
         RandomAccessFile destFile = new RandomAccessFile("/home/xx/temp-test.txt1", "rw")) {
        FileChannel srcFileChannel = srcFile.getChannel();
        FileChannel destFileChannel = destFile.getChannel();
        while (true) {
            System.in.read(); // wait for input, which leaves time to attach strace
            // shows up as sendfile in the strace output
            srcFileChannel.transferTo(0, srcFile.length(), destFileChannel);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

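Choosing between FileChannel and MappedByteBuffer

Usage notes:

  1. With MMAP the size of the mapping must be specified in advance. In Java a single mapping can cover only about 1.5~2 GB of a file (FileChannel.map rejects sizes above Integer.MAX_VALUE); RocketMQ limits each file to 1 GB to avoid this problem
  2. MMAP can be flushed manually with force(), but getting this wrong can cause serious trouble
  3. Reclaiming MMAP: when a MappedByteBuffer is no longer needed, the virtual memory it occupies can be released manually, but doing so is very cumbersome

As far as I remember, RocketMQ does the release via reflection.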
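The size limit in note 1 is usually worked around by splitting a file into fixed-size segments and mapping each one separately. The sketch below shows only that idea and is not RocketMQ's code: the class and method names (SegmentedMapping, mapSegments, get) are made up, the mappings are read-only, and the tiny segment size in main is chosen just to keep the demo small.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

class SegmentedMapping {
    /** Maps the whole file read-only as consecutive segments of at most segmentSize bytes. */
    static List<MappedByteBuffer> mapSegments(FileChannel channel, int segmentSize)
            throws IOException {
        List<MappedByteBuffer> segments = new ArrayList<>();
        long fileSize = channel.size();
        for (long offset = 0; offset < fileSize; offset += segmentSize) {
            long length = Math.min(segmentSize, fileSize - offset);
            segments.add(channel.map(FileChannel.MapMode.READ_ONLY, offset, length));
        }
        return segments;
    }

    /** Reads the byte at an absolute file position by picking the segment that holds it. */
    static byte get(List<MappedByteBuffer> segments, int segmentSize, long position) {
        int index = (int) (position / segmentSize);
        int offsetInSegment = (int) (position % segmentSize);
        return segments.get(index).get(offsetInSegment);
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("segments", ".bin");
        Files.write(path, new byte[] {0, 1, 2, 3, 4, 5, 6, 7, 8, 9});
        int segmentSize = 4; // tiny for the demo; a real system would use something like 1 GB
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            List<MappedByteBuffer> segments = mapSegments(channel, segmentSize);
            System.out.println(segments.size() + " segments, byte at position 9 = "
                    + get(segments, segmentSize, 9));
        }
    }
}
```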

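How to choose:

In theory, FileChannel also writes into memory, but compared with MMAP it involves one extra copy between the kernel buffer and user space, so in extreme scenarios MMAP performs better.

MMAP is not a silver bullet for file I/O: it only performs slightly better than FileChannel when each write is very small. Many people hold that data below 4 KB should go through mmap and data above 4 KB through FileChannel.

The 4 KB figure still has to be verified by testing, since the right threshold differs from machine to machine.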
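A rough way to look for that threshold on a particular machine is to time both write paths at several record sizes. The sketch below is deliberately crude: the names (WriteBench, writeWithChannel, writeWithMmap) are made up, there is no warm-up or repetition, and nothing is forced to disk, so it mostly measures writes into the page cache. A proper harness such as JMH would give more trustworthy numbers.

```java
import static java.nio.file.StandardOpenOption.CREATE;
import static java.nio.file.StandardOpenOption.READ;
import static java.nio.file.StandardOpenOption.TRUNCATE_EXISTING;
import static java.nio.file.StandardOpenOption.WRITE;

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

class WriteBench {
    /** Writes `count` records of `recordSize` bytes through FileChannel.write; returns nanoseconds. */
    static long writeWithChannel(Path path, int recordSize, int count) throws IOException {
        ByteBuffer record = ByteBuffer.allocateDirect(recordSize);
        long start = System.nanoTime();
        try (FileChannel ch = FileChannel.open(path, CREATE, WRITE, TRUNCATE_EXISTING)) {
            for (int i = 0; i < count; i++) {
                record.clear(); // rewind so the full record is written again
                while (record.hasRemaining()) {
                    ch.write(record);
                }
            }
        }
        return System.nanoTime() - start;
    }

    /** Writes the same records through one MappedByteBuffer; returns nanoseconds. */
    static long writeWithMmap(Path path, int recordSize, int count) throws IOException {
        byte[] record = new byte[recordSize];
        long start = System.nanoTime();
        try (FileChannel ch = FileChannel.open(path, CREATE, READ, WRITE, TRUNCATE_EXISTING)) {
            MappedByteBuffer mapped =
                    ch.map(FileChannel.MapMode.READ_WRITE, 0, (long) recordSize * count);
            for (int i = 0; i < count; i++) {
                mapped.put(record);
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        int totalBytes = 16 * 1024 * 1024; // same volume for every record size
        for (int recordSize : new int[] {64, 512, 4096, 65536}) {
            int count = totalBytes / recordSize;
            Path a = Files.createTempFile("bench-channel", ".bin");
            Path b = Files.createTempFile("bench-mmap", ".bin");
            long channelNanos = writeWithChannel(a, recordSize, count);
            long mmapNanos = writeWithMmap(b, recordSize, count);
            System.out.printf("record=%d B  FileChannel=%d ms  mmap=%d ms%n",
                    recordSize, channelNanos / 1_000_000, mmapNanos / 1_000_000);
            Files.delete(a);
            Files.delete(b);
        }
    }
}
```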

Kafka writes its log with FileChannel and its index with MappedByteBuffer.

RocketMQ writes its log with MappedByteBuffer, reportedly to avoid a page cache lock issue.

MySQL manages its own buffer and calls fsync explicitly.

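About zero-copy

Java's FileChannel transferTo method is backed by the sendfile system call, and sendfile is zero-copy.

Zero-copy here does not mean there is no copying at all; it means no copying performed by the CPU. The data still has to be copied, but DMA (direct memory access) does it. For example, data on disk that must go out through the NIC cannot be sent without being copied.

Wikipedia explains it as follows: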

"Zero-copy" describes computer operations in which the CPU does not perform the task of copying data from one memory area to another. This is frequently used to save CPU cycles and memory bandwidth when transmitting a file over a network.

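DMA is carried out by a hardware DMA controller, which can access memory directly without involving the CPU, so it can copy data from disk to memory, from one region of memory to another, from memory to the NIC, and so on. In other words, DMA transfers run in three directions: peripheral to memory, memory to peripheral, and memory to memory.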
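Coming back to transferTo: its Javadoc says the call may transfer fewer bytes than requested, so code that needs the whole file copied usually calls it in a loop. A minimal sketch follows, with made-up names (TransferLoop, copy); whether sendfile is actually used underneath depends on the JDK and the operating system.

```java
import static java.nio.file.StandardOpenOption.CREATE;
import static java.nio.file.StandardOpenOption.READ;
import static java.nio.file.StandardOpenOption.TRUNCATE_EXISTING;
import static java.nio.file.StandardOpenOption.WRITE;

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TransferLoop {
    /** Copies src to dest with transferTo, repeating until every byte is transferred. */
    static long copy(Path src, Path dest) throws IOException {
        try (FileChannel in = FileChannel.open(src, READ);
             FileChannel out = FileChannel.open(dest, CREATE, WRITE, TRUNCATE_EXISTING)) {
            long size = in.size();
            long position = 0;
            while (position < size) {
                long transferred = in.transferTo(position, size - position, out);
                if (transferred <= 0) {
                    break; // no progress, e.g. the source shrank; avoid spinning forever
                }
                position += transferred;
            }
            return position;
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("transfer-src", ".txt");
        Path dest = Files.createTempFile("transfer-dest", ".txt");
        Files.write(src, "hello zero-copy\n".getBytes(StandardCharsets.UTF_8));
        System.out.println("copied " + copy(src, dest) + " bytes");
    }
}
```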

This article by Tencent is fairly detailed and worth consulting.