银杏叶书 (the "Ginkgo Leaf book"), chapter 5

Why use slab + buddy system

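The buddy system does well on large memory blocks, external fragmentation, and allocation speed, but for allocations smaller than page granularity it produces too much internal fragmentation.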

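Linux therefore introduced slab as a complement to the buddy system. The core idea is to divide a page into metadata plus a series of equal-sized slots, so that small allocations inside a slab are both fast and low in internal fragmentation.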
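To make the trade-off concrete, here is a minimal sketch that compares the bytes wasted when a small request is served by whole pages with the bytes wasted when it is served from an equal-sized slot. The 4 KiB page size, the 64-byte metadata area, and all function names are assumptions chosen for illustration, not values taken from the kernel.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def buddy_waste(request):
    """Bytes wasted when a request is served by a power-of-two number of whole pages."""
    pages = 1
    while pages * PAGE_SIZE < request:
        pages *= 2
    return pages * PAGE_SIZE - request


def slab_layout(slot_size, meta_size=64):
    """Split one page into metadata + equal slots; return (slot count, leftover bytes).

    meta_size is a hypothetical per-page metadata area used only for this sketch.
    """
    usable = PAGE_SIZE - meta_size
    slots = usable // slot_size
    return slots, usable - slots * slot_size


def slab_waste(request, slot_size):
    """Bytes wasted inside one slot when the request is rounded up to the slot size."""
    if request > slot_size:
        raise ValueError("request does not fit in this slot size")
    return slot_size - request


if __name__ == "__main__":
    print("buddy waste for 72 bytes:", buddy_waste(72))
    print("slab waste for 72 bytes in a 96-byte slot:", slab_waste(72, 96))
    print("96-byte slots per page (count, leftover):", slab_layout(96))
```

Under these assumptions a 72-byte object wastes almost a full page when allocated at page granularity, but only a few dozen bytes when it is placed in a 96-byte slot.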

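The Linux kernel includes three small-object allocators: slab, slub, and slob. slab is complex to maintain and has high overhead; slob has the lowest overhead but does poorly on metrics such as fragmentation, so it is used on embedded systems and the like; Linux defaults to slub as the compromise.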

This is also discussed at https://hammertux.github.io/slab-allocator:

SLOB Allocator: Was the original slab allocator as implemented in Solaris OS. Now used for embedded systems where memory is scarce, performs well when allocating very small chunks of memory. Based on the first-fit allocation algorithm.
SLAB Allocator: An improvement over the SLOB allocator, aims to be very “cache-friendly”.
SLUB Allocator: Has better execution time than the SLAB allocator by reducing the number of queues/chains used.

Slab in the Linux kernel

https://www.dingmos.com/index.php/archives/23/

https://freeflyingsheep.github.io/posts/kernel/memory/slab/

Inspecting the slab caches on the local machine

sudo cat /proc/slabinfo

# Dedicated caches
# ...
nsproxy             1344   1344     72   56    1 : tunables    0    0    0 : slabdata     24     24      0
vma_lock          366998 451350     40  102    1 : tunables    0    0    0 : slabdata   4425   4425      0
files_cache         1518   1518    704   46    8 : tunables    0    0    0 : slabdata     33     33      0
signal_cache        1732   1932   1152   28    8 : tunables    0    0    0 : slabdata     69     69      0
sighand_cache       1395   1395   2112   15    8 : tunables    0    0    0 : slabdata     93     93      0
task_struct         4897   5091  10496    3    8 : tunables    0    0    0 : slabdata   1697   1697      0
# Generic caches
# ...
dma-kmalloc-8k         0      0   8192    4    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-4k         0      0   4096    8    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-2k         0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-1k         0      0   1024   32    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-512        0      0    512   32    4 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-256        0      0    256   32    2 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-192        0      0    192   42    2 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-128        0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-96         0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-64         0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-32         0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-16         0      0     16  256    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-8          0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-8k            64     64   8192    4    8 : tunables    0    0    0 : slabdata     16     16      0
kmalloc-4k           569    584   4096    8    8 : tunables    0    0    0 : slabdata     73     73      0
kmalloc-2k           752    752   2048   16    8 : tunables    0    0    0 : slabdata     47     47      0
kmalloc-1k           704    704   1024   32    8 : tunables    0    0    0 : slabdata     22     22      0
kmalloc-512         2112   2112    512   32    4 : tunables    0    0    0 : slabdata     66     66      0
kmalloc-256          768    768    256   32    2 : tunables    0    0    0 : slabdata     24     24      0
kmalloc-192         1050   1050    192   42    2 : tunables    0    0    0 : slabdata     25     25      0
kmalloc-128         1376   1376    128   32    1 : tunables    0    0    0 : slabdata     43     43      0
kmalloc-96          1134   1134     96   42    1 : tunables    0    0    0 : slabdata     27     27      0
kmalloc-64          3072   3072     64   64    1 : tunables    0    0    0 : slabdata     48     48      0
kmalloc-32          3072   3072     32  128    1 : tunables    0    0    0 : slabdata     24     24      0
kmalloc-16          6144   6144     16  256    1 : tunables    0    0    0 : slabdata     24     24      0
kmalloc-8          12288  12288      8  512    1 : tunables    0    0    0 : slabdata     24     24      0
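The columns above can be turned into memory figures with a few lines of code. The sketch below parses one data line of the listing and derives the object utilization and the memory held by the cache's slabs. It assumes the column order shown in the `/proc/slabinfo` header (name, active_objs, num_objs, objsize, objperslab, pagesperslab, then the tunables and slabdata groups) and a 4 KiB page size; the function name and dictionary keys are made up for this example.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def parse_slabinfo_line(line):
    """Parse one data line of /proc/slabinfo into a dict of integers plus the name."""
    head, _, tail = line.partition(":")
    name, active_objs, num_objs, objsize, objperslab, pagesperslab = head.split()
    # The last group looks like "slabdata <active_slabs> <num_slabs> <sharedavail>".
    active_slabs, num_slabs, _sharedavail = tail.rpartition("slabdata")[2].split()
    info = {
        "name": name,
        "active_objs": int(active_objs),
        "num_objs": int(num_objs),
        "objsize": int(objsize),
        "objperslab": int(objperslab),
        "pagesperslab": int(pagesperslab),
        "active_slabs": int(active_slabs),
        "num_slabs": int(num_slabs),
    }
    # Derived figures: how full the cache is and how much memory its slabs occupy.
    info["utilization"] = (
        info["active_objs"] / info["num_objs"] if info["num_objs"] else 0.0
    )
    info["slab_bytes"] = info["num_slabs"] * info["pagesperslab"] * PAGE_SIZE
    return info


if __name__ == "__main__":
    sample = ("task_struct         4897   5091  10496    3    8 : tunables"
              "    0    0    0 : slabdata   1697   1697      0")
    row = parse_slabinfo_line(sample)
    print(row["name"], row["slab_bytes"], round(row["utilization"], 3))
```

For the `task_struct` line, for instance, 1697 slabs of 8 pages each hold 3 objects per slab, which matches the 5091 objects reported in the third column.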

The slab allocator provides two main classes of caches:

Dedicated: These are caches that are created in the kernel for commonly used objects (e.g., mm_struct, vm_area_struct, etc…). Structures allocated in this cache are initialised and when they are freed, they remain initialised so that the next allocation will be faster.

Generic (size-N and size-N(DMA)): These are general purpose caches, which in most cases are of sizes corresponding to powers of two.

In the SLUB allocator things are less complicated because it stops keeping lists (queues) of different types for each CPU, etc. The data structures used by SLUB are also less cluttered and complicated thanks to these adjustments. The only queue that the SLUB allocator manages is a linked list of the objects in each of the slab pages. The idea was to minimise TLB thrashing by associating a slab page with the CPU (known as the CPU slab) instead of a queue, so that we are only allocating objects within that page, meaning that we will be accessing the same TLB entry.
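The description above can be illustrated with a toy model: each slab page keeps a single linked list of its free slots, and each CPU allocates only from its own current slab. This is a simplified sketch, not the kernel's data structures; `SlabPage`, `ToyCache`, and their fields are hypothetical names, and getting a fresh slab is modeled by simply creating a new object where a real allocator would ask the buddy system for pages.

```python
class SlabPage:
    """One slab page whose free slots are chained in a singly linked list."""

    def __init__(self, nr_objects):
        # next_free[i] is the index of the free slot that follows slot i (None at the end).
        self.next_free = [i + 1 for i in range(nr_objects)]
        self.next_free[-1] = None
        self.freelist = 0  # index of the first free slot
        self.inuse = 0

    def alloc(self):
        """Pop the first free slot; return its index, or None if the slab is full."""
        if self.freelist is None:
            return None
        idx = self.freelist
        self.freelist = self.next_free[idx]
        self.inuse += 1
        return idx

    def free(self, idx):
        """Push a slot back onto the head of the free list."""
        self.next_free[idx] = self.freelist
        self.freelist = idx
        self.inuse -= 1


class ToyCache:
    """A cache with one current slab per CPU, so a CPU keeps touching the same page."""

    def __init__(self, nr_cpus, objects_per_slab):
        self.objects_per_slab = objects_per_slab
        self.cpu_slab = [SlabPage(objects_per_slab) for _ in range(nr_cpus)]

    def alloc(self, cpu):
        slab = self.cpu_slab[cpu]
        idx = slab.alloc()
        if idx is None:
            # The CPU slab is exhausted: switch this CPU to a fresh slab.
            slab = SlabPage(self.objects_per_slab)
            self.cpu_slab[cpu] = slab
            idx = slab.alloc()
        return slab, idx


if __name__ == "__main__":
    cache = ToyCache(nr_cpus=2, objects_per_slab=2)
    first_slab, first_idx = cache.alloc(cpu=0)
    second_slab, second_idx = cache.alloc(cpu=0)
    print(first_slab is second_slab, first_idx, second_idx)
```

Because allocations on one CPU stay inside one slab until it fills up, consecutive objects come from the same page, which is the locality argument made in the quoted paragraph.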

How to obtain more physical memory

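Paging (swap) mechanism
- Principle: when memory runs short, swap memory pages out to disk, update the PTE in the owning process's page table, and record the swapped-out location somewhere. When the page is accessed again, a page fault is triggered; the OS finds that the virtual page lies inside a valid virtual address region (a Linux VMA) and therefore knows it was swapped out to disk.
  - How to tell swap-out apart from demand paging? Use extra flag bits in the PTE.
  - Does swap-out start only when memory is used up? No. There are configurable watermarks: when free memory falls below the low watermark, Linux starts swapping out and stops once the high watermark is reached; if free memory falls below the min watermark, it reclaims in bulk immediately (this is also configurable).
  - How to get from a physical page to the PTEs of all processes using it: Linux reverse mapping // TODO
  - Optimization: the performance cost of paging can be reduced by prefetching, which both lowers the number of traps and allows batch processing that amortizes costs such as disk seek time.
- Page replacement policies
  - FIFO
  - Second Chance: FIFO plus one access bit per page. The bit is set to 1 when the page is accessed; if it is 1 when the page is about to be evicted, the bit is cleared and the page gets one "reprieve" by going back to the head of the queue.
  - LRU
  - Clock algorithm: approximates LRU, again with one access bit per page, but since no page is moved from the queue tail to the head, it is more efficient than Second Chance.
  - …
- Handling thrashing: the working set model. The pages accessed during the previous interval t are predicted to be the working set for the next interval t, and pages outside the working set should be swapped out first, instead of leaving every swap-in and swap-out decision to LRU's uncontrollable hand (determined by the LRU size). Working set tracking works by storing an additional timestamp for each page.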
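The clock algorithm from the list above can be sketched as follows. Frames sit in a fixed circular array, a hand sweeps over them, and a page whose access bit is set gets the bit cleared instead of being evicted. The class and method names are hypothetical, and setting the access bit when a page is first loaded is an assumption (some descriptions start it at 0).

```python
class ClockReplacer:
    """Clock page replacement over a fixed number of physical frames."""

    def __init__(self, nr_frames):
        self.frames = [None] * nr_frames   # page currently held by each frame
        self.ref = [False] * nr_frames     # access bit of each frame
        self.where = {}                    # page -> frame index
        self.hand = 0                      # next frame the hand will inspect

    def access(self, page):
        """Touch a page. Return the page evicted to make room, or None."""
        if page in self.where:
            # Hit: only the access bit changes, nothing is moved in the circle.
            self.ref[self.where[page]] = True
            return None

        # Miss: advance the hand until it finds an empty frame or a frame
        # whose access bit is clear, clearing set bits along the way.
        while self.frames[self.hand] is not None and self.ref[self.hand]:
            self.ref[self.hand] = False
            self.hand = (self.hand + 1) % len(self.frames)

        victim = self.frames[self.hand]
        if victim is not None:
            del self.where[victim]

        self.frames[self.hand] = page
        self.ref[self.hand] = True  # assumption: loading counts as an access
        self.where[page] = self.hand
        self.hand = (self.hand + 1) % len(self.frames)
        return victim


if __name__ == "__main__":
    replacer = ClockReplacer(nr_frames=3)
    for page in [1, 2, 3, 1, 4, 2, 5]:
        evicted = replacer.access(page)
        print("access", page, "evicted", evicted)
```

Unlike Second Chance on a queue, a hit never relinks anything here: only the bit and the hand position change, which is where the efficiency gain mentioned above comes from.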
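Using virtual memory to save physical resources
- Memory deduplication: the kernel scans periodically and merges identical physical pages (Linux KSM) (this may degrade performance).
- Memory compression: because the performance gap between memory and disk keeps widening, memory pages can, as an intermediate step before swapping out, be compressed and placed in a dedicated region (Linux zswap) and decompressed on access.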
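As an illustration of the deduplication idea, the sketch below maps virtual pages with identical contents to one shared frame. It is a simplification and not how KSM is implemented: it indexes pages by a hash and confirms each match with a full byte comparison, and it leaves out the copy-on-write handling a real system needs when a shared page is later written. All names are hypothetical.

```python
import hashlib


def merge_identical_pages(pages):
    """Deduplicate pages by content.

    pages: dict mapping a virtual page number to its contents (bytes).
    Returns (mapping, frames): mapping sends each virtual page number to a
    frame id, and frames lists the unique contents that must stay resident.
    """
    frames = []       # unique page contents, indexed by frame id
    by_digest = {}    # content digest -> frame ids with that digest
    mapping = {}

    for vpn, data in pages.items():
        digest = hashlib.sha256(data).digest()
        for frame_id in by_digest.get(digest, []):
            # Compare the full contents so a hash collision cannot merge
            # two different pages.
            if frames[frame_id] == data:
                mapping[vpn] = frame_id
                break
        else:
            frames.append(data)
            frame_id = len(frames) - 1
            by_digest.setdefault(digest, []).append(frame_id)
            mapping[vpn] = frame_id

    return mapping, frames


if __name__ == "__main__":
    pages = {0: bytes(4096), 1: b"a" * 4096, 2: bytes(4096)}
    mapping, frames = merge_identical_pages(pages)
    print(mapping, "frames kept:", len(frames))
```

The scan itself costs CPU time and memory bandwidth, which is one reason deduplication can hurt performance as noted above.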

Performance optimization

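Multiple slab allocators (one per CPU) reduce lock contention (TODO: how is consistency guaranteed?)
Cache coloring (Cache Coloring/Page Coloring): beyond the basic allocation of contiguous blocks, physical pages are allocated so that they occupy the CPU's different cache sets as evenly as possible. On a single core this mostly comes down to keeping the allocated memory contiguous, which the cache's set/tag design handles naturally; on multiple cores, how to optimize so as to reduce contention is an open question.
- Industry solutions: Intel CAT and ARMv8-A MPAM, which use a bitmap or register values to specify which parts of the cache a process may use
- ref: https://liulei-sys-inventor.github.io/files/Page Coloring的历史与发展.pdf // to be added
NUMA: let each CPU access nearby memory as much as possible. In modern CPU architectures the latency gap between near and far memory has widened and is no longer transparent to the OS; Linux, the libnuma library, and similar tools try to allocate memory on the local node whenever possible.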
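A rough sketch of page coloring under stated assumptions: the cache is physically indexed and set-associative, so the number of colors is the cache size divided by (associativity × page size), and the color of a page is its physical frame number modulo that count. The cache parameters in the example and the round-robin picker are illustrative choices, not a description of any particular kernel allocator.

```python
def num_colors(cache_size, ways, page_size=4096):
    """Number of page colors for a physically indexed set-associative cache."""
    return cache_size // (ways * page_size)


def page_color(pfn, colors):
    """Color of a physical frame: frames with equal color compete for the same sets."""
    return pfn % colors


def pick_frames(free_pfns, count, colors):
    """Pick up to `count` free frames, cycling over colors to spread cache usage."""
    buckets = [[] for _ in range(colors)]
    for pfn in free_pfns:
        buckets[page_color(pfn, colors)].append(pfn)

    picked = []
    color = 0
    empty_streak = 0  # consecutive colors that had no free frame left
    while len(picked) < count and empty_streak < colors:
        if buckets[color]:
            picked.append(buckets[color].pop(0))
            empty_streak = 0
        else:
            empty_streak += 1
        color = (color + 1) % colors
    return picked


if __name__ == "__main__":
    colors = num_colors(cache_size=2 * 1024 * 1024, ways=16)  # hypothetical 2 MiB, 16-way cache
    print("colors:", colors)
    print("picked:", pick_frames(range(64), 8, colors))
```

Restricting a process to a subset of colors is the software counterpart of the hardware partitioning listed above: both limit which part of the cache a process can occupy.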
