
Address Translation

Introduction

The address-translation (virtual-to-physical mapping) material has been split out into this separate post.

Excellent Video Resource

Digital Design & Computer Architecture - Lecture 26a: Virtual Memory II (ETH Zürich, Spring 2021)

Outstanding Blog or Overview Paper

Digital Design and Computer Architecture, ETH Zürich, Spring 2021

Elastic Cuckoo Page Tables: Rethinking Virtual Memory Translation for Parallelism3

Overview: motivation, design path, and research trends

stateDiagram-v2
  [*] --> Motivation
  state Motivation {
    %% nodes: names cannot be the same
    idea: virtual-to-physical space mapping
    concept: translation metadata
    target: availability and access speed = latency * number of accesses
    %% link
    idea --> concept : store and access
    concept --> target
    %% style
    class target badBadEvent
  }
  Motivation --> DesignedPath
  state DesignedPath {
    %% nodes
    struct: Page-based VM: page frame and page table (no more segments)
    path1: smaller metadata
    path2: caching to lower latency
    %% link
    [*] --> struct: 1960s(chatgpt)
    struct --> path1
    struct --> path2

    state path1 {
        %% node
        p1w1: multi-level PT
        p1w2: superpage (ISCA'92 [^19])
        p1w3: hash-based PT
        %% link
        p1w1 --> p1w2
        p1w2 --> p1w3
    }

    state path2 {
        %% node
        p2w1: TLB 1968[^15]
        p2w2: two level TLB (ISCA'92 [^17])
        p2w3: Shared L2 TLB (HPCA'11 [^16])
        p2w10: PWC (ISCA'10 [^26] [^27])
        %% link
        p2w1 --> p2w2
        p2w2 --> p2w3
        p2w3 --> p2w10
    }
  }
  DesignedPath --> HardwareTrend
  state HardwareTrend {
    %% nodes
    ht1: capacity gap between last-level caches and main memory is unlikely to shrink
    ht2: graph apps' memory accesses are increasingly sparse
    ht3: besides, the 4-level radix-tree PTW is costly

    ht1 --> ht2
    ht2 --> ht3: miss rate increases
  }
  HardwareTrend --> ResearchTrend :  Is VM still a good idea?(HPCA'10)
  state ResearchTrend {
    %% nodes
    trend1: **Self-Strengthening Movement / Conservatives** small changes to lubricate the system
    trend2: **Meiji Restoration / constitutional monarchy** key component replacement
    trend3: **Xinhai Revolution / Reformers** ab initio redesign

    %% link
    state trend1 {
        t1p1: TLBPrefetch(ISCA'02)
        t1p2: ISPASS'22 [^12]
        t1p3: MixTLBS(ASPLOS'17[^18])
        t1p4: ASPLOS'22 [^10]
        t1p5: UpTLBReach(MICRO'12/HPCA'14/ISCA'17)
        t1p6: Synergistic TLB(MICRO'10)
        t1p7: ContiguityAware TLBs(ISCA'19)
        t1p8: TLB Prefetchers(ASPLOS'10)
        t1p9: LargeMemTLB(ISCA'17)
        t1p10: OS Support(ASPLOS’19)
        t1p11: BigTLBsmallPage(ASPLOS’23)
        t1p12: Tailored Page Sizes(ISCA’20)

        state if_state1 <<choice>>
        [*] --> if_state1: Increase TLB/Cache reach/hit rate
        if_state1 --> t1p1: TLB/PWC
        if_state1 --> t1p2: Cache/Prefetch
        if_state1 --> t1p3: Superpage

        t1p1 --> t1p8
        t1p8 --> t1p6
        t1p6 --> t1p5
        t1p5 --> t1p9
        t1p9 --> t1p7
        t1p7 --> t1p11

        t1p2 --> t1p4

        t1p3 --> t1p10
        t1p10 --> t1p12

    }

    state trend2 {
        %% node
        t2p1: 2 level fat-PT(ASPLOS'22 [^10])
        t2p2: hash-table (SIGMETRICS’16[^2])
        t2p3: Cuckoo-hash(ASPLOS'20[^3])

        state if_state2 <<choice>>
        [*] --> if_state2: abandon 4-level radix tree/Alternative PT
        if_state2 --> t2p1: small change
        if_state2 --> t2p2: other structure
        t2p2 --> t2p3
    }

    state trend3 {
        %% node
        t3p1: LargeMemManager(ISCA'13[^28])
        t3p4: RMemMapping(ISCA'15[^22])
        t3p2: NDP-AT(PACT'17 [^14])
        t3p3: Utopia(MICRO'23[^1])
        t3p5: Mosaic Pages(ASPLOS'23)


        state if_state3 <<choice>>
        [*] --> if_state3: contiguous V-pages map to contiguous P-pages(1)
        if_state3 --> t3p1: back2segment
        if_state3 --> t3p2: restrict AT
        t3p1 --> t3p4
        t3p2 --> t3p5
        t3p5 --> t3p3
    }
  }
  ResearchTrend --> DifferentiatedTrack
  state DifferentiatedTrack {
       track1: PIM
       track2: Virtual Sys

       state track1 {
            tr1p1: NDP-AT(PACT'17 [^14])
       }

       state track2 {
            tr2p1: Virtualization(MICRO'14[^24])
            tr2p2: BabelFish(ISCA'20)
            tr2p3: PTEMagne/vmitosis(ASPLOS'21)

            tr2p1 --> tr2p2
            tr2p2 --> tr2p3
       }
  }
Notes on the diagram (page conflicts), explained in detail:
  1. Limiting associativity means that a page cannot reside anywhere in the physical memory but only in a fixed number of locations. For instance, direct-mapped VM maps each virtual page to a single page frame. Note that multiple virtual pages could map to the same physical frame, resulting in page conflicts. Increasing the associativity adds more flexibility to the page mapping and reduces conflicts.14
  2. Because page swapping is rare, contiguous virtual pages are often mapped to contiguous physical pages [71], [72].14 (Point 1 is illustrated by the sketch below.)
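The following minimal sketch (hypothetical parameters, not taken from any of the cited papers) illustrates note 1: with a direct-mapped virtual-to-physical mapping, the frame is fully determined by the low VPN bits, so two virtual pages that differ only in their high bits collide.

```c
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12            /* 4 KiB pages                      */
#define NUM_FRAMES (1u << 21)    /* 8 GiB of physical memory / 4 KiB */

/* Direct-mapped VM: the frame is fully determined by the low VPN bits. */
static uint64_t frame_of(uint64_t vaddr) {
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    return vpn % NUM_FRAMES;     /* only one candidate frame per page */
}

int main(void) {
    uint64_t a = 0x00007f0000001000ULL;
    uint64_t b = a + ((uint64_t)NUM_FRAMES << PAGE_SHIFT);  /* differs only in high bits */
    printf("frame(a)=%llu frame(b)=%llu -> %s\n",
           (unsigned long long)frame_of(a), (unsigned long long)frame_of(b),
           frame_of(a) == frame_of(b) ? "page conflict" : "no conflict");
    return 0;
}
```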

Papers to be explored

To reduce TLB misses, recent studies have proposed to optimize TLB organizations by clustering, coalescing, contiguity [14, 18, 21, 42, 54–56, 64, 72], prefetching [15, 41, 63], speculative TLBs [9], and large part-of-memory TLBs [47, 62]. To increase TLB reach, support for huge pages has been extensively studied [21, 26, 27, 29, 43, 49, 51–53, 57, 60, 67, 69], with OS-level improvements [26, 43, 51, 52]. Other works propose direct segments [10, 25] and devirtualized memory [31], and suggest that applications manage virtual memory [2].3

TODO: Related work in Utopia1

Virtual Addresses and Physical Addresses

Initial idea of virtual memory

Idea: Give the programmer the illusion of a large address space while having a small physical memory29

So that the programmer does not worry about managing physical memory

  • In a modern operating system, one important abstraction needed to run multiple programs conveniently is that "every program has its own address space, and the size of that space (determined by the virtual address width) is the same for all of them."
  • The address space a program sees is its virtual memory, and the addresses used to access it are virtual addresses (Virtual Address), as opposed to physical addresses (Physical Address).
  • With this design, different applications may access virtual addresses with identical values, so the operating system must maintain, on behalf of each program, the mapping from virtual addresses to physical addresses (the page table). Moreover, for uniformity and consistency, the kernel itself also accesses memory through virtual addresses.

Benefits of Automatic Management of Memory

  • Programmer does not deal with physical addresses
  • Each process has its own independent mapping metadata from virtual to physical addresses

Enables:29

  1. Code and data to be located anywhere in physical memory (relocation and flexible location of data)
  2. Isolation/separation of code and data of different processes in physical memory (protection and isolation)
  3. Code and data sharing between multiple processes (sharing)

VM makes Physical Memory a Cache

Physical memory is a cache for pages stored on disk(1). In fact, it is a fully-associative cache in modern systems (a virtual page can potentially be mapped to any physical frame)29

This works because caching is a mapping technique between equally sized objects, and the (virtual) page and the (physical) frame form exactly such a pair.

Similar caching issues exist as we have covered earlier:

  • Placement: where and how to place/find a page in cache?
  • Replacement: what page to remove to make room in cache?
  • Granularity of management: large, small, uniform pages?
  • Write policy: what do we do about writes? Write back?

  1. Analogous to how the L1 cache caches cache lines stored in DRAM; there the block size is always the cache-line size.
rCore V2P mapping

As shown in the figure below, this diagram depicts the virtual-to-physical address mapping of the (non-teaching version of) rCore. The kernel's data is placed in a high range of the virtual address space and mapped to a low physical range starting at 0x80200000, while all user programs are mapped to different physical regions through page tables maintained by the operating system.

Note that during compilation, once linking is done, the virtual addresses of all segments are fixed.

Key Observation: a fully associative VA-to-PA mapping is not necessary

The NAT paper14 shows that modern server workloads do not need a fully associative VM and can tolerate associativity ranging from direct-mapped(1) to 4-way associative.

Table II (left) shows the page conflict(2) rate as associativity varies. As shown, little associativity is enough to eliminate all page conflicts and match fully associative VM.

Table II (right) estimates the average memory access time (AMAT) increase due to page conflicts.

MY DEDUCTION: in other words, most applications' actively used virtual space is concentrated within 8GB * 2 = 16GB.

Table IV.

  • Total segments represents the total number of virtual segments.
  • 99% coverage indicates the number of virtual segments required to cover 99% of the physical address space.
  • Largest segment shows the fraction of the physical address space covered with the largest virtual segment.
  • Largest 32 segments shows the fraction of the physical space covered with the largest 32 segments.

First, for some applications, such as MySQL and Memcached, a single large segment covers most of the physical memory, and therefore a direct map would eliminate most TLB misses.

Third, although there could be hundreds of segments, the associativity requirements for 8GB dataset(Table II) indicate that associativity can be reduced to a small number. The reason is that although segments are not fully contiguous, the OS tends to cluster the segments (as shown in Figure 3), and therefore nearby segments do not conflict with each other.

  1. cache-like map to 8GB physical memory
  2. multiple virtual pages (differing in their high bits) could map to the same physical frame, resulting in page conflicts.
How to deal with page conflicts caused by restricted associativity?

GUESS: perhaps handle it like a hash collision. Utopia1 uses the conventional FlexSeg to ensure robustness.

Shared Virtual Memory for Heterogeneous Systems

Advantage:

  1. Unified virtual memory enables “pointer-is-a-pointer” semantics, thus avoiding explicit and expensive data copies.
  2. More importantly, it provides a flat address space that is familiar to common programmers,
  3. while enforcing the required protection mechanisms to prevent compromising the security of the system.14

Drawback:

  1. Translation latency can be as high as hundreds of nanoseconds.25

The Linux Virtual Memory System

segmentation

Look carefully at the layout: the kernel's virtual memory sits above the user stack.

Linux organizes virtual memory as a collection of areas (also called segments). An area is a contiguous chunk of existing (allocated) virtual memory whose pages are related in some way: the code segment, data segment, heap, shared-library segment, and user stack are all distinct areas. Every existing virtual page belongs to some area, and any virtual page that is not part of an area does not exist and cannot be referenced by the process.

virtual address space layout of a Linux process

Figure 3 shows the virtual address space layout of a Linux process, featuring six virtual segment groups:

  1. the read-only segments, which consist of the binaries (.text) and globally visible constants (.ro);
  2. the read-write segments, containing global variables (.rw and .bss);
  3. the heap segment, which holds dynamically allocated objects;
  4. the mmap segments, for objects allocated through the mmap syscall;
  5. the stack segment;
  6. and the kernel address space.

We assume only the dark-colored segments are visible to MPUs. The virtual address space that is not exposed to MPUs (e.g., the kernel space) still enjoys full associativity(1).

  1. Guess: because the kernel is accessed frequently.

The area struct for a specific area contains the following fields (a simplified sketch follows the list):

  • vm_start: points to the start of the area.
  • vm_end: points to the end of the area.
  • vm_prot: describes the read/write permissions of all pages contained in the area.
  • vm_flags: describes whether the pages in the area are shared with other processes or private to this process (among other information).
  • vm_next: points to the next area struct in the list.
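A minimal sketch in C of the area list just described (field names follow the list above; the real struct vm_area_struct in the kernel has many more fields and uses a tree rather than a plain linked list):

```c
#include <stdio.h>
#include <stddef.h>

/* Simplified area descriptor, mirroring the fields listed above. */
struct vm_area {
    unsigned long   vm_start;   /* first address of the area          */
    unsigned long   vm_end;     /* one past the last address          */
    unsigned long   vm_prot;    /* read/write/execute permissions     */
    unsigned long   vm_flags;   /* shared vs. private, etc.           */
    struct vm_area *vm_next;    /* next area in the per-process list  */
};

/* Linear scan the fault handler conceptually performs: is addr inside
 * any existing area?                                                  */
static struct vm_area *find_area(struct vm_area *head, unsigned long addr) {
    for (struct vm_area *a = head; a != NULL; a = a->vm_next)
        if (addr >= a->vm_start && addr < a->vm_end)
            return a;
    return NULL;                /* not in any area -> segmentation fault */
}

int main(void) {
    struct vm_area stack = { 0x7ffd0000, 0x7fff0000, 0, 0, NULL };
    struct vm_area text  = { 0x00400000, 0x0040a000, 0, 0, &stack };
    printf("0x00401000 %s\n", find_area(&text, 0x00401000) ? "mapped" : "segfault");
    printf("0x12345000 %s\n", find_area(&text, 0x12345000) ? "mapped" : "segfault");
    return 0;
}
```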
virtual address usage in Linux

The pmap command prints the virtual-address map of the process with the given PID; the data comes from the /proc/$pid/maps file.

$ pmap 1233735|head -n 10
1233735:   /staff/shaojiemike/github/sniper_PIMProf/lib/sniper -c /staff/shaojiemike/github/sniper_PIMProf/config/base.cfg --general/total_cores=1 --general/output_dir=/staff/shaojiemike/github/sniper-pim/log/default/mlp_pimprof_cpu_1 --config=/staff/shaojiemike/github/sniper_PIMProf/config/pimprof_cpu.cfg -g --general/magic=true -g --traceinput/stop_with_first_app=true -g --traceinput/restart_apps=false -g --traceinput/stop_with_first_app=false -g --traceinput/enabled=true -g --traceinput/emulate_syscalls=true -g --
0000000000400000     40K r---- sniper
000000000040a000   2184K r-x-- sniper
000000000062c000    972K r---- sniper
000000000071f000     24K r---- sniper
0000000000725000     12K rw--- sniper
0000000000728000    112K rw---   [ anon ]
00000000011a4000   3748K rw---   [ anon ]
00007f8d80000000  85380K rw---   [ anon ]

Code segment:

The 2184K region starting at 000000000040a000 has permission r-x--; this is the executable code (.text) segment.

Read-only data segments:

The 40K region starting at 0000000000400000 (r----) and the 972K region starting at 000000000062c000 (r----) are read-only mappings, likely the ELF headers and read-only data.

Segmentation faults and illegal accesses

Suppose the MMU triggers a page fault while trying to translate some virtual address A. The exception transfers control to the kernel's page-fault handler, which then performs the following steps:

  • Is virtual address A legal? In other words, does A lie within an area defined by some area struct? To answer this, the fault handler searches the list of area structs, comparing A with the vm_start and vm_end of each. If the access is not legal, the handler triggers a segmentation fault and terminates the process. This case is labeled "1" in Figure 9-28.
  • Because a process can create an arbitrary number of new virtual-memory areas (using the mmap function described in the next section), a sequential search of the area list can be very expensive. In practice, using fields we have not shown, Linux builds a tree over the areas and searches that tree instead.
  • Is the attempted memory access legal? In other words, does the process have permission to read, write, or execute the pages in this area? For example, was the fault caused by a store instruction trying to write to a read-only page in the code segment, or by a process running in user mode trying to read a word from kernel virtual memory? If the access is not legal, the handler triggers a protection exception and terminates the process. This case is labeled "2" in Figure 9-28.

Note that a segmentation fault is only triggered on a page fault: after declaring int A[10], accessing A[11] does not necessarily trigger a segmentation fault (see the sketch below).
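A tiny demonstration of that note (this is undefined behavior in C, shown only to illustrate page granularity): A[11] almost always lands in the same 4 KiB stack page as A, so the MMU sees a valid mapping and no SIGSEGV is raised, even though the access is out of bounds.

```c
#include <stdio.h>

int main(void) {
    int A[10];
    /* Out of bounds, but almost certainly still inside the same 4 KiB
     * stack page as A[0..9], so no page fault and no SIGSEGV occurs.
     * This is undefined behavior in C nonetheless.                    */
    A[11] = 42;                 /* do NOT rely on this in real code */
    printf("wrote A[11] without a segmentation fault: %d\n", A[11]);
    return 0;
}
```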

Memory Management Unit (MMU)'s functions

In CPU (Central Processing Unit) design, the Memory Management Unit (MMU) is a crucial component responsible for memory management and address translation. The MMU serves several key functions:

  1. Virtual Memory Address Translation: One of the primary roles of the MMU is to translate virtual memory addresses used by software into physical memory addresses. This allows for the implementation of virtual memory, a memory management technique that provides the illusion of a vast, contiguous address space to applications, even when physical memory is limited.

  2. Address Protection: The MMU enforces access control by preventing unauthorized access to memory locations. It ensures that a running program can only access memory it has permission to access. This helps maintain the security and integrity of the system.

  3. Page Tables: The MMU uses page tables to map virtual addresses to physical addresses. Page tables are data structures that store the mapping information. The MMU consults these tables to perform address translation.

  4. Address Space Isolation: The MMU provides address space isolation, ensuring that each running process or application has its own isolated memory space. This isolation prevents one process from inadvertently or maliciously accessing the memory of another process.

  5. Caching: The MMU may manage memory caching to improve memory access speeds. It can cache frequently used memory locations, reducing the need to access slower main memory. Caches are typically organized into levels, such as L1, L2, and L3 caches, and are an integral part of modern CPU design.

  6. Memory Protection: The MMU is responsible for implementing memory protection mechanisms, such as read-only, read-write, execute, and no-access permissions for different memory regions. It ensures that memory regions are used according to their intended purpose.

  7. TLB (Translation Lookaside Buffer): The MMU may include a TLB, which is a cache for frequently used address translations. The TLB accelerates the address translation process by storing recently used mappings. When a virtual address needs to be translated to a physical address, the TLB is checked first to see if the translation is already present, avoiding the need to consult the page table.

  8. Segmentation: Some MMUs also support memory segmentation, which divides the address space into segments or regions with different attributes. Each segment can have its own base address, limit, and access permissions. Segmentation is common in older CPU architectures like x86.

In summary, the Memory Management Unit in CPU design is responsible for managing memory address translation, protection, isolation, and caching, ensuring that the CPU can efficiently and securely interact with the system's memory subsystem while providing the illusion of a larger, contiguous virtual address space to applications through the concept of virtual memory.

Page Table Data Structure

  • Page table: a lookup table that stores virtual-to-physical page mappings, used to translate virtual page addresses to physical frame addresses (and to find where the associated data is)29
  • Page size: the mapping granularity of the virtual/physical address spaces; it dictates the amount of data transferred from disk to DRAM at once

  • The page table records, for each virtual page, the number of the physical frame where that page is cached in memory.

  • Every time the address-translation hardware converts a virtual address into a physical address, it reads the page table.

Basic organization of the page table

  • A page table is simply an array of page table entries (PTEs).
    • Each page in the virtual address space has a PTE at a fixed offset in the page table.
    • In simplified terms, each PTE consists of a valid bit and an n-bit address field.
      • The valid bit indicates whether the virtual page is currently cached in DRAM.
      • If the valid bit is set, the address field gives the starting address of the physical page in DRAM where the virtual page is cached.
      • If the valid bit is not set, a null address means the virtual page has not yet been allocated; otherwise the address points to the start of the virtual page on disk.
  • The figure below shows an example.

Feature: demand paging

  • Demand paging: as the figure below shows, only pages that actually reside in physical memory (cached pages) are associated with a physical frame.
  • On a miss, the system must determine where the virtual page is stored on disk, select a victim page in physical memory, and copy the virtual page from disk into DRAM, replacing the victim.
    • Uncached: allocated pages that are not cached in physical memory; they have a corresponding location on disk.
    • Unallocated: pages the VM system has not yet allocated (or created). Unallocated pages have no data associated with them and therefore occupy no disk space.

Feature: per-process page tables and independent virtual address spaces

  • The operating system provides each process with its own page table, and therefore its own independent virtual address space.
  • Note that multiple virtual pages can be mapped onto the same shared physical page.
  • This is what lets multiple programs use the same dynamic library (.so) file, e.g. printf.

Linux segmentation and paging

The figure below shows the simplified process by which a virtual address passes through the segmentation and then the paging mechanism to become a physical address. The segmentation mechanism is something Intel chips keep for compatibility with earlier products, and Linux largely neutralizes it. A brief introduction to segmentation first:

  • Segmentation isolates the code, data, and stack regions from one another; it divides the processor's addressable linear address space into smaller protected address-space regions called segments.
    • Each logical segment corresponds to a natural unit such as a function or an array, and its length can be adjusted dynamically as needed. Each logical segment has its own base address and length; when the program runs, segments are mapped onto a number of physical blocks in physical memory.
    • Benefits: this makes programs easier to manage and debug, mitigates memory fragmentation, and improves memory utilization.
  • Paging divides the linear address space (into which the segments have been mapped) into pages, and these linear-address pages are then mapped onto pages of the physical address space.
  • The difference: segmentation manages memory in terms of the program's logical units, whereas paging manages physical memory itself, dividing it into fixed-size blocks.

segmentation and paging in IntelSDM

Segmentation provides a mechanism of isolating individual code, data, and stack modules so that multiple programs (or tasks) can run on the same processor without interfering with one another. 4

Paging provides a mechanism for implementing a conventional demand-paged, virtual-memory system where sections of a program’s execution environment are mapped into physical memory as needed. Paging can also be used to provide isolation between multiple tasks.

80x86 example: segmentation + two-level page tables

Linear address:

  • The low 12 bits of the linear address give the offset within the page.
  • The high 20 bits of the linear address form the index into this array, selecting the physical base address of the corresponding page.
    • The page table therefore has to hold 2^20 (1M) entries.
      • Stored as a single flat table with 4 bytes per entry, it would occupy up to 4MB of memory.
      • To reduce the memory cost, the 80x86 uses a two-level page table: the translation of the high 20 linear-address bits into a physical address is split into two steps, each using 10 of those bits.
    • Each page table entry is 32 bits. Only 20 of them are needed for the physical base address of a page, so the remaining 12 bits hold attributes such as whether the page is present. If the entry indexed by the linear address is marked present, we take the physical base address from it; if it is not present, accessing the corresponding physical page raises an exception.

Two-level page tables:

The first-level table is called the page directory. It is stored in one page (4KB) and has 2^10 (1K) entries of 4 bytes each, which point to second-level tables; it is indexed by the top 10 bits of the linear address.

The second-level tables are called page tables and are also one page long. The top 10 bits of the linear address fetch the pointer to a second-level page table; the middle 10 bits then select, from that page table, the high 20 bits of the physical address.

As the figure below shows, the two-level page table has 2^20 (1M) entries and determines the page frame / page base address (which 4KB page within the 4GB space). The trailing 12-bit page offset then selects the exact location within that page. The bit slicing is sketched in code below.
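A sketch of the 10/10/12 bit slicing described above (the example linear address is arbitrary, and the actual walk is only indicated in comments):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t lin = 0x08048abc;                 /* example linear address   */
    uint32_t dir = (lin >> 22) & 0x3ff;        /* top 10 bits: page dir    */
    uint32_t tbl = (lin >> 12) & 0x3ff;        /* next 10 bits: page table */
    uint32_t off =  lin        & 0xfff;        /* low 12 bits: page offset */
    printf("dir=%u table=%u offset=0x%03x\n",
           (unsigned)dir, (unsigned)tbl, (unsigned)off);
    /* Walk: page_directory[dir] -> page_table; page_table[tbl] holds the
     * 20-bit frame base; physical = (frame_base << 12) | off.            */
    return 0;
}
```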

ISCA'13: direct-segment + page-based VM

As depicted in Figure 2, on each data memory reference, data virtual address V is presented to both the new direct-segment hardware and the D-TLB. If virtual address V falls within the contiguous virtual address range demarcated by the direct segment’s base and limit register values (i.e., BASE ≤ V < LIMIT), the new hardware provides the translated physical address as V + OFFSET and suppresses the D-TLB translation process. 28

Notably, addresses translated using direct segments never suffer from TLB misses.
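A sketch of the direct-segment check described above; BASE, LIMIT, and OFFSET stand for the three per-process registers from the paper, and the constants here are made up:

```c
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-process direct-segment registers (illustrative values only). */
#define DS_BASE   0x100000000ULL              /* 4 GiB  */
#define DS_LIMIT  0x180000000ULL              /* 6 GiB  */
#define DS_OFFSET 0x040000000ULL              /* +1 GiB */

/* If V falls in [BASE, LIMIT) the hardware returns V + OFFSET and
 * suppresses the D-TLB lookup; otherwise normal paging is used.    */
static bool direct_segment_translate(uint64_t v, uint64_t *pa) {
    if (v >= DS_BASE && v < DS_LIMIT) {
        *pa = v + DS_OFFSET;
        return true;
    }
    return false;
}

int main(void) {
    uint64_t pa;
    if (direct_segment_translate(0x100000040ULL, &pa))
        printf("direct segment hit: pa = 0x%llx\n", (unsigned long long)pa);
    if (!direct_segment_translate(0x080000000ULL, &pa))
        printf("miss: fall back to D-TLB / page walk\n");
    return 0;
}
```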

ISCA'15: co-design of direct-segment + page-based VM in parallel with TLB

22

Using the page table: address translation

  • Address translation: the process of determining the physical address from the virtual address
  • The figure below shows how the MMU uses the page table to implement this mapping.
    • A control register in the CPU, the Page Table Base Register (PTBR), points to the current page table.
    • An n-bit virtual address has two parts: a p-bit Virtual Page Offset (VPO) and an (n - p)-bit Virtual Page Number (VPN).
  • The MMU uses the VPN to select the appropriate PTE; for example, VPN 0 selects PTE 0, VPN 1 selects PTE 1, and so on. Concatenating the Physical Page Number (PPN) from the page table entry with the VPO from the virtual address yields the corresponding physical address.
  • Note that because physical and virtual pages are both P bytes, the Physical Page Offset (PPO) is identical to the VPO. A minimal code sketch of this translation follows.
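A minimal single-level translation sketch matching the VPN/VPO description above (4 KiB pages; the tiny page table and its contents are made up):

```c
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12                     /* p = 12, 4 KiB pages        */
#define NUM_VPAGES 8                      /* toy virtual address space  */

/* PTE: valid bit + physical page number (PPN). */
struct pte { uint8_t valid; uint32_t ppn; };

static struct pte page_table[NUM_VPAGES] = {
    [0] = {1, 0x2a}, [1] = {1, 0x13}, [3] = {1, 0x07},   /* others invalid */
};

int main(void) {
    uint32_t va  = (3u << PAGE_SHIFT) | 0x1c4;   /* VPN = 3, VPO = 0x1c4 */
    uint32_t vpn = va >> PAGE_SHIFT;
    uint32_t vpo = va & ((1u << PAGE_SHIFT) - 1);

    struct pte e = page_table[vpn];              /* MMU reads PTE[VPN]   */
    if (!e.valid) { printf("page fault\n"); return 1; }

    uint32_t pa = (e.ppn << PAGE_SHIFT) | vpo;   /* PPO == VPO           */
    printf("VA 0x%05x -> PA 0x%05x\n", (unsigned)va, (unsigned)pa);
    return 0;
}
```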

Page hit: hardware flow

  • Step 1: the processor generates a virtual address VA and sends it to the MMU.
  • Step 2: the MMU generates the physical address of the PTE and requests it from the cache/main memory. (So page-table contents can be cached too?)
  • Step 3: the cache/main memory returns the PTE to the MMU.
  • Step 4: the MMU constructs the physical address and sends it to the cache/main memory.
  • Step 5: the cache/main memory returns the requested data word to the processor.

Page fault: hardware flow

Unlike a page hit, which is handled entirely in hardware, handling a page fault requires the hardware and the OS kernel to cooperate, as shown below.

  • Steps 1 to 3: the same as for a page hit.
  • Step 4: the valid bit in the PTE is zero, so the MMU raises an exception, transferring control from the CPU to the page-fault handler in the OS kernel.
  • Step 5: the fault handler selects a victim page in physical memory; if that page has been modified, it writes it back to disk.
  • Step 6: the fault handler pages in the new page and updates the PTE in memory.
  • Step 7: the fault handler returns to the original process, re-executing the faulting instruction. The CPU resends the faulting virtual address to the MMU; because the virtual page is now cached in physical memory, the access hits.

Storing the page table: combining the cache hierarchy and virtual memory

  • Most systems choose physically addressed caches.
  • The page table must stay resident in physical memory (multi-level page tables excepted). Note that page table entries can be cached just like other data words, because address translation happens before the cache lookup, as shown below.

Speeding up address translation with a TLB

  • Every time the CPU generates a virtual address, the MMU must consult a PTE to translate it into a physical address.
    • In the worst case this requires an extra fetch from memory (more for multi-level page tables), at a cost of tens to hundreds of cycles.
    • If the PTE happens to be cached in L1, the cost drops to one or two cycles.
  • Nevertheless, many systems try to eliminate even this cost: they include a small cache of PTEs inside the MMU called the Translation Lookaside Buffer (TLB).
  • The TLB is a small, virtually addressed cache in which each line holds a block consisting of a single PTE. The TLB usually has a high degree of associativity.
    • As shown in the figure, the index and tag fields used for set selection and line matching are extracted from the virtual page number. If the TLB has 2^t sets, the TLB index (TLBI) consists of the t low-order bits of the VPN, and the TLB tag (TLBT) consists of the remaining bits of the VPN (see the sketch below).1
  • The figure below shows the steps involved when the TLB hits (the usual case). The key point is that all the address-translation steps are performed inside the on-chip MMU, and are therefore very fast.2
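A sketch of the TLBI/TLBT split described above, assuming a TLB with 16 sets (so t = 4) and 4 KiB pages; the parameters are illustrative only:

```c
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define TLB_SETS   16                 /* => t = 4 index bits            */

int main(void) {
    uint64_t va   = 0x00007f12345678ULL;
    uint64_t vpn  = va >> PAGE_SHIFT;            /* virtual page number  */
    uint64_t tlbi = vpn & (TLB_SETS - 1);        /* low t bits: set index */
    uint64_t tlbt = vpn >> 4;                    /* remaining VPN bits: tag */
    printf("VPN=0x%llx  TLBI=%llu  TLBT=0x%llx\n",
           (unsigned long long)vpn, (unsigned long long)tlbi,
           (unsigned long long)tlbt);
    return 0;
}
```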

VA vs. PA addressing

The TLB is addressed by virtual address, unlike the caches, which are addressed by physical address.

For an in-depth study of TLBs, see the 2017 survey A_Survey_of_Techniques_for_Architecting_TLBs.

VA/PA details of the TLB and caches

va pa

TLB support for multiple page sizes

  • Enabling Page Size Extension (PSE, by setting bit 4, PSE, of the control register CR4) changes this scheme.
  • The entries in the page directory have an additional flag, in bit 7, named PS (for page size). This flag was ignored without PSE, but now, the page-directory entry with PS set to 1 does not point to a page table, but to a single large 4 MiB page. The page-directory entry with PS set to 0 behaves as without PSE.

ASPLOS'17 MIX TLB

Commercial systems typically implement separate set-associative TLBs for different page sizes.18 This means that when superpages are allocated aggressively, TLB misses may, counterintuitively, increase even if entries for small pages remain unused (and vice versa).

MICRO'15 Large Pages and Lightweight Memory Management ……

TO DO21

Multi-level TLB structure

A TLB (Translation Lookaside Buffer) can be physically organized as multiple levels or as a single level, depending on the processor's architecture and design; Intel's Skylake, for example, clearly has two TLB levels.

Attention: L1 Virtually Indexed Cache

In the Skylake architecture, the DTLB sits after the L1 data cache, which indicates that Skylake's L1 data cache is virtually indexed (a Virtually Indexed Cache), while the L2 is physically addressed.

In a virtually indexed cache, the index comes from part of the virtual address rather than the physical address. The main reasons are:

  • Simpler and faster cache access: the virtual address can be used for indexing directly, without waiting for translation, avoiding the extra latency of first querying the TLB for the physical address.
  • Shorter critical path: cache access and address translation can proceed in parallel, shortening the critical-path latency.
  • Higher hit rate: since running programs use virtual addresses, virtual indexing exploits temporal locality and can improve the hit rate.
  • Simpler coherence: since the L1 cache is accessed by only one core, virtual indexing can simplify the cache-coherence protocol.

Virtual indexing has one problem, however: when the address mapping changes (for example, on a process context switch), cache lines may need to be flushed or written back. Overall, Skylake chose a virtually indexed L1 to gain an advantage in access speed.

TLB flushes on context switches

In conventional systems, whenever a virtual-to-physical mapping gets modified (e.g., due to a page migration or a page deallocation), all the affected TLB entries of all the running processes are invalidated to maintain TLBs coherent.1

How are the TLB and page tables handled on process and thread switches in a multicore system?
  • The TLB is private to each core; if a core switches from one process to another, the whole TLB must be flushed.
  • Thread switches do not require this, because threads share the same virtual address space.
  • Thread-switch overhead is therefore much smaller than process-switch overhead.
Does switching between threads of the same process on the same core require a full TLB flush?

Switching between threads of the same process on the same core usually does not require a full TLB flush.

The main reasons are:

  • Threads of the same process share the same address space and virtual-memory mappings, so the address space does not change.
  • Modern processors use address-space identifiers (ASIDs) to distinguish TLB entries belonging to different processes; for a thread switch within a process the ASID does not change, so the TLB entries remain reusable.
  • Intel processors maintain a PCID (process-context ID) per core; thread switches within a process do not change the PCID, so the TLB entries stay valid.
  • The OS typically manages the TLB in a "current CPU owns" fashion, so thread switches on the same core do not introduce contention.
  • Even if some TLB entries become partially invalid, the hardware marks them automatically; no manual flush is needed.
  • The cost of a thread switch is dominated by saving and restoring registers and the thread stack, with essentially no impact on the TLB.
  • So unless the address mappings change, a thread switch within the same process on the same core does not need to flush or rebuild the TLB; the entries can be reused directly, which is why the switch is cheap.

Only switches across processes or across cores require a TLB flush, to keep address translation consistent with the new address space.

If thread a of a process migrates from core A to core B, and core B was already running another thread c of the same process, do both cores' TLBs need to be fully flushed?

For the case where thread a of a process migrates from core A to core B while core B is already running another thread c of the same process, the analysis of whether both cores' TLBs must be fully flushed is as follows:

Since these are threads of the same process, the address space is identical, so the TLB contents are in principle reusable. However, different cores have different TLBs whose contents are not necessarily consistent. Intel processors use per-core PCIDs (Process Context IDentifiers) to mark which address space a TLB entry belongs to, and the PCIDs on the two cores may differ. To keep the TLBs consistent in this scenario, the OS needs to update core B's PCID to match core A's, which invalidates all of core B's TLB entries and forces them to be refilled; so for core B a full TLB flush is effectively required. Core A's TLB remains valid because the address space has not changed, so it needs no flush. Thus when thread a migrates, the source core A keeps its TLB while the destination core B must rebuild its TLB completely, because in a multicore system the TLBs of different cores may be in inconsistent states, and thread migration requires resynchronizing the TLB state on the destination core.

In summary, when thread a migrates across cores, core A's TLB can be reused while core B's TLB must be fully flushed to guarantee consistency; this situation is specific to multicore architectures. Avoiding the flush would require special hardware support so that different cores can recognize threads of the same process instead of simply clearing the TLB.

Regarding how the process is represented, the Linux kernel has the [corresponding source](https://livegrep.com/search/linux?q=switch_mm(struct%20mm_struct%20prev%2C%20struct%20mm_struct%20next%2C%20&fold_case=auto&regex=false&context=true)

Summary

Multi-level page tables

Motivation

  • So far we have assumed that the system uses a single page table for address translation.
  • With a 32-bit address space, 4KB pages (4*1K = a 12-bit PPO, leaving at most 20 bits of VPN = 1M, i.e., 1M page table entries), and 4-byte PTEs, a 4MB page table must always reside in memory, even if the application only touches a tiny fraction of its virtual address space. For 64-bit address spaces the problem becomes far worse.
  • The common way to compress the page table is a hierarchy of page tables.

A multi-level page table example

  • Assume the 32-bit virtual address space is divided into 4KB pages and each page table entry is 4 bytes.
  • Also assume that, at this moment, the virtual address space looks as follows: the first 2K pages are allocated to code and data, the next 6K pages are unallocated, the next 1023 pages are also unallocated, and the page after that is allocated to the user stack.
  • The figure below shows how to build a two-level page table hierarchy for this virtual address space.

  • Each PTE in the level-1 page table maps a 4MB chunk of the virtual address space and is responsible for one level-2 page table. Each PTE in a level-2 page table maps one 4KB page of virtual memory.

  • Every level-1 and level-2 page table is 4KB, exactly the size of one page.
  • Advantages:
    • First, if a PTE in the level-1 table is null, the corresponding level-2 table does not need to exist at all. For a typical program, most of the 4GB virtual address space is unallocated.
    • Second, only the level-1 table must always be resident in main memory, which reduces memory pressure; only the most frequently used level-2 tables need to be cached in main memory.

Multi-level page table address translation

Core i7 example

In this hierarchy the TLB is virtually addressed and 4-way set-associative; the L1, L2, and L3 caches are physically addressed with a 64-byte block size.

The CR3 control register points to the start of the first-level page table. The value of CR3 is part of each process's context and is restored on every context switch.

four levels of the page table

Such levels are known as PGD (Page Global Directory), PUD(Page Upper Directory), PMD (Page Middle Directory), and PTE (Page Table Entry)3

or, in Intel terminology, PML4 (Page Map Level 4), PDPT (Page Directory Pointer Table), PD (Page Directory), and PT (Page Table)16

48-Bit Virtual Addresses

While 32-bit machines can address a maximum of 4GB of virtual address space, 48-bit machines have the remarkable capability to access up to 256TB of virtual address space. This expanded address range is especially significant in scenarios where large amounts of memory need to be managed and accessed efficiently.

Page Table Space Requirements

  1. In a 64-bit machine, each Page Table Entry (PTE) is 64 bits, which is equivalent to 8 bytes.
  2. With a 4-level page table structure, each level consists of \(2^9 = 512\) entries, and therefore requires \(512 * 8\) bytes, which equals 4KB.
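A sketch of how a 48-bit virtual address is sliced for the 4-level walk (9 + 9 + 9 + 9 index bits plus a 12-bit page offset; level names follow the Linux convention above, and the example address is arbitrary):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t va  = 0x00007f8d80000abcULL;       /* example 48-bit VA   */
    uint64_t off =  va        & 0xfff;          /* bits 11..0          */
    uint64_t pte = (va >> 12) & 0x1ff;          /* bits 20..12 (PTE)   */
    uint64_t pmd = (va >> 21) & 0x1ff;          /* bits 29..21 (PMD)   */
    uint64_t pud = (va >> 30) & 0x1ff;          /* bits 38..30 (PUD)   */
    uint64_t pgd = (va >> 39) & 0x1ff;          /* bits 47..39 (PGD)   */
    printf("PGD=%llu PUD=%llu PMD=%llu PTE=%llu offset=0x%03llx\n",
           (unsigned long long)pgd, (unsigned long long)pud,
           (unsigned long long)pmd, (unsigned long long)pte,
           (unsigned long long)off);
    return 0;
}
```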

Managing a 2TB Dataset

Suppose we have a massive 2TB dataset resident in memory using a 4KB page size, i.e., 512 million pages. Here's how the page table space is allocated at each level:

  • The L4 Page Table has 512 million entries and occupies 4GB of memory.
  • The L3 Page Table contains 1 million entries and occupies 8MB.
  • The L2 Page Table has 2,000 entries, requiring 16KB of memory.
  • The L1 Page Table holds 4 entries and occupies just 32 bytes.

In total, the page table space required for managing this dataset sums up to \(4GB + 8MB + 16KB + 32B\), which is much larger than the total caching capacity of a modern high-end CPU. 1
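The sizes quoted above can be reproduced with a few lines of arithmetic (2TB of data, 4KB pages, 8-byte PTEs, 512 entries per page table node):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t data_bytes = 2ULL << 40;                 /* 2 TB dataset       */
    uint64_t entries    = data_bytes >> 12;           /* 4 KB pages: 512 M  */
    for (int level = 4; level >= 1; level--) {
        printf("L%d page table: %10llu entries, %11llu bytes\n",
               level, (unsigned long long)entries,
               (unsigned long long)(entries * 8));    /* 8-byte PTEs        */
        entries = (entries + 511) / 512;              /* 512 entries/node   */
    }
    return 0;
}
```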

Papers' Motivations

Experiment Object

Contiguous virtual memory regions map directly to contiguous physical memory.

Charles Thacker, 2010 ACM Turing Award Lecture.

“Virtual memory was invented in a time of scarcity. Is it still a good idea?”28

  1. The TLB and PWC cannot cover the ever-larger DRAM capacity.
  2. Contiguous virtual memory regions map directly to contiguous physical memory -> more direct-mapped designs2822
TLB miss rate for hash-table

Figure 2 compares the TLB miss rate for hash-table probes over a 32GB working set for 4KB and 2MB pages.

The figure indicates that even with large pages and a TLB of 1K entries, for every thousand instructions, there are 40 TLB misses, each requiring a page table walk.14

L2 TLB MPKI

Figure 3 shows the L2 TLB MPKI of the baseline system, as we increase the L2 TLB size from 1.5K entries up to 64K entries, for 11 memory-intensive workloads.1

We observe that the baseline 1.5K-entry L2 TLB suffers from high average MPKI, 39 on average and up to 77. Even using a drastically larger 64K-entry L2 TLB, the average MPKI remains high at 24 (and up to 54), resulting in frequent PTWs.

ideal system with perfect TLB

Figure 6 shows the execution time speedup of ECH and P-TLB(1) compared to Radix. 1

We observe that P-TLB outperforms Radix by 30% and ECH by 22%.

We conclude that there is room for further improving the performance of address translation.

  1. Every translation request hits in the L1 TLB.

High performance overheads

Page walks are costly: one TLB miss causes 4 memory references

Radix page tables as implemented in the x86-64 architecture incur a penalty of four memory references for address translation upon each TLB miss. These 4 references become 24 in virtualized setups(1), accounting for 5%–90% of the runtime2

2

  1. Hardware-assisted virtualization utilizes two layers of page tables, one for the host system and one for the guest virtual machine. The translation process of simultaneously walking these page tables is done in a “two-dimensional” (2D) manner, requiring 24 instead of 4 memory references.
MICRO'14: reduce 2D overhead using direct segment

24

average PTW latency

Figure 4 shows the average PTW latency (in processor cycles) for Radix and ECH.

We observe that Radix spends 137 cycles and ECH 86 cycles, on average, to complete the PTW.

Radix page tables lack parallelism

its sequential pointer-chasing operation misses an opportunity: it does not exploit the ample memory-level parallelism that current computing systems can support.3

High interference in memory hierarchy

translation-induced interference

Utopia presents an intriguing perspective: the presence of extensive translation metadata can disrupt the memory hierarchy, affecting CPU caches, interconnects, and main memory.

As discussed in the earlier section "Page hit: hardware flow", accessing a page table entry (PTE) requires a physical-memory access computed from CR3 plus the first VPN index. On a TLB miss, one address translation leads to four PTE accesses under the 4-level page table (PT) structure. Given that cache capacity is roughly 8MB, it is reasonable to speculate that the four PTE accesses miss in the cache about half of the time, doubling or even tripling the overhead relative to a single ordinary memory access. This highlights the significant impact of translation metadata on system performance and the need to address it effectively.

Even using the state-of-the-art hash-based PT design in a system that supports both 4KB and 2MB pages, a PTW takes an average of 86 cycles (up to 123) to complete, across 11 data-intensive workloads. 1

breakdown of the servicing location (DRAM, LLC, L2) of memory requests

Figure 5 demonstrates the breakdown of the servicing location (DRAM, LLC, L2) of memory requests to access the PT in both Radix and ECH, normalized to Radix. We make two key observations. 1

  1. First, an average of 43% of the PT requests are serviced from DRAM, in Radix. This is the key reason behind the long average PTW latency of Radix (137 cycles).
  2. Second, although ECH reduces the fraction of PT requests that hit in DRAM, it increases the total number of memory requests (to access the PT) by 62% on average compared to Radix.

L2 TLB is useless

According to the previous experiment, the L2 TLB seems almost useless.

fraction of cache blocks of two caches (L2, LLC) that store PT data

Figure 7 shows the fraction of cache blocks of two caches (L2, LLC) that store PT data (L1 typically does not store PT entries [66–68]), averaged across 500 epochs of 1M instructions, for Radix and ECH. 1

We observe that both Radix and ECH use significant fraction of cache capacity in the cache hierarchy. For example, Radix and ECH respectively use 33% and 57% of L2’s total capacity for PT entries.

The high usage of cache blocks for PT entries reduces the effective capacity of the cache hierarchy, which otherwise could have been used to store the data of (i) the running application and (ii) other applications running on the system if the LLC is shared.

reduction in DRAM row buffer conflicts

Figure 8 shows the reduction in DRAM row buffer conflicts provided by ECH and a perfect L1 TLB (P-TLB) compared to Radix. 1

We observe that 1. ECH increases DRAM row buffer conflicts by 50% due to the increase in memory requests sent to DRAM and 2. P-TLB decreases row buffer conflicts by 30% due to the reduced number of DRAM row activations for translation metadata.

We conclude that designing more compact and efficient translation structures (and thus ideally approaching a perfect TLB) can lead to a significant reduction in memory hierarchy interference.

increase data accesses to the swap space

Figure 9 shows the increase in the number of data accesses to the swap space in a system that uses only the restrictive mapping across the whole memory, similar to [49], compared to the baseline system. 1

We observe that employing a restrictive address mapping in the entire memory space causes a significant increase in swap space accesses, 2.2× on average, since a large number of virtual pages cannot be mapped inside physical memory and need to be stored into and fetched from the swap space.

Fetching data from the swap space is orders of magnitude slower than fetching data from DRAM, which leads to significant performance overheads

Utopia's motivation

Conventional hash-based PTs enable fast address translation, but at the cost of high translation-induced interference in the memory hierarchy.

Solution to reduce V2P overhead

superpage

How superpages work

To increase the reach of the TLB, the x86-64 architecture supports two large page sizes, 2MB and 1GB. When a large page is used, the page table walk is shortened. Specifically, a 2MB page translation is obtained from the PMD table(1), while a 1GB page translation is obtained from the PUD table.3

The advantages of superpages: 1. a single TLB entry maps a much larger amount(2) of virtual memory, which reduces the number of TLB misses; 2. they also make the page table walk slightly faster on a TLB miss;5 3. they reduce the probability of page faults(3) when memory is first accessed.9

The disadvantages: a process might not need all of that memory, so memory can be wasted,5 and page faults require larger clear-page/copy-page operations.9

A superpage also keeps the lower bits of the page-frame address reserved in the PTE.67

  1. The PMD entry points to the 2MB page location (2MB-aligned); adding the low-order offset bits yields the physical address (see the sketch below).
  2. by a factor of 512 for megapages, and 262144 for gigapages
  3. thus reducing the kernel enter/exit frequency by a factor of 512 or 262144
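A sketch of the shortened walk for a 2MB page described in note 1: once the PMD entry has the PS bit set, its 2MB-aligned frame base plus the low 21 bits of the virtual address gives the physical address directly. The entry value and flag layout here are simplified placeholders, not the exact x86-64 format.

```c
#include <stdio.h>
#include <stdint.h>

#define PS_BIT (1ULL << 7)                    /* "page size" flag in the PMD entry */

int main(void) {
    uint64_t va        = 0x00007f8d80212345ULL;
    uint64_t pmd_entry = 0x0000000180000000ULL | PS_BIT;   /* made-up entry */

    if (pmd_entry & PS_BIT) {                 /* large page: the walk stops here   */
        uint64_t frame  = pmd_entry & ~0x1fffffULL;        /* 2 MB-aligned base     */
        uint64_t offset = va & 0x1fffffULL;                /* low 21 bits of the VA */
        printf("PA = 0x%llx\n", (unsigned long long)(frame | offset));
    }
    return 0;
}
```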
How the TLB caches superpages

Explained in an earlier section.

Linux Transparent Hugepage Support

OSDI 2002

PWC

Page Walk Caches (PWCs)

PSCs (four in number for a five-level page table) store the recently accessed PTEs of intermediate page table levels. PTW searches all the PSCs concurrently after an STLB miss. In case of more than one hit, the farthest level is considered as it minimizes the page table walk latency.12

To alleviate the overhead of page table walks, the MMU of an x86-64 processor has small caches called Page Walk Caches (PWCs)(1). The PWCs store recently-accessed PGD, PUD, and PMD table entries (but not PTE entries). 3

On a TLB miss, before the hardware issues any request to the cache hierarchy, it checks the PWCs. It records the lowest level table at which it hits. Then, it generates an access to the cache hierarchy for the next lower level table.

  1. named as PSC in Intel SDM 4.10.3 Paging-Structure Caches
PWC in Intel

Page walks are cached in three ways:

  1. translation caches (TLBs),
  2. page table entries in the regular data cache hierarchy,
  3. and through Page Walker Caches (PWCs), such as Intel’s Paging Structure Cache.

The PWC allows walks to skip lookups for some levels of page table by matching the index bits of each level of the page table node with those cached by previous page walks.

Intel’s PWC is organized in three depths of translation caching: L4, L3 and L2.

  • An L4 PWC holds previous walk paths that share the top 9 bit virtual address, allowing the walker to skip accessing the L4 page table entry, and go directly to the L3 page table entry. As each L4 entry covers 512 GB of virtual address space, this means that accesses that stay within a 512 GB virtual address range will hit in the PWC and be able to skip the L4 lookup.
    • (Intel SDM version): PML4 can be accessed directly while skipping the PML5 table access if high-order 9 bits of the virtual address (VA[56:48]) required for the PML5 index are matched on a PML5-PSC (PSC holding the entries of the PML5 table).11
  • With an L2 PWC, a walk that matches all upper 27 bits of the virtual address will be able to skip the first three levels of the page table, and directly access the level 1 page table node. Such L2 PWC hits enable single-access translations (only a L1 entry access is required) for TLB misses within 2 MB regions of virtual address space.10
    • (Intel SDM version): In the best case, if VA[56:21] is matched on the PD-PSC (PSC holding the entries of the PD table), the 5-step page walk is reduced to just a single step.11

Page Walker Caches are excellent. Page walk caches (PWCs) already reduce the theoretical 4 memory system accesses per page walk to < 1.5 on average (max 2.5 on our random access benchmark), and from 24 to 4.4 for virtualized systems in paper Figure 1010

But the PWC still rarely performs effectively for workloads with irregular access patterns.10

Why design a PWC rather than a bigger TLB?

To be studied in 26.

Hash-based PT instead of radix tree

13

  • To resolve hash collisions, each entry in the hash table uses a linked list (collision chaining) or open addressing as the basic mechanism, or newer schemes2.
Hash collision handling: linked-list matching
  1. In this example, the logical address carries page number P3, which does not match the first element of the linked list, since that element holds page number P1.
  2. So we move on and check the next element; this element holds page number P3, so we then read the element's frame entry, which is fr5.
  3. We append the offset from the logical address to this frame number to obtain the page's physical address.
  • Advantage: Assuming that there is no hash collision, only one memory system access is needed for address translation.
  • Challenges but solved in paper: 3
    1. the loss of spatial locality in the accesses to the page table. This is caused by hashing, which scatters the page table entries of contiguous virtual pages.
    2. the need to associate a hash tag (e.g., the virtual page number) with each page table entry, which causes page table entries to consume more memory space.
    3. the need to handle hash collisions, which leads to more memory accesses, as the system walks collision chains
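A minimal collision-chaining hashed page table in C, following the lookup steps listed above (the hash function, table size, and inserted mapping are placeholders, and none of the paper's optimizations are modeled):

```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

struct hpte {                       /* one hashed-page-table entry       */
    uint64_t     vpn;               /* tag: virtual page number          */
    uint64_t     pfn;               /* physical frame number             */
    struct hpte *next;              /* collision chain                   */
};

#define BUCKETS 1024
static struct hpte *table[BUCKETS];

static uint64_t hash_vpn(uint64_t vpn) {        /* toy hash, not BLAKE   */
    return (vpn * 0x9e3779b97f4a7c15ULL) % BUCKETS;
}

static void insert(uint64_t vpn, uint64_t pfn) {
    struct hpte *e = malloc(sizeof *e);
    e->vpn = vpn; e->pfn = pfn;
    e->next = table[hash_vpn(vpn)];
    table[hash_vpn(vpn)] = e;
}

/* Walk the chain until the tag matches, as in steps 1-3 above. */
static int lookup(uint64_t vpn, uint64_t *pfn) {
    for (struct hpte *e = table[hash_vpn(vpn)]; e; e = e->next)
        if (e->vpn == vpn) { *pfn = e->pfn; return 1; }
    return 0;                                   /* miss -> page fault    */
}

int main(void) {
    insert(0x7f8d80000, 0x1234);
    uint64_t pfn, va = 0x7f8d80000ULL << 12 | 0xabc;
    if (lookup(va >> 12, &pfn))
        printf("PA = 0x%llx\n", (unsigned long long)((pfn << 12) | (va & 0xfff)));
    return 0;
}
```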
Research 2016: Page Table Entry Clustering & Compaction

Yaniv and Tsafrir 2 recently show that the first two limitations are addressable by careful design of page table entries. Specifically, they use Page Table Entry Clustering, where multiple contiguous page table entries are placed together in a single hash table entry that has a size equal to a cache line.

Further, they propose Page Table Entry Compaction(1), where unused upper bits of multiple contiguous page table entries are re-purposed to store the hash tag.

  1. We propose a new technique that further improves hashed page tables by compacting eight adjacent PTEs in a single 64-byte cache line, resulting in the spatial locality of hashed page tables similar to that of the x86-64 radix page tables. The clustered page tables, as were previously defined, cannot pack eight PTEs and a tag in a single cache line, since PTEs are 8 bytes long. But we can exploit the unused topmost bits of each PTE and store the tag in this unused space.

Why not use a single global hash table?

A single global hash table that includes page table entries from all the active processes3

  • Advantages: 1) the hash table is allocated only once, and 2) the table can be sized appropriately(1)
  • Drawbacks: 1) neither multiple page sizes (e.g., huge pages) nor page sharing between processes can be supported without additional complexity; 2) when a process is killed, the system needs to perform a linear scan of the entire hash table to find and delete the associated page table entries, and deleting an entry may itself be costly.
  1. to minimize the need for dynamic table resizing, which is very time consuming.
Motivation: high hash collision probability

Figure 2 shows the probability of random numbers mapping to the same hash table entry.(1)3

For the baseline table, we see that only 35% of the entries in the hash table have no collision (i.e., the number of colliding entries is 1). Even for the over-provisioned table (1.5x the size), only half of the entries in the table have no collision.

  1. We evaluate the following scenario: (1) the table has as many entries as the sum of all the translations required by all the applications, and (2) the hash function is the computationally expensive BLAKE cryptographic function [5], which minimizes the probability of collisions.
Resizing research to reduce hash collisions but still costly
  1. Resizing hashed PTs: hash table implementations set an occupancy threshold that, when reached, triggers resizing of the table.3
  2. Gradual rehashing: maintain both the old and the new hash tables and gradually move the entries over.

Elastic Cuckoo Hashing

Elastic cuckoo hashing is a novel algorithm for cost-effective gradual resizing of d-ary cuckoo hash tables.

Key idea 1: target moved region

Operations: Rehash

Operations: Lookup (parallelism)

Operations: insert(1)

Key idea 2: the resize threshold works like dynamic array (vector) capacity growth, expanding by a factor of k

  1. We use x ↔ y to denote the swapping of the values x and y, and use to denote an empty value.
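A sketch of the parallel d-ary cuckoo lookup idea (d = 2 ways here, each with its own hash function; the real elastic cuckoo design adds per-way resizing, rehashing pointers, and CWTs, none of which are modeled):

```c
#include <stdio.h>
#include <stdint.h>

#define WAYS  2
#define SLOTS 1024

struct entry { uint64_t vpn; uint64_t pfn; int valid; };
static struct entry way[WAYS][SLOTS];

/* One independent hash function per way (toy multiplicative hashes). */
static uint64_t hash_way(int w, uint64_t vpn) {
    const uint64_t mult[WAYS] = { 0x9e3779b97f4a7c15ULL, 0xc2b2ae3d27d4eb4fULL };
    return (vpn * mult[w]) % SLOTS;
}

/* Cuckoo lookup: probe all d candidate slots; hardware can issue these
 * memory accesses in parallel, unlike a radix pointer chase.           */
static int lookup(uint64_t vpn, uint64_t *pfn) {
    for (int w = 0; w < WAYS; w++) {
        struct entry *e = &way[w][hash_way(w, vpn)];
        if (e->valid && e->vpn == vpn) { *pfn = e->pfn; return 1; }
    }
    return 0;
}

int main(void) {
    uint64_t vpn = 0x7f8d80123ULL, pfn;
    way[1][hash_way(1, vpn)] = (struct entry){ vpn, 0x55, 1 };  /* placed in way 1 */
    printf("%s\n", lookup(vpn, &pfn) ? "hit" : "miss");
    return 0;
}
```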
Multiple hash tables, one per process and per page size, walked in parallel

A Cuckoo Walk is the procedure of finding the correct translation in elastic cuckoo page tables.

Speeding up accesses with CWTs and CWCs

Cuckoo Walk Tables (CWTs): software tables that record which way of which elastic cuckoo page table should be accessed to obtain the desired translation entry, reducing the number of lookups required.

The MMU's Cuckoo Walk Caches (CWCs) cache CWT entries so that they can be accessed with low latency; these caches replace the page walk caches of radix page tables.

Clustered Page Table

The clustered page tables are similar to hashed page tables except that each entry in the hash table refers to many pages rather than one single page (as in a hashed page table). Hence, a single entry of a clustered page table can store the mappings for multiple physical page frames. 13

Clustered page tables are useful for sparse address spaces, where memory references are scattered throughout the address space (non-contiguous).

Inverted Page Table

Motivation: per-process page tables consume a lot of memory

Under normal paging, every process maintains its own page table, containing entries for all of the process's pages; a large process may have a page table with millions of entries, consuming a large amount of memory. If six processes are executing, each will have some of its pages in main memory, which forces their six page tables to reside in main memory as well, consuming a lot of space. This is the drawback of per-process paging.13

The inverted page table is a solution to this waste of memory. Related advantages are simplified page swapping and improved cache performance.

  • Inverted Page Table (IPT) is a global structure. Only a fixed portion of memory is required to store the paging information of all the processes together.

Conventional IPT

IPT with hash-table for faster lookup
How the IPT works

The logical address consists of three fields: the process ID(1)/ASID(2), the page number, and the offset.

The table is searched for a matching (process ID, page number) pair; if the match is found at the i-th entry of the page table, then i and the offset together form the physical address of the requested page. The number i is the physical frame number.

  1. An inverted page table contains the address space information of all the processes in execution. Since two different processes can have a similar set of virtual addresses, it becomes necessary to store each process's process ID to identify its address space uniquely in the Inverted Page Table.
  2. 12 bits address-space identifier (ASID) identifies the process currently active on the CPU. This is used to extend the virtual address when accessing the TLB.
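A sketch of the basic (non-hashed) inverted page table search described above: one entry per physical frame, tagged with (ASID, VPN), and the index of the matching entry is the frame number. Sizes and values are illustrative.

```c
#include <stdio.h>
#include <stdint.h>

#define NUM_FRAMES 8                /* one IPT entry per physical frame */

struct ipt_entry { uint16_t asid; uint64_t vpn; int valid; };
static struct ipt_entry ipt[NUM_FRAMES];

/* Linear search: the matching index i *is* the physical frame number. */
static int ipt_lookup(uint16_t asid, uint64_t vpn, uint64_t *frame) {
    for (int i = 0; i < NUM_FRAMES; i++)
        if (ipt[i].valid && ipt[i].asid == asid && ipt[i].vpn == vpn) {
            *frame = (uint64_t)i;
            return 1;
        }
    return 0;                       /* miss -> page fault               */
}

int main(void) {
    ipt[5] = (struct ipt_entry){ .asid = 3, .vpn = 0x40a, .valid = 1 };
    uint64_t frame, va = (0x40aULL << 12) | 0x123;
    if (ipt_lookup(3, va >> 12, &frame))
        printf("PA = 0x%llx\n", (unsigned long long)((frame << 12) | (va & 0xfff)));
    return 0;
}
```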
Hashed IPT performance

The rate of hash table collisions(1) is mainly determined by the ratio of occupied slots to the total number of slots, also called the load factor2

Mod: a conventional modulo hash14

Fold: a stronger k-bit XOR folding

So four PTE slots are allocated for each 4KB page, because of the 1/4 load factor.

  1. Open addressing is used for resolving conflicts: accessing the next slot exploits locality in the DRAM row buffer.
Why it is named "inverted"
  1. Indexing is by frame number(1) instead of by logical page number(2) (hence "inverted").
  2. In terms of translation metadata size, an IPT is proportional to physical memory size, whereas a conventional radix PT grows with virtual address space size times the number of processes.
  1. The i-th entry corresponds to the i-th physical page in memory.
  2. The i-th entry corresponds to the i-th allocated virtual page (?).
IPT vs. per-process page table (SPT)

| | IPT | SPT |
| --- | --- | --- |
| Ownership | one IPT shared by all processes | each process has its own |
| Entries | one PTE per physical memory frame | each PTE points to a (possibly shared) memory frame |
| Memory cost | does not waste memory | wastes memory with many processes |
| Hash table for faster lookup | yes | no |

Can the Inverted Page System suit all memory systems?

Note: Number of Entries in Inverted page table = Number of frames in Physical Address Space(PAS).

  • Inverted Page Table is best suited to systems having limited physical memory but a large virtual address space.
  • Page tables are more efficient for managing large virtual address spaces and provide better protection and control over memory access.

Disadvantages

  1. Low search speed: the lookup is performed using a logical (virtual) address, and in the worst case the entire table must be searched to find a match.
  2. Difficult shared-memory implementation: since the inverted page table stores a single entry per frame, it is hard to implement shared memory; chaining techniques are used to map more than one virtual address to the entry specified by a frame number.
PACT'17: DIPTA (Distributed Inverted Page Table)

DIPTA restricts associativity so that a page can only reside in a small number of physical locations that are physically adjacent, i.e., in the same memory chip and DRAM row.14

Ideas:

  1. Put the metadata, the MMU, and the data closer together, so that translation and data fetch overlap more.
  2. The more tightly the metadata is restricted to sit close to the data, the deeper the MMU can be placed and the more time is saved.(1)
  3. Leveraging cache structure for efficiency: cache-like designs reap the benefit of restricting address translation. Two key techniques help make caches fast and efficient (see the sketch below):
    1. Grouping n ways of metadata together into sets.
    2. Direct mapping onto sets: from the perspective of placement, the cache acts like direct-mapped storage, i.e., there is a predictable mapping between a memory address and the set that will hold it. This eliminates complex logic to search the entire cache, streamlining placement and lookup.
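A sketch of the set-restriction idea under made-up parameters (4 ways, 4 KiB pages, 8 GiB of physical memory): the set, and hence the handful of candidate frames, is computed directly from the VPN, so only that set (one DRAM row in DIPTA) ever needs to be searched. None of DIPTA's way prediction or data layout changes are modeled.

```c
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define WAYS       4                          /* associativity                      */
#define SETS       (1u << 19)                 /* 4 ways * 2^19 sets * 4 KiB = 8 GiB */

int main(void) {
    uint64_t va  = 0x00007f8d80000abcULL;
    uint64_t vpn = va >> PAGE_SHIFT;
    uint64_t set = vpn % SETS;                /* set index derived from the VPN     */
    printf("VPN 0x%llx -> set %llu, candidate frames %llu..%llu\n",
           (unsigned long long)vpn, (unsigned long long)set,
           (unsigned long long)(set * WAYS),
           (unsigned long long)(set * WAYS + WAYS - 1));
    return 0;
}
```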

**Overview**

  1. General idea
  2. SRAM Design
  3. In-DRAM Design

Detail 1: speed up by limiting the multi-way check to a single DRAM row access

Change the page layout(2) so that the cache-like j-way, k-set-associative DIPTA only needs to search the ways within one DRAM row(3), reducing lookup latency.

Drawbacks:

  1. The NAT paper, in both its PACT'17 version and the first author's PhD thesis, is hard to read and lacks explanatory figures; the few diagrams it does have are inconsistent with the text.
  2. There is no further discussion of cache-like eviction from DRAM when ways conflict or memory is full.
  3. There is no further discussion of how conventional address-translation components such as the TLB are organized.
  4. No evidence is provided for the effectiveness of individual design components, e.g., the way predictor, nor is there a critical-path analysis.
  1. This is very similar to how a VIPT L1 cache overlaps V2P translation with cache indexing.
  2. A j*j matrix transpose: each page is divided into j parts.
  3. The target DRAM row is determined by the highest-order bits of the part of the virtual address used to identify the DRAM column (i.e., the page offset).
MICRO'23 Utopia

Utopia completes the overall design and uses the conventional FlexSeg to handle way conflicts.

But drawbacks still remain:

  1. The basic design is self-consistent but may not be efficient, especially during migration from the FlexSeg to the RestSeg.
    1. e.g., a translation might fall into a ping-pong loop of being evicted from the RestSeg and then migrated back into it. If there is no free space in the corresponding set of the RestSeg, the OS performs (i) the migration of the costly-to-translate page from the FlexSeg to the RestSeg and (ii) the migration of the evicted page from the RestSeg to the FlexSeg.
    2. Utopia shows that less than 0.001% of memory requests are affected by migration, but it still does not report the average latency of a single migration.
  2. The details of co-designing the TLB, the PIM cores, and the RestSeg in a PIM system are not considered further.
  3. The hashing inside the RestSeg seems unnecessary.
  4. The cache-like structure loses extensibility with respect to DRAM size. Using a base register and a 512MB globally shared RestSeg, the OS can allocate any fragment of DRAM to any process.

2 level fat-PT

10

Near-Memory Address Translation

  1. Restricting the virtual-to-physical mapping: determining the physical location of a virtual page based on a specific set of bits of the virtual address is considerably faster than accessing the x86-64 multi-level PT.
  2. and with a highly accurate way predictor
  3. translation and data fetch to proceed independently and in parallel. / fully overlap address translation with data fetch.

Example: address translation requests in the MMU

Relationship between the TLB, PTW, PWC, and cache (VIPT L1)
  1. Translation requests that miss in the L1 TLBs are forwarded to a unified L2 TLB that stores translations for both instructions and data.
  2. In case of an L2 TLB miss, the MMU triggers a page table walk (PTW)(1)
  3. In order to reduce PTW latency, page table walkers are equipped with page walk caches (PWC), which are small dedicated caches for each level of the PT (e.g., search in PWC three times for the first three levels in x86-64).
  4. In case of a PWC miss, the MMU issues the request(s) for the corresponding level of the PT to the conventional memory hierarchy(2)
  5. and 6. If the physical address points to a page inside the swap space of the storage device, the MMU issues a request to the storage device to move the page from the swap space into the main memory
  1. PTW is performed by a dedicated hardware page table walker capable of performing multiple concurrent PTWs.
  2. Only the L2 and LLC, I guess, since L1 typically does not store PT entries1; I think this is because the VIPT L1 sits in front of the DTLB.

page fault exception

If the translation is not found in the PT, the MMU raises a page-fault exception to pass control to the OS, which then allocates the missing page in memory.

References

Guvenilir and Patt, 2020, Tailored Page Sizes

https://blog.csdn.net/zmx1026/article/details/46471439

https://www.bilibili.com/video/BV1N3411y7Mr?spm_id_from=444.41.0.0

Zhihu: cache and coherence column index https://zhuanlan.zhihu.com/p/136300660

https://zhuanlan.zhihu.com/p/108425561


  1. Utopia: Fast and Efficient Address Translation via Hybrid Restrictive & Flexible Virtual-to-Physical Address Mappings 

  2. Hash, Don’t Cache (the Page Table) 

  3. ASPLOS’20: Elastic Cuckoo Page Tables: Rethinking Virtual Memory Translation for Parallelism 

  4. Intel® 64 and IA-32 Architectures Software Developer’s Manual, Vol. 3: System Programming Guide 3A 

  5. What are Super pages w.r.t Page tables ? 

  6. Intel SDM /Section 4.5.4 /Figure 4-11. Formats of CR3 and Paging-Structure Entries with 4-Level Paging and 5-Level Paging 

  7. Intel SDM /Section 4.5.4 /Figure 4-9. Linear-Address Translation to a 2-MByte Page using 4-Level Paging 

  8. Arm Architecture Reference Manual for A-profile Architecture 

  9. Linux Transparent Hugepage Support 

  10. ASPLOS'22: Every Walk’s a Hit Making Page Walks Single-Access Cache Hits 

  11. Pinning Page Structure Entries to Last-Level Cache for Fast Address Translation 

  12. Address Translation Conscious Caching and Prefetching for High Performance Cache Hierarchy 

  13. https://www.javatpoint.com/what-is-hashed-page-table-in-operating-system 

  14. PACT'17 Near-Memory Address Translation 

  15. J. F. Couleur and E. L. Glaser, “Shared-access data processing system,” 1968, US Patent 3,412,382. Available: http://www.google.com.ar/patents/US3412382 

  16. Abhishek Bhattacharjee, Daniel Lustig, and Margaret Martonosi. 2011. Shared Last-level TLBs for Chip Multiprocessors. In Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture (HPCA’11). 

  17. J. Bradley Chen, Anita Borg, and Norman P. Jouppi. 1992. A Simulation Based Study of TLB Performance. In Proceedings of the 19th Annual International Symposium on Computer Architecture (ISCA’92). 

  18. Guilherme Cox and Abhishek Bhattacharjee. 2017. Efficient Address Translation for Architectures with Multiple Page Sizes. In Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems 

  19. M. Talluri, S. Kong, M. Hill, and D. Patterson, “Tradeoffs in Supporting Two Page Sizes,” ISCA, 1992. 

  20. J. Navarro, S. Iyer, P. Druschel, and A. Cox, “Practical, Transparent Operating System Support for Superpages,” OSDI, 2002. 

  21. B. Pham, J. Vesely, G. Loh, and A. Bhattacharjee, “Large Pages and Lightweight Memory Management in Virtualized Systems: Can You Have it Both Ways?,” MICRO, 2015. 

  22. ISCA'15 Redundant Memory Mappings for Fast Access to Large Memories. 

  23. Swapnil Haria, Mark D. Hill, and Michael M. Swift. 2018. Devirtualizing Memory in Heterogeneous Systems. In Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’18). 

  24. Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, and Michael M. Swift. 2014. Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). 

  25. “Observations and opportunities in architecting shared virtual memory for heterogeneous systems,” in Proceedings of the 2016 International Symposium on Performance Analysis of Systems and Software, 2016. 

  26. T. W. Barr, A. L. Cox, and S. Rixner, “Translation caching: skip, don’t walk (the page table),” in Proceedings of the 2010 International Symposium on Computer Architecture, 2010. 

  27. A. Bhattacharjee, “Large-reach memory management unit caches,” in Proceedings of the 2013 International Symposium on Microarchitecture, 2013. 

  28. ISCA'13 Efficient Virtual Memory for Big Memory Servers 

  29. onur mutlu virtual memory PPT 
