### A Promising Semantics for Relaxed-Memory Concurrency

Jeehoon Kang<sup>1</sup>, Chung-Kil Hur<sup>1</sup>, **Ori Lahav**<sup>2</sup>, Viktor Vafeiadis<sup>2</sup>, Derek Dreyer<sup>2</sup>

<sup>1</sup>Seoul National University, <sup>2</sup>Max Planck Institute for Software Systems (MPI-SWS)

Kent Concurrency Workshop, July 2016

#### Programming language concurrency semantics

What is the right semantics for a concurrent programming language?

- Allow efficient implementation on modern hardware
- Validate compiler optimizations
- Support high-level reasoning principles
- Avoid "undefined behavior"

Despite many years of research, no semantics has been proven to satisfy all of these properties.

In particular:

- The Java model fails to validate common compiler optimizations.
- The C11 model allows out-of-thin-air behaviors, which break fundamental reasoning principles.
- A stronger semantics for C11 (preserving load-store ordering for relaxed accesses) has some performance impact and relies on undefined behavior for non-atomic accesses.

#### Load-buffering

- Initially, $x = y = 0$.
- All accesses are "relaxed".

$$
\begin{array}{l||l}
a := x; \ /\!\!/ \ 1 & x := y; \\
y := 1; &
\end{array}
$$

Here and below, the annotation $/\!\!/ \ 1$ marks the value returned by the load in the behavior under discussion.

This behavior must be allowed: Power/ARM allow it.

#### Load-buffering + data dependency

$$
\begin{array}{l||l}
a := x; \ /\!\!/ \ 1 & x := y; \\
y := a; &
\end{array}
$$

The behavior should be forbidden: **Values appear out-of-thin-air!**

Same execution as before! C11 allows these behaviors.
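To see why the plain load-buffering outcome needs a relaxed model at all, the following minimal sketch enumerates every interleaving of the load-buffering test on a single shared store and collects the values that `a` can take. The encoding of instructions as `(dst, src)` pairs and the helper names `interleavings` and `run` are assumptions made for this illustration; they are not part of the talk.

```python
def interleavings(t1, t2):
    """Yield every merge of two instruction lists that preserves program order."""
    if not t1 or not t2:
        yield list(t1) + list(t2)
        return
    for rest in interleavings(t1[1:], t2):
        yield [t1[0]] + rest
    for rest in interleavings(t1, t2[1:]):
        yield [t2[0]] + rest


def run(schedule):
    """Execute assignments (dst, src) on one shared store; src is a name or an int."""
    store = {"x": 0, "y": 0}
    for dst, src in schedule:
        store[dst] = src if isinstance(src, int) else store[src]
    return store["a"]


# Load-buffering from the slides: T1 is `a := x; y := 1`, T2 is `x := y`.
T1 = [("a", "x"), ("y", 1)]
T2 = [("x", "y")]

values_of_a = {run(s) for s in interleavings(T1, T2)}
print(values_of_a)  # prints {0}
```

No interleaving yields `a = 1`, so a semantics that must admit this outcome (because Power/ARM do) cannot be a plain interleaving semantics.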
#### Load-buffering + control dependencies

$$
\begin{array}{l||l}
a := x; \ /\!\!/ \ 1 & \mathbf{if}\ (y = 1) \\
\mathbf{if}\ (a = 1) & \quad x := 1; \\
\quad y := 1; &
\end{array}
$$

The behavior should be forbidden: **DRF guarantee is broken!**

Same execution as before! C11 allows these behaviors.

#### The hardware solution

Keep track of syntactic dependencies, and forbid "dependency cycles".

Load-buffering + data dependency:

$$
\begin{array}{l||l}
a := x; \ /\!\!/ \ 1 & x := y; \\
y := a; &
\end{array}
$$

Load-buffering + fake dependency:

$$
\begin{array}{l||l}
a := x; \ /\!\!/ \ 1 & x := y; \\
y := a + 1 - a; &
\end{array}
$$

This approach is not suitable for a programming language: compilers do not preserve syntactic dependencies.

#### A "promising" semantics for relaxed-memory concurrency

We propose a model that satisfies all these goals, and covers nearly all features of C11.

- DRF guarantees
- No "out-of-thin-air" values
- Avoid "undefined behavior"
- Efficient implementation on modern hardware
- Compiler optimizations

**Key idea:** Start with an operational semantics, and allow threads to promise to write in the future.

#### Store-buffering

$$
\begin{array}{l||l}
\multicolumn{2}{c}{x = y = 0} \\
x := 1; & y := 1; \\
a := y \ /\!\!/ \ 0 & b := x \ /\!\!/ \ 0
\end{array}
$$

Global memory is a pool of messages of the form $\langle \textit{location} : \textit{value} @ \textit{timestamp} \rangle$. Each thread's view records, for every location, the timestamp of the latest message the thread has observed: a read may return any message whose timestamp is not below the view, and a write picks a timestamp above it.

| Step | Memory | $T_1$'s view ($x$, $y$) | $T_2$'s view ($x$, $y$) |
|---|---|---|---|
| Initially | $\langle x:0@0\rangle$ $\langle y:0@0\rangle$ | 0, 0 | 0, 0 |
| $T_1$: $x := 1$ | $\langle x:0@0\rangle$ $\langle y:0@0\rangle$ $\langle x:1@1\rangle$ | 1, 0 | 0, 0 |
| $T_2$: $y := 1$ | $\langle x:0@0\rangle$ $\langle y:0@0\rangle$ $\langle x:1@1\rangle$ $\langle y:1@1\rangle$ | 1, 0 | 0, 1 |
| $T_1$: $a := y$ reads 0 | unchanged | 1, 0 | 0, 1 |
| $T_2$: $b := x$ reads 0 | unchanged | 1, 0 | 0, 1 |

Both reads may return 0 because each thread's view of the other location is still at timestamp 0.

#### Coherence test

$$
\begin{array}{l||l}
\multicolumn{2}{c}{x = 0} \\
x := 1; & x := 2; \\
a := x \ /\!\!/ \ 2 & b := x \ /\!\!/ \ 1
\end{array}
$$

| Step | Memory | $T_1$'s view ($x$) | $T_2$'s view ($x$) |
|---|---|---|---|
| Initially | $\langle x:0@0\rangle$ | 0 | 0 |
| $T_1$: $x := 1$ | $\langle x:0@0\rangle$ $\langle x:1@1\rangle$ | 1 | 0 |
| $T_2$: $x := 2$ | $\langle x:0@0\rangle$ $\langle x:1@1\rangle$ $\langle x:2@2\rangle$ | 1 | 2 |

$T_1$ can still read $\langle x:2@2\rangle$, but $T_2$'s view of $x$ is already at timestamp 2, so it cannot read $\langle x:1@1\rangle$. The annotated outcome $a = 2$, $b = 1$ is therefore not produced.

#### Load-buffering

$$
\begin{array}{l||l}
\multicolumn{2}{c}{x = y = 0} \\
a := x; \ /\!\!/ \ 1 & x := y; \\
y := 1; &
\end{array}
$$

- To model load-store reordering, we allow "promises".
- At any point, a thread may promise to write a message in the future, allowing other threads to read from the promised message.
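The timestamped message pool and per-thread views used in the store-buffering and coherence slides (before promises enter the picture) can be explored exhaustively with a small script. This is a minimal sketch under stated assumptions: the tuple encoding of messages as `(loc, val, ts)`, the instruction encoding, the bounded timestamp set `stamps`, and the function name `outcomes` are all hypothetical choices made for this illustration, and only relaxed accesses are modelled.

```python
def outcomes(progs, locs, stamps=(1, 2)):
    """Explore all executions of the promise-free machine; return final register maps."""
    mem0 = frozenset((loc, 0, 0) for loc in locs)          # messages are (loc, val, ts)
    view0 = tuple((loc, 0) for loc in sorted(locs))
    start = (mem0, tuple((0, (), view0) for _ in progs))   # thread = (pc, regs, view)
    seen, stack, finals = {start}, [start], set()
    while stack:
        mem, threads = stack.pop()
        if all(th[0] == len(p) for th, p in zip(threads, progs)):
            finals.add(tuple(sorted(kv for th in threads for kv in th[1])))
        for i, prog in enumerate(progs):
            pc, regs, view = threads[i]
            if pc == len(prog):
                continue
            kind, arg1, arg2 = prog[pc]
            view_d, succs = dict(view), []
            if kind == "load":                              # ("load", reg, loc)
                for loc, val, ts in mem:
                    # A read may return any message not older than the thread's view.
                    if loc == arg2 and ts >= view_d[loc]:
                        succs.append((mem, regs + ((arg1, val),), {**view_d, loc: ts}))
            else:                                           # ("store", loc, val)
                used = {ts for loc, _, ts in mem if loc == arg1}
                for ts in stamps:
                    # A write picks an unused timestamp above the thread's view.
                    if ts > view_d[arg1] and ts not in used:
                        succs.append((mem | {(arg1, arg2, ts)}, regs, {**view_d, arg1: ts}))
            for mem2, regs2, view2 in succs:
                th2 = (pc + 1, regs2, tuple(sorted(view2.items())))
                nxt = (mem2, threads[:i] + (th2,) + threads[i + 1:])
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return finals


sb = ([("store", "x", 1), ("load", "a", "y")], [("store", "y", 1), ("load", "b", "x")])
coh = ([("store", "x", 1), ("load", "a", "x")], [("store", "x", 2), ("load", "b", "x")])

sb_out = outcomes(sb, ("x", "y"))
coh_out = outcomes(coh, ("x",))
print((("a", 0), ("b", 0)) in sb_out)    # True: store-buffering outcome is allowed
print((("a", 2), ("b", 1)) in coh_out)   # False: the coherence violation is excluded
```

The coherence outcome is excluded because the two writes to `x` must take distinct timestamps, and each thread's view of `x` is already at its own write when it reads.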
An execution in which $T_1$ promises $y := 1$:

| Step | Memory | $T_1$'s view ($x$, $y$) | $T_2$'s view ($x$, $y$) |
|---|---|---|---|
| Initially | $\langle x:0@0\rangle$ $\langle y:0@0\rangle$ | 0, 0 | 0, 0 |
| $T_1$ promises $\langle y:1@1\rangle$ | $\langle x:0@0\rangle$ $\langle y:0@0\rangle$ $\langle y:1@1\rangle$ | 0, 0 | 0, 0 |
| $T_2$ reads $y = 1$ | unchanged | 0, 0 | 0, 1 |
| $T_2$ writes $x := 1$ | $\langle x:0@0\rangle$ $\langle y:0@0\rangle$ $\langle y:1@1\rangle$ $\langle x:1@1\rangle$ | 0, 0 | 1, 1 |
| $T_1$: $a := x$ reads 1 | unchanged | 1, 0 | 1, 1 |
| $T_1$: $y := 1$ fulfils the promise | unchanged | 1, 1 | 1, 1 |

#### Load-buffering + dependency

$$
\begin{array}{l||l}
a := x; \ /\!\!/ \ 1 & x := y; \\
y := a; &
\end{array}
$$

Must not admit the same execution!

#### Key idea

A thread can only promise if it can perform the write anyway (even without having made the promise).

#### Certified promises

**Thread-local certification.** A thread can promise to write a message if it can *thread-locally certify* that its promise will be fulfilled.

Load-buffering:

$$
\begin{array}{l||l}
a := x; \ /\!\!/ \ 1 & x := y; \\
y := 1; &
\end{array}
$$

Load-buffering + fake dependency:

$$
\begin{array}{l||l}
a := x; \ /\!\!/ \ 1 & x := y; \\
y := a + 1 - a; &
\end{array}
$$

In both programs, $T_1$ may promise $y = 1$, since it is able to write $y = 1$ by itself.

Load-buffering + dependency:

$$
\begin{array}{l||l}
a := x; \ /\!\!/ \ 1 & x := y; \\
y := a; &
\end{array}
$$

$T_1$ may **NOT** promise $y = 1$, since it is not able to write $y = 1$ by itself.

#### The full model

- Atomic updates
- Release/acquire fences and accesses
- Release sequences
- SC fences and accesses
- Plain accesses (C11's non-atomics and Java's normal accesses)

Access modes: $\mathtt{pln} \sqsubset \mathtt{rlx} \sqsubset \mathtt{ra} \sqsubset \mathtt{sc}$

To achieve all of this we enrich our timestamps, messages, and thread views.
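The certified-promise rule for relaxed accesses can be exercised on the three load-buffering variants with a small exhaustive explorer. This is a sketch under several assumptions, not the model from the talk: it covers relaxed loads and stores only, bounds timestamps and promised values by the hypothetical constants `STAMPS` and `VALS`, certifies against the current memory only (the full model certifies against every future memory), and introduces a register `b` to split `x := y` into a load and a store. All names are illustrative.

```python
LOCS, VALS, STAMPS = ("x", "y"), (0, 1), (1, 2)   # finite search bounds (an assumption)


def freeze(d):
    return tuple(sorted(d.items()))


def steps(prog, mem, thread, allow_promise):
    """Yield (memory, thread) after each step the thread can take; messages are (loc, val, ts)."""
    pc, regs, view, proms = thread
    regs_d, view_d = dict(regs), dict(view)
    if pc < len(prog) and prog[pc][0] == "load":            # ("load", reg, loc)
        _, reg, loc = prog[pc]
        for l, v, t in mem:
            if l == loc and t >= view_d[loc]:
                yield mem, (pc + 1, freeze({**regs_d, reg: v}), freeze({**view_d, loc: t}), proms)
    elif pc < len(prog):                                     # ("store", loc, expr)
        _, loc, expr = prog[pc]
        val = expr(regs_d)
        used = {t for l, _, t in mem if l == loc}
        for t in STAMPS:
            if t <= view_d[loc]:
                continue
            msg, view2 = (loc, val, t), freeze({**view_d, loc: t})
            if msg in proms:                                 # fulfil an outstanding promise
                yield mem, (pc + 1, regs, view2, proms - {msg})
            elif t not in used:                              # ordinary write of a fresh message
                yield mem | {msg}, (pc + 1, regs, view2, proms)
    if allow_promise:                                        # promise any unused message
        for loc in LOCS:
            used = {t for l, _, t in mem if l == loc}
            for msg in [(loc, v, t) for v in VALS for t in STAMPS if t not in used]:
                yield mem | {msg}, (pc, regs, view, proms | {msg})


def certifiable(prog, mem, thread):
    """Can the thread, running alone and making no new promises, fulfil all its promises?"""
    if not thread[3]:
        return True
    return any(certifiable(prog, m, t) for m, t in steps(prog, mem, thread, False))


def outcomes(progs):
    mem0 = frozenset((loc, 0, 0) for loc in LOCS)
    init = (0, (), freeze({loc: 0 for loc in LOCS}), frozenset())
    start = (mem0, tuple(init for _ in progs))
    seen, stack, finals = {start}, [start], set()
    while stack:
        mem, threads = stack.pop()
        if all(th[0] == len(p) for th, p in zip(threads, progs)):
            finals.add(freeze({k: v for th in threads for k, v in th[1]}))
        for i, prog in enumerate(progs):
            for mem2, th2 in steps(prog, mem, threads[i], True):
                if certifiable(prog, mem2, th2):             # certification after every step
                    nxt = (mem2, threads[:i] + (th2,) + threads[i + 1:])
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
    return finals


def a_can_be_1(progs):
    return any(dict(o)["a"] == 1 for o in outcomes(progs))


t2 = [("load", "b", "y"), ("store", "x", lambda r: r["b"])]
lb = ([("load", "a", "x"), ("store", "y", lambda r: 1)], t2)
lb_dep = ([("load", "a", "x"), ("store", "y", lambda r: r["a"])], t2)
lb_fake = ([("load", "a", "x"), ("store", "y", lambda r: r["a"] + 1 - r["a"])], t2)

print(a_can_be_1(lb), a_can_be_1(lb_dep), a_can_be_1(lb_fake))  # True False True
```

In this sketch the dependent variant never produces a message with value 1: every promise of such a message would have to be certified by a thread that, running alone, can only read 0.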
Status of the goals, ticked as each theorem is stated:

- ☐ Compiler optimizations
- ☐ Efficient implementation on modern hardware
- ☐ DRF guarantees
- ☐ No "out-of-thin-air" values
- ☑ Avoid "undefined behavior"

☑ Compiler optimizations

#### Theorem (Local Program Transformations)

The following transformations are sound:

- Trace-preserving transformations
- Reorderings:

$$
\begin{array}{lll}
\mathrm{R}^{x}_{\sqsubseteq \mathtt{rlx}}; \mathrm{R}^{y} & \mathrm{W}^{x}; \mathrm{W}^{y}_{\sqsubseteq \mathtt{rlx}} & \mathrm{W}^{x}_{o_1}; \mathrm{R}^{y}_{o_2} \text{ unless } o_1 = o_2 = \mathtt{sc} \\
\mathrm{R}^{x}_{\sqsubseteq \mathtt{rlx}}; \mathrm{R}^{x}_{\mathtt{pln}} & \mathrm{R}^{x}_{\sqsubseteq \mathtt{rlx}}; \mathrm{W}^{y}_{\sqsubseteq \mathtt{rlx}} & \mathrm{R}^{x}_{\neq \mathtt{rlx}}; \mathrm{F}_{\mathtt{acq}} \\
\mathrm{W}; \mathrm{F}_{\mathtt{acq}} & \mathrm{F}_{\mathtt{rel}}; \mathrm{W}_{\neq \mathtt{rlx}} & \mathrm{F}_{\mathtt{rel}}; \mathrm{R}
\end{array}
$$

- Merges:

$$
\mathrm{R}_o; \mathrm{R}_o \leadsto \mathrm{R}_o \qquad \mathrm{W}_o; \mathrm{W}_o \leadsto \mathrm{W}_o \qquad \mathrm{W}; \mathrm{R}_{\mathtt{ra}} \leadsto \mathrm{W} \qquad \mathrm{W}_{\mathtt{sc}}; \mathrm{R}_{\mathtt{sc}} \leadsto \mathrm{W}_{\mathtt{sc}}
$$

☑ Efficient implementation on modern hardware

#### Theorem (Compilation to TSO/Power)

- Standard compilation to TSO is correct
  - TSO can be fully explained by transformations over SC
- Compilation to Power is correct
  - Using an axiomatic presentation of the promise-free machine

☑ DRF guarantees

#### Theorem (DRF Theorems)

- **Key Lemma:** Races only on ra/sc under promise-free semantics ⇒ only promise-free behaviors
- **DRF-RA:** Races only on ra/sc under release/acquire semantics ⇒ only release/acquire behaviors
- **DRF-SC:** Races only on sc under SC semantics ⇒ only SC behaviors

☑ No "out-of-thin-air" values

#### Theorem (Invariant-Based Program Logic)

Fix a global invariant $J$. Hoare logic where all assertions are of the form $P \wedge J$, where $P$ mentions only local variables, is sound.

Load-buffering + data dependency, with $x = y = 0$ initially:

$$
\begin{array}{l||l}
\{J\} & \{J\} \\
a := x; & x := y; \\
\{J \wedge (a = 0)\} & \{J\} \\
y := a; & \\
\{J\} &
\end{array}
\qquad J \triangleq (x = 0) \wedge (y = 0)
$$

| ✓ Compiler optimizations | ✓ DRF guarantees |
|---|---|
| ✓ Efficient implementation on modern hardware | ✓ No "out-of-thin-air" values |
| | ✓ Avoid "undefined behavior" |

#### Future Work

- Correct compilation to ARMv8
- Global transformations and sequentialization
- Liveness
- Program logic

See http://sf.snu.ac.kr/promise-concurrency/ for Coq proofs.

Thank you!

#### Atomic updates

- To obtain *atomicity*, the timestamp order keeps track of immediate adjacency.
- Main challenge: threads performing updates may invalidate the already-certified promises of other threads.
$$
\begin{array}{l||l||l}
a := x; \ /\!\!/ \ 1 & x := y; & z{+}{+}; \\
b := z{+}{+}; \ /\!\!/ \ 0 & & \\
y := b + 1; & &
\end{array}
$$

- Solution: require certification for *every future memory*.

#### Guiding Principle of Thread Locality

The set of actions a thread can take is determined only by the current memory and its own state.

#### Certification is needed at every step

Sequentialization is unsound:

```
a := x; // 1      ||  y := x;  ||  x := y;
if a = 0 then
  x := 1;

        ⇒

a := x; // 1      ||  x := y;
if a = 0 then
  x := 1;
y := x;
```