Upcoming events

Combinatorial Constructions for Effective Testing

Filip Nikšic
Max Planck Institute for Software Systems
SWS Student Defense Talks - Thesis Defense
03 May 2019, 3:30 pm - 4:45 pm
Kaiserslautern building G26, room 111
simultaneous videocast to Saarbrücken building E1 5, room 029 / Meeting ID: 6312
Large-scale distributed systems consist of a number of components, take a number of parameter values as input, and behave differently based on a number of non-deterministic events. All these features—components, parameter values, and events—interact in com- plicated ways, and unanticipated interactions may lead to bugs. Empirically, many bugs in these systems are caused by interactions of only a small number of features. In certain cases, it may be possible to test all interactions of k features for a small constant k by executing a family of tests that is exponentially or even doubly-exponentially smaller than the family of all tests. Thus, in such cases we can effectively uncover all bugs that require up to k-wise interactions of features.

In this thesis we study two occurrences of this phenomenon. First, many bugs in distributed systems are caused by network partition faults. In most cases these bugs occur due to two or three key nodes, such as leaders or replicas, not being able to communicate, or because the leading node finds itself in a block of the partition without quorum. Second, bugs may occur due to unexpected schedules (interleavings) of concurrent events— concurrent exchange of messages and concurrent access to shared resources. Again, many bugs depend only on the relative ordering of a small number of events. We call the smallest number of events whose ordering causes a bug the depth of the bug. We show that in both testing scenarios we can effectively uncover bugs involving small number of nodes or bugs of small depth by executing small families of tests.

We phrase both testing scenarios in terms of an abstract framework of tests, testing goals, and goal coverage. Sets of tests that cover all testing goals are called covering families. We give a general construction that shows that whenever a random test covers a fixed goal with sufficiently high probability, a small randomly chosen set of tests is a covering family with high probability. We then introduce concrete coverage notions relating to network partition faults and bugs of small depth. In case of network partition faults, we show that for the introduced coverage notions we can find a lower bound on the probability that a random test covers a given goal. Our general construction then yields a randomized testing procedure that achieves full coverage—and hence, find bugs— quickly.

In case of coverage notions related to bugs of small depth, if the events in the program form a non-trivial partial order, our general construction may give a suboptimal bound. Thus, we study other ways of constructing convering families. We show that if the events in a concurrent program are partially ordered as a tree, we can explicitly construct a covering family of small size: for balanced trees, our construction is polylogarithmic in the number of events. For the case when the partial order of events does not have a "nice" structure, and the events and their relation to previous events are revealed while the program is running, we give an online construction of covering families. Based on the construction, we develop a randomized scheduler called PCTCP that uniformly samples schedules from a covering family and has a rigorous guarantee of finding bugs of small depth. We experiment with an implementation of PCTCP on two real-world distributed systems—Zookeeper and Cassandra—and show that it can effectively find bugs.

Programming Abstractions for Verifiable Software

Damien Zufferey
Max Planck Institute for Software Systems
Joint Lecture Series
15 May 2019, 12:15 pm - 1:15 pm
Saarbrücken building E1 5, room 002
In this talk, I will show how we can harness the synergy between programming languages and verification methods to help programmers build reliable software. First, we will look at fault-tolerant distributed algorithms. These algorithms are central to any high-availability application but they are notoriously difficult to implement due to asynchronous communication and faults. A fault- tolerant consensus algorithms which can be described in ~50 lines of pseudo code can easily turns into a few thousand lines of actual code. To remediate this, I will introduce PSync a domain specific language for fault-tolerant distributed algorithms. The key insight is the use of communication-closure (logical boundaries in a program that messages should not cross) to structure the code. Communication-closure gives a syntactic scope to the communication, provides some form of logical time, and give the illusion of synchrony. These element greatly simplify the programming and verification of fault-tolerant algorithms. In the second part of the talk, we will discuss a new project exploring how advances in rapid prototyping (3D printers) may impact how we develop software for robots. These advances may soon be enable adding computational elements as part of the internal structure of objects. The goal of this project is to rethink the software/hardware boundary and integrate the two together. I will present a system we are developing where components integrate for geometry (hardware) and behavior (software). The system allows from bottom-up composition and top-down decomposition. The bottom-up composition connects components together to achieve more complex behaviors. The top-down decomposition project a global specification on the individual components and performs verification at the level of individual components.

Transparent Scaling of Deep Learning Systems through Dataflow Graph Analysis

Jinyang Li
New York University
SWS Distinguished Lecture Series
17 May 2019, 10:30 am - 12:00 pm
Saarbrücken building E1 5, room 002
simultaneous videocast to Kaiserslautern building G26, room 111 / Meeting ID: 6312
As deep learning research pushes towards using larger and more sophisticated models, system infrastructure must use many GPUs efficiently. Analyzing the dataflow graph that represents the DNN computation is a promising avenue for optimization. By specializing execution for a given dataflow graph, we can accelerate DNN computation in ways that are transparent to programmers. In this talk, I show the benefits of dataflow graph analysis by discussing two recent systems that we've built to support large model training and low-latency inference. To train very large DNN models, Tofu automatically re-writes a dataflow graph of tensor operators into an equivalent parallel graph in which each original operator can be executed in parallel across multiple GPUs.  To achieve low-latency inference, Batchmaker discovers identical sub-graph computation among different requests to enable batched execution of requests arriving at different times. 

Recent events

The complexity of reachability in vector addition systems

Sylvain Schmitz
École Normale Supérieure Paris-Saclay
SWS Colloquium
05 Apr 2019, 2:00 pm - 3:00 pm
Saarbrücken building G26, room 111
simultaneous videocast to Kaiserslautern building E1 5, room 029 / Meeting ID: 6312
The last few years have seen a surge of decision problems with an astronomical, non primitive-recursive complexity, in logic, verification, and games. While the existence of such uncontrived problems is known since the early 1980s, the field has matured with techniques for proving complexity lower and upper bounds and the definition of suitable complexity classes. This framework has been especially successful for analysing so-called `well-structured systems'---i.e., transition systems endowed with a well-quasi-order, which underly most of these astronomical problems---, but it turns out to be applicable to other decision problems that resisted analysis, including two famous problems: reachability in vector addition systems and bisimilarity of pushdown automata. In this talk, I will explain how some of these techniques apply to reachability in vector addition systems, yielding tight Ackermannian upper bounds for the decomposition algorithm initially invented by Mayr, Kosaraju, and Lambert; this will be based on joint work with Jérôme Leroux.

Worst-Case Execution Time Guarantees for Runtime-Reconfigurable Architectures

Marvin Damschen
Karlsruhe Institute of Technology
SWS Colloquium
04 Apr 2019, 2:00 pm - 3:00 pm
Kaiserslautern building G26, room 111
simultaneous videocast to Saarbrücken building E1 5, room 029 / Meeting ID: 6312
Real-time embedded systems need to be analyzable for execution time guarantees. Despite significant scientific advances, however, timing analysis lags years behind current microarchitectures with out-of-order scheduling pipelines, several hardware threads and multiple (shared)cache layers. To satisfy the increasing demand for predictable performance, analyzable performance features are required. We introduce runtime-reconfigurable instruction set processors as one way to escape the scarcity of analyzable performance features while preserving the flexibility of the system. To this end, we first present a reconfiguration controller for guaranteed reconfiguration delays of accelerators onto an FPGA. We propose a novel timing analysis approach to obtain worst-case execution time (WCET) guarantees for applications that utilize runtime-reconfigurable custom instructions (CIs), which each utilize one or more accelerators. Given the constrained reconfigurable area of an FPGA, we solve the problem of selecting CIs for each computational kernel of an application to optimize its worst-case execution time. Finally, we show that runtime reconfiguration provides the unique feature of optimized static WCET guarantees and optimization of the average-case execution time (maintaining statically-given WCET guarantees) by repurposing reconfigurable area for different selections of CIs at runtime.

The Server-to-Server Landscape: Insights, Opportunities, and Challenges

Balakrishnan Chandrasekaran
Joint Lecture Series
03 Apr 2019, 12:15 pm - 1:15 pm
Saarbrücken building E1 5, room 002
A significant fraction of the Internet traffic today is delivered via content delivery networks (CDNs). Typically, the CDN hauls data from content providers to its back-end servers, moves this data (via an overlay) to its front-end servers, and serves the data from there to the end users. This server-to-server (i.e., back-end to front-end) landscape provides a few unique perspectives as well as opportunities for solving some long-standing networking problems. In this talk, I will briefly discuss these perspectives and opportunities, and present one key measurement effort that leverages this landscape—quantifying the effect of path choices in the Internet’s core on end-user’s experiences.