News 2019

Keon Jang joins MPI-SWS

June 2019
Keon Jang has joined the institute as a tenure-track faculty member, effective February 1, 2019. Keon Jang is joining us from Google, where he has been a software engineer since 2016. He is broadly interested in network systems and currently his work focuses on network performance isolation in data-center networks.

Prior to Google, he worked on software support for on network function virtualization (NFV) at Intel Labs. He received his PhD in Computer Science from KAIST, and subsequently held a postdoctoral research position at Microsoft Research Cambridge, UK.

Research Spotlight: Tracing the Behavior of Cloud Applications

April 2019
Consider the everyday websites and apps that we use: online shops, news websites, search engines, social networks, navigation apps, instant messaging apps, and many more.  Most of these programs don't just run in isolation on our laptops or phones, but instead connect over the internet to backends and databases running in datacenters across the world.  These backends perform a wide range of tasks, including constructing your personalized social network feed, storing and retrieving comments on message boards, and calculating results for your search query.  From our perspective as users, the actions we perform are simple, such as opening the app and loading our personalized profile.  But under the hood, each action usually results in complex processing across many processes and machines in a datacenter.

It has never been easier to write and deploy complex programs like these.  Cloud computing companies who own datacenters (such as Google, Amazon, and Microsoft) will gladly rent out computer services at a touch of a button, on demand.  Using designs like microservices, it is easy for programmers to construct complex programs out of smaller, simpler building blocks.  There are frameworks and open-source software packages to help developers construct big applications out of small pieces, to spread those pieces out over multiple machines in a datacenter, and to have the pieces communicate and interact with each other over the network.

Problems show up when software goes live.  Compared to developing and deploying the software, it is much harder to make sure everything goes smoothly when the software is up and running.  Distributed computer programs have lots of moving pieces, and there are lots of opportunities for things to go wrong.  For example, if one machine in the datacenter has a hardware problem, or the code is buggy, or too many people are trying to access it at once, the effects can be wide-ranging.  It can create a butterfly effect of problems, which we term cascading failures, that can lead to the app or website as a whole becoming slow, or going down entirely.  It's hard for programmers to get to the bottom of these kinds of problems, because there's no single machine or process doing all the work.  A problem that occurs on one machine might manifest as strange symptoms on a different machine later on.  Figuring out the root cause of a problem is challenging, as is anticipating problems in the first place.  Even big internet companies like Facebook and Google experience problems like this today.

These kinds of problems motivate the research of the Cloud Software Systems Research Group at the Max Planck Institute for Software Systems.  We research ways for operators to understand what's going on in their live distributed system, to troubleshoot problems when they occur at runtime, and to design systems that proactively avoid problems.  One approach we take is to design distributed tracing tools that can be used by the system operators.  The goal of distributed tracing is to record information about what a program does while it's running.  The tools record events, metrics, and performance counters, which together expose the current state and performance of the system, and how it changes over time.  A key additional step taken by distributed tracing tools is to record the causal ordering of events happening in the system — that is, the interactions and dependencies between machines and processes.  Causal ordering is is very useful for diagnosing problems that span multiple processes and machines, especially when there might be lots of concurrent, unrelated activity going on at the same time.  It lets us reconstruct the end-to-end execution paths of requests, across all components and machines, and then reason about the sequence of conditions and events that led up to a problem.  Without causal ordering, this information is missing, and pinpointing the root cause of a problem would be like searching for a needle in a haystack.

The Cloud Software Systems Research Group has looked at a number of challenges in making distributed tracing tools efficient, scalable, and more widely deployable. In our recent work, we have thought about how you can efficiently insert instrumentation to record entirely new information, into an already-running system, without having to rebuild or restart the system [1].  We have looked at problems in dealing with the large volume of data generated by distributed tracing tools, and deciding which data is most valuable to keep if there's not enough room to keep it all [2].  We have also considered the implications of distributed tracing at extremely large scale, and how to efficiently collect, aggregate, and process tracing data in real-time [3].

In our ongoing work, we are investigating ways for the data recorded by tracing tools to feed back in to decisions made by datacenter infrastructure, such as resource management, scheduling, and load balancing.  We are also considering new challenges that arise in scalable data analysis: how do you analyze large datasets of traces and derive insights about aggregate system behavior?  One approach we are exploring uses techniques in representational machine learning, to transform richly annotated tracing data into a more tractable form for interactive analysis.  More broadly, our group investigates a variety of approaches besides just distributed tracing tools, including ways to better design and develop the distributed systems in the first place.  Ultimately, our goal is to make modern cloud systems easier to operate, understand, and diagnose.

References


[1] Jonathan Mace, Ryan Roelke, and Rodrigo Fonseca.  Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems.  In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP '15), 2015.


[2] Pedro Las-Casas, Jonathan Mace, Dorgival Guedes, and Rodrigo Fonseca.  Weighted Sampling of Execution Traces: Capturing More Needles and Less Hay.  In Proceedings of the 9th ACM Symposium on Cloud Computing (SoCC'18), 2018.


[3] Jonathan Kaldor, Jonathan Mace, Michał Bejda, Edison Gao, Wiktor Kuropatwa, Joe O'Neill, Kian Win Ong, Bill Schaller, Pingjia Shan, Brendan Viscomi, Vinod Venkataraman, Kaushik Veeraraghavan, Yee Jiun Song.  Canopy: An End-to-End Performance Tracing And Analysis System. In Proceedings of the 26th ACM Symposium on Operating Systems Principles (SOSP '17), 2017.

Girls Day 2019

April 2019
As in the years before, the MPIs for Informatics and Software Systems jointly participated in the annual Girls' Day on 28 March. We welcomed a record number of 27 school-aged girls to our institutes to provide them with a bit of insight into computer science. Together we programmed smartphone apps, soldered blinking smiles and running bugs, and solved hands-on computer science puzzles.

Research Spotlight: A General Data Anonymity Measure

January 2019
A long-standing problem both within research and in society generally is that of how to analyze data about people without risking the privacy of those people. There is an ever-growing amount of data about people: medical, financial, social, government, geo-location, etc. This data is very valuable in terms of better understanding ourselves. Unfortunately, analyzing the data in its raw form carries the risk of exposing private information about people.

The problem of how to analyze data while protecting privacy has been around for more than 40 years---ever since the first data processing systems were developed. Most workable solutions are ad hoc: practitioners try things like removing personally identifying information (e.g. names and addresses), aggregating data, removing outlying data, and even swapping some data between individuals. This process can work reasonably well, but it is time-consuming, requires substantial expertise to get right, and invariably limits the accuracy of the analysis or the types of analysis that can be done.

A holy grail within computer science is to come up with an anonymization system that has formal guarantees of anonymity and provides good analytics. Most effort in this direction has focused on two ideas, K-anonymity and Differential Privacy. Both can provide strong anonymity, but except in rare cases neither can do so while still providing adequate analytics. As a result, common practice is still to use informal ad hoc techniques with weak anonymization, and to mitigate risk by for instance sharing data only with trusted partners.

The European Union has raised the stakes with the General Data Protection Regulation (GDPR). The GDPR has strict rules on how personal data may be used, and threatens large fines to organizations that do not follow the rules. However, GDPR says that if data is anonymous, then it is not considered personal and does not fall under the rules. Unfortunately, there are no precise guidelines on how to determine if data is anonymous or not. Member states are expected to come up with certification programs for anonymity, but do not know how to do so.

This is where we come in. Paul Francis' group, in research partnership with the startup Aircloak, has been developing an anonymizing technology called Diffix over the last five years. Diffix is an empirical, not a formal technology, and so the question remains "how anonymous is Diffix?" While it may not be possible to precisely answer that question, one way we try to answer that question is through a bounty program: we pay attackers who can demonstrate how to compromise anonymity in our system. Last year we ran the first (and still only) bounty program for anonymity. The program was successful in that some attacks were found, and in the process of designing defensive measures, Diffix has improved.

In order to run the bounty program, we naturally needed a measure of anonymity so that we could decide how much to pay attackers. We designed a measure based on how much confidence an attacker has in a prediction of an individual's data values, among other things. At some point, we realized that our measure applies not just to attacks on Diffix, but to any anonymization system. We also realized that our measure might be useful in the design of certification programs for anonymity.

We decided to develop a general score for anonymity, and to build tools that would allow anyone to apply the measure to any anonymization technology. The score is called the GDA Score, for General Data Anonymity Score.

The primary strength of the GDA Score is that it can be applied to any anonymization method, and therefore is apples-to-apples. The primary weakness is that it is based on empirical attacks (real attacks against real systems), and therefore the score is only as good as the attacks themselves. If there are unknown attacks on a system, then the score won't reflect this and may therefore make a system look more anonymous than it is.

Our hope is that over time enough attacks will be developed that we can have high confidence in the GDA Score. Towards that end, we've started the Open GDA Score Project. This is a community effort to provide software and databases in support of developing new attacks, and a repository where the scores can be viewed. We recently launched the project in the form of a website, www.gda-score.org, and some initial tools and simple attacks. We will continue to develop tools and new attacks, but our goal is to attract broad participation from the community.

For more information, visit www.gda-score.org.

Francis' group launches Open GDA Score Project

January 2019
MPI-SWS Director Paul Francis and his group have launched the Open GDA Score Project at www.gda-score.org.  This is an open project to develop a set of tools and databases to generate anonymity scores for any data anonymization technique. The GDA Score, which stands for General Data Anonymity Score, is the first data anonymization measurement methodology that works with any anonymization technique. The GDA Score is a generalization of the measurement technique developed by Francis' group for the Diffix bounty program run last year. This was the first bounty program for anonymity. The GDA Score is of particular interest in Europe, where member states are expected to produce certification programs for anonymity.