Events 2023

Improving Decision Making with Machine Learning, Provably

Manuel Gomez Rodriguez Max Planck Institute for Software Systems
06 Dec 2023, 12:15 pm - 1:15 pm
Saarbrücken building E1 5, room 002
Joint Lecture Series
Decision support systems for classification tasks are predominantly designed to predict the value of the ground truth labels.  However, these systems also need to make human experts understand when and how to use these predictions to update their own  predictions. Unfortunately, this has been proven challenging. In this talk, I will introduce an alternative type of decision support systems  that circumvent this challenge by design. Rather than providing a single label prediction, these systems provide a set of label prediction  values, ...
Decision support systems for classification tasks are predominantly designed to predict the value of the ground truth labels.  However, these systems also need to make human experts understand when and how to use these predictions to update their own  predictions. Unfortunately, this has been proven challenging. In this talk, I will introduce an alternative type of decision support systems  that circumvent this challenge by design. Rather than providing a single label prediction, these systems provide a set of label prediction  values, namely a prediction set, and forcefully ask experts to predict a label value from the prediction set. Moreover, I will discuss how to use conformal prediction, online learning and counterfactual inference to efficiently construct prediction sets that optimize experts’ performance,  provably. Further, I will present the results of a large-scale human subject study, which show that, for decision support systems based on  prediction sets, limiting experts’ level of agency leads to greater performance than allowing experts to always exercise their own agency.
Read more

Automated and Foundational Verification of Low-Level Programs

Michael Sammler MPI-SWS
04 Dec 2023, 5:00 pm - 6:00 pm
Saarbrücken building E1 5, room 029
simultaneous videocast to Kaiserslautern building G26, room 113
SWS Student Defense Talks - Thesis Defense
Formal verification is a promising technique to ensure the reliability of low-level programs like operating systems and hypervisors, since it can show the absence of whole classes of bugs and prevent critical vulnerabilities. To realize the full potential of formal verification for real-world low-level programs, however one has to overcome several challenges, including: (1) dealing with the complexities of realistic models of real-world programming languages; (2) ensuring the trustworthiness of the verification, ideally by providing foundational proofs (i.e., ...
Formal verification is a promising technique to ensure the reliability of low-level programs like operating systems and hypervisors, since it can show the absence of whole classes of bugs and prevent critical vulnerabilities. To realize the full potential of formal verification for real-world low-level programs, however one has to overcome several challenges, including: (1) dealing with the complexities of realistic models of real-world programming languages; (2) ensuring the trustworthiness of the verification, ideally by providing foundational proofs (i.e., proofs that can be checked by a general-purpose proof assistant); and (3) minimizing the manual effort required for verification by providing a high degree of automation.

This dissertation presents multiple projects that advance formal verification along these three axes: RefinedC provides the first approach for verifying C code that combines foundational proofs with a high degree of automation via a novel refinement and ownership type system. Islaris shows how to scale verification of assembly code to realistic models of modern instruction set architectures-in particular, Armv8-A and RISC-V. DimSum develops a decentralized approach for reasoning about programs that consist of components written in multiple different languages (e.g., assembly and C), as is common for low-level programs. RefinedC and Islaris rest on Lithium, a novel proof engine for separation logic that combines automation with foundational proofs.
Read more

The complexity of Presburger arithmetic with power or powers

Dmitry Chistikov University of Warwick
14 Nov 2023, 11:00 am - 12:00 pm
Kaiserslautern building G26, room 111
SWS Colloquium
Presburger arithmetic, or linear integer arithmetic, is known to have decision procedures that work in triply exponential time.

Jointly with M. Benedikt (Oxford) and A. Mansutti (IMDEA Software Institute), we have recently considered two decidable extensions of Presburger arithmetic: with the power function and with the predicate for the set of powers of 2. No elementary decision procedures were known for these two theories.

In this talk, I will introduce this work and outline ideas behind our results. ...
Presburger arithmetic, or linear integer arithmetic, is known to have decision procedures that work in triply exponential time.

Jointly with M. Benedikt (Oxford) and A. Mansutti (IMDEA Software Institute), we have recently considered two decidable extensions of Presburger arithmetic: with the power function and with the predicate for the set of powers of 2. No elementary decision procedures were known for these two theories.

In this talk, I will introduce this work and outline ideas behind our results. Namely, we have shown that existence of solutions over N to systems of linear equations and constraints of the form $y = 2^x$ can be decided in nondeterministic exponential time. Also, linear integer arithmetic extended with a predicate for powers of 2 can be decided in triply exponential time.

(Based on a paper in ICALP'23.)
Read more

Digitizing the Human Face

Mohamed Elgharib MPI-INF - D6
08 Nov 2023, 12:15 pm - 1:15 pm
Saarbrücken building E1 5, room 002
Joint Lecture Series
Digitization is the process of creating digital visual twins of real objects and enabling interactions with them in a seamless manner. The human face is a central element of our visual communication and hence its digitization is of utmost importance in visual computing. This has several applications in videoconferencing, cinema production, video games, virtual environments, and many more. In this talk, we will discuss the digitization of the human face. To this end, we will highlight the main stages of the digitization pipeline, ...
Digitization is the process of creating digital visual twins of real objects and enabling interactions with them in a seamless manner. The human face is a central element of our visual communication and hence its digitization is of utmost importance in visual computing. This has several applications in videoconferencing, cinema production, video games, virtual environments, and many more. In this talk, we will discuss the digitization of the human face. To this end, we will highlight the main stages of the digitization pipeline, and we will outline the main pillars of digitizing the human face. We will show how to achieve these pillars by utilizing the most recent advances in learnable neural implicit scene representations. We will discuss specific face digitization problems in more detail and show important features that learnable neural implicit scene representations can bring to the digitization process. This talk highlights our efforts towards the full digitization of our world, with an ultimate goal of allowing all of us to connect and interact with our friends, family, and loved ones, over a distance.
Read more

Exposing Concurrency Bugs from their Hiding Places

Umang Mathur National University of Singapore
26 Oct 2023, 10:30 am - 11:30 am
Kaiserslautern building G26, room 111
simultaneous videocast to Saarbrücken building E1 5, room 029
SWS Colloquium
Concurrent programs are notoriously hard to write correctly, as scheduling nondeterminism introduces subtle errors that are both hard to detect and to reproduce.

Despite rigorous testing, concurrency bugs such as races conditions often find their way into production software, and manifest as critical security issues. Consequently, considerable effort has been made towards developing efficient techniques for detecting them automatically.

The preferred approach to detect data races is through dynamic analysis, where one executes the software with some test inputs, ...
Concurrent programs are notoriously hard to write correctly, as scheduling nondeterminism introduces subtle errors that are both hard to detect and to reproduce.

Despite rigorous testing, concurrency bugs such as races conditions often find their way into production software, and manifest as critical security issues. Consequently, considerable effort has been made towards developing efficient techniques for detecting them automatically.

The preferred approach to detect data races is through dynamic analysis, where one executes the software with some test inputs, and checks for the presence of bugs in the execution observed.

Traditional bug detectors however are often unable to discover simple bugs present in the underlying software, even after executing the program several times, because these bugs are sensitive to thread scheduling.

In this talk, I will discuss how runtime predictive analysis can help. Runtime predictive analyses aim to expose concurrency bugs, that can be otherwise missed by traditional dynamic analysis techniques (such as the race detector TSan), by inferring the presence of these bugs in alternate executions of the underlying software, without explicitly re-executing the software program.

I will talk about the fundamentals of and recent algorithmic advances for building highly scalable and sound predictive analysis techniques.
Read more

Naturalness & Bimodality of Code

Prem Devanbu University California, Davis
18 Oct 2023, 10:00 am - 11:00 am
Saarbrücken building E1 5, room 002
simultaneous videocast to Kaiserslautern building G26, room 111
SWS Distinguished Lecture Series
After discovering, back in 2011, that Language Models are useful for modeling repetitive patterns in source code (c.f. The "Naturalness" of software ), and exploring some applications thereof, more recently (since about 2019) our group at UC Davis has focused on the observation that Software, as usually written, is bimodal, admitting both the well-known formal, deterministic semantics (mostly for machines) and probabilistic, noisy semantics (for humans). This bimodality property affords both new approaches to software tool construction (using machine-learning) and new ways of studying human code reading. ...
After discovering, back in 2011, that Language Models are useful for modeling repetitive patterns in source code (c.f. The "Naturalness" of software ), and exploring some applications thereof, more recently (since about 2019) our group at UC Davis has focused on the observation that Software, as usually written, is bimodal, admitting both the well-known formal, deterministic semantics (mostly for machines) and probabilistic, noisy semantics (for humans). This bimodality property affords both new approaches to software tool construction (using machine-learning) and new ways of studying human code reading. In this talk, I'll give an overview of the Naturalness/Bimodality program, and some recent work we have done on calibrating the quality of code produced by large language models, and also on "bimodal prompting".

--

Please contact the office team for link information
Read more

Algorithms for Plurality

Smitha Milli Cornell Tech
17 Oct 2023, 10:00 am - 11:00 am
Saarbrücken building E1 5, room 029
SWS Colloquium
Machine learning algorithms curate much of the content we encounter online. However, there is concern that these algorithms may unintentionally amplify hostile discourse and perpetuate divisive 'us versus them' mentalities. How can we re-engineer algorithms to bridge diverse perspectives and facilitate constructive conflict? First, I will discuss results from our randomized experiment measuring effects of Twitter’s engagement-based ranking algorithm on downstream sociopolitical outcomes like the amplification of divisive content and users’ perceptions of their in-group and out-group. ...
Machine learning algorithms curate much of the content we encounter online. However, there is concern that these algorithms may unintentionally amplify hostile discourse and perpetuate divisive 'us versus them' mentalities. How can we re-engineer algorithms to bridge diverse perspectives and facilitate constructive conflict? First, I will discuss results from our randomized experiment measuring effects of Twitter’s engagement-based ranking algorithm on downstream sociopolitical outcomes like the amplification of divisive content and users’ perceptions of their in-group and out-group. Crucially, we found that an alternative ranking, based on users’ stated preferences rather than their engagement, reduced amplification of negative, partisan, and out-group hostile content. Second, I will discuss how we applied these insights in practice to design an objective function for algorithmic ranking at Twitter. The core idea to the approach is to interpret users' actions in a way that is consistent with their stated, reflective preferences. Finally, I will discuss lessons learned and open questions for designing algorithms that support a plurality of viewpoints, with an emphasis on the emerging paradigm of bridging-based ranking.
Read more

Causethical ML: Promises & Challenges

Isabel Valera Fachrichtung Informatik - Saarbrücken
11 Oct 2023, 12:15 pm - 1:15 pm
Saarbrücken building E1 5, room 002
Joint Lecture Series
In this talk I will give an overview of the role of causality in ethical machine learning, and in particular, in fair and explainable ML. In particular, I will first detail how to use causal reasoning to  study fairness and interpretability problems in algorithmic decision making, stressing the main  limitations that we encounter when aiming to address these problems in practice.  Then, I will provide some hints about how to solve some of these limitations.
In this talk I will give an overview of the role of causality in ethical machine learning, and in particular, in fair and explainable ML. In particular, I will first detail how to use causal reasoning to  study fairness and interpretability problems in algorithmic decision making, stressing the main  limitations that we encounter when aiming to address these problems in practice.  Then, I will provide some hints about how to solve some of these limitations.

Software Engineering for Data Intensive Scalable Computing and Heterogeneous Computing

Miryung Kim UCLA
28 Sep 2023, 10:30 am - 11:30 am
Kaiserslautern building G26, room 111
simultaneous videocast to Saarbrücken building E1 5, room 029
SWS Distinguished Lecture Series
With the development of big data, machine learning, and AI, existing software engineering techniques must be re-imagined to provide the productivity gains that developers desire. Furthermore, specialized hardware accelerators like GPUs or FPGAs have become a prominent part of the current computing landscape. However, developing heterogeneous applications is limited to a small subset of programmers with specialized hardware knowledge. To improve productivity and performance for data-intensive and compute-intensive development, now is the time that the software engineering community should design new waves of refactoring, ...
With the development of big data, machine learning, and AI, existing software engineering techniques must be re-imagined to provide the productivity gains that developers desire. Furthermore, specialized hardware accelerators like GPUs or FPGAs have become a prominent part of the current computing landscape. However, developing heterogeneous applications is limited to a small subset of programmers with specialized hardware knowledge. To improve productivity and performance for data-intensive and compute-intensive development, now is the time that the software engineering community should design new waves of refactoring, testing, and debugging tools for big data analytics and heterogeneous application development.

In this talk, we overview software development challenges in this new data-intensive scalable computing and heterogeneous computing domain. We describe examples of automated software engineering (debugging, testing, and refactoring) techniques that target this new domain and share lessons learned from building these techniques.
Read more

Robust and Equitable Uncertainty Estimation

Aaron Roth University of Pennsylvania
27 Sep 2023, 2:00 pm - 3:00 pm
Virtual talk
SWS Distinguished Lecture Series
Machine learning provides us with an amazing set of tools to make predictions, but how much should we trust particular predictions? To answer this, we need a way of estimating the confidence we should have in particular predictions of black-box models. Standard tools for doing this give guarantees that are averages over predictions. For instance, in a medical application, such tools might paper over poor performance on one medically relevant demographic group if it is made up for by higher performance on another group. ...
Machine learning provides us with an amazing set of tools to make predictions, but how much should we trust particular predictions? To answer this, we need a way of estimating the confidence we should have in particular predictions of black-box models. Standard tools for doing this give guarantees that are averages over predictions. For instance, in a medical application, such tools might paper over poor performance on one medically relevant demographic group if it is made up for by higher performance on another group. Standard methods also depend on the data distribution being static — in other words, the future should be like the past.

In this talk, I will describe new techniques to address both these problems: a way to produce prediction sets for arbitrary black-box prediction methods that have correct empirical coverage even when the data distribution might change in arbitrary, unanticipated ways and such that we have correct coverage even when we zoom in to focus on demographic groups that can be arbitrary and intersecting. When we just want correct group-wise coverage and are willing to assume that the future will look like the past, our algorithms are especially simple.

This talk is based on two papers, that are joint work with Osbert Bastani, Varun Gupta, Chris Jung, Georgy Noarov, and Ramya Ramalingam.

Please contact the office team for link information
Read more

AI as a resource: strategy, uncertainty, and societal welfare

Kate Donahue Cornell University
27 Sep 2023, 10:00 am - 11:00 am
Kaiserslautern building G26, room 111
simultaneous videocast to Kaiserslautern building E1 5, room 029
SWS Colloquium
In recent years, humanity has been faced with a new resource - artificial intelligence. AI can be a boon to society, or can also have negative impacts, especially with inappropriate use. My research agenda studies the societal impact of AI, particularly focusing on AI as a resource and on the strategic decisions that agents make in deciding how to use it. In this talk, I will consider some of the key strategic questions that arise in this framework: the decisions that agents make in jointly constructing and sharing AI models, ...
In recent years, humanity has been faced with a new resource - artificial intelligence. AI can be a boon to society, or can also have negative impacts, especially with inappropriate use. My research agenda studies the societal impact of AI, particularly focusing on AI as a resource and on the strategic decisions that agents make in deciding how to use it. In this talk, I will consider some of the key strategic questions that arise in this framework: the decisions that agents make in jointly constructing and sharing AI models, and the decisions that they make in dividing tasks between their own expertise and the expertise of a model. The first of these questions has motivated my work on "model-sharing games", which models scenarios such as federated learning or data cooperatives. In this setting, we view agents with data as game-theoretic players and analyze questions of stability, optimality, and fairness (https://arxiv.org/abs/2010.00753, https://arxiv.org/abs/2106.09580, https:// arxiv.org/abs/2112.00818). Secondly, I will describe some of my ongoing work in modeling human-algorithm collaboration. In particular, I will describe work on best-item recovery in categorical prediction, showing how differential accuracy rates and anchoring on algorithmic suggestions can influence overall performance (https://arxiv.org/abs/2308.11721).
Read more

Privacy Auditing and Protection in Large Language Models

Fatemehsadat Mireshghallah University of Washington
18 Sep 2023, 10:00 am - 11:30 am
Saarbrücken building E1 5, room 029
simultaneous videocast to Kaiserslautern building G26, room 112
SWS Colloquium
Large language Models (LLMs, e.g., GPT-3, OPT, TNLG,…) are shown to have a remarkably high performance on standard benchmarks, due to their high parameter count, extremely large training datasets, and significant compute. Although the high parameter count in these models leads to more expressiveness, it can also lead to higher memorization, which, coupled with large unvetted, web-scraped datasets can cause different negative societal and ethical impacts such as leakage of private, sensitive information and generation of harmful text. ...
Large language Models (LLMs, e.g., GPT-3, OPT, TNLG,…) are shown to have a remarkably high performance on standard benchmarks, due to their high parameter count, extremely large training datasets, and significant compute. Although the high parameter count in these models leads to more expressiveness, it can also lead to higher memorization, which, coupled with large unvetted, web-scraped datasets can cause different negative societal and ethical impacts such as leakage of private, sensitive information and generation of harmful text. In this talk, we will go over how these issues affect the trustworthiness of LLMs, and zoom in on how we can measure the leakage and memorization of these models, and mitigate it through differentially private training. Finally we will discuss what it would actually mean for LLMs to be privacy preserving, and what are the future research directions on making large models trustworthy.
Read more

Next-Generation Optical Networks for Machine Learning Jobs

Manya Ghobadi MIT
04 Sep 2023, 10:00 am - 11:00 am
Kaiserslautern building G26, room 111
SWS Distinguished Lecture Series
In this talk, I will explore three elements of designing next-generation machine learning systems: congestion control, network topology, and computation frequency. I will show that fair sharing, the holy grail of congestion control algorithms, is not necessarily desirable for deep neural network training clusters. Then I will introduce a new optical fabric that optimally combines network topology and parallelization strategies for machine learning training clusters. Finally, I will demonstrate the benefits of leveraging photonic computing systems for real-time, ...
In this talk, I will explore three elements of designing next-generation machine learning systems: congestion control, network topology, and computation frequency. I will show that fair sharing, the holy grail of congestion control algorithms, is not necessarily desirable for deep neural network training clusters. Then I will introduce a new optical fabric that optimally combines network topology and parallelization strategies for machine learning training clusters. Finally, I will demonstrate the benefits of leveraging photonic computing systems for real-time, energy-efficient inference via analog computing. I will discuss that pushing the frontiers of optical networks for machine learning workloads will enable us to fully harness the potential of deep neural networks and achieve improved performance and scalability.
Read more

Inferring the 3D World from Incomplete Observations: Representations and Data Priors

Jan Eric Lenssen MPI-INF - D2
05 Jul 2023, 12:15 pm - 1:15 pm
Saarbrücken building E1 5, room 002
Joint Lecture Series
Computer Vision has become increasingly capable of representing the 3D world, given large sets of dense and complete observations, such as images or 3D scans. However, in comparison with humans, we still lack an important ability: deriving 3D representations from just a few, incomplete 2D observations. Replicating this ability might be the next important step towards a general vision system. A key aspect of the human abilities is that observations are complemented by previously learned information: the world is not only sensed - to a large degree it is inferred. ...
Computer Vision has become increasingly capable of representing the 3D world, given large sets of dense and complete observations, such as images or 3D scans. However, in comparison with humans, we still lack an important ability: deriving 3D representations from just a few, incomplete 2D observations. Replicating this ability might be the next important step towards a general vision system. A key aspect of the human abilities is that observations are complemented by previously learned information: the world is not only sensed - to a large degree it is inferred. The common way to approach this task with deep learning are data priors, which capture information present in large datasets and which are used to perform inference from novel observations. This talk will discuss 3D data priors and their important connection to 3D representations. Choosing the right representation, we can have abstract control over which information is learned from data and how we can use it during inference, which leads to more effective solutions than simply learning everything end-to-end. Thus, the focus of my research and this talk will be on representations with important properties, such as data efficiency and useful equi- and invariances, which enable the formulation of sophisticated, task-specific data priors. These presented concepts are showcased on examples from my and collaborating groups, e.g., as data priors for reconstructing objects or object interaction sequences from incomplete observations.
Read more

On Synthesizability of Skolem Functions in First-Order Theories

Supratik Chakraborty IIT Bombay
20 Jun 2023, 3:00 pm - 4:00 pm
Kaiserslautern building G26, room 207
simultaneous videocast to Saarbrücken building E1 5, room 029
SWS Colloquium
Given a sentence $\forall X \exists Y \varphi(X, Y)$ in a first-order theory, it is well-known that there exists a function $F(X)$ for $Y$ in $\varphi$ such that $\exists Y \varphi(X, Y)\leftrightarrow \varphi(X, F(X))$ holds for all values of the universal variables X. Such a function is also called a Skolem function, in honour of Thoralf Skolem who first made us of this in proving what are now known as the Lowenheim-Skolem theorems. The existence of a Skolem function for a given formula is technically analogous to the Axiom of Choice -- it doesn't give us any any hint about how to compute the function, ...
Given a sentence $\forall X \exists Y \varphi(X, Y)$ in a first-order theory, it is well-known that there exists a function $F(X)$ for $Y$ in $\varphi$ such that $\exists Y \varphi(X, Y)\leftrightarrow \varphi(X, F(X))$ holds for all values of the universal variables X. Such a function is also called a Skolem function, in honour of Thoralf Skolem who first made us of this in proving what are now known as the Lowenheim-Skolem theorems. The existence of a Skolem function for a given formula is technically analogous to the Axiom of Choice -- it doesn't give us any any hint about how to compute the function, although we know such a function exists. Nevertheless, since Skolem functions are often very useful in practical applications (like finding a strategy for a reactive controller), we investigate when is it possible to algorithmically construct a Turing machine that computes a Skolem function for a given first-order formula. We show that under fairly relaxed conditions, this cannot be done. Does this mean the end of the road for automatic synthesis of Skolem functions? Fortunately, no. We show model-theoretic necessary and sufficient condition for the existence and algorithmic synthesizability of Turing machines implementing Skolem functions. We show that several useful first-order theories satisfy these conditions, and hence admit algorithms that can synthesize Turing machines implementing Skolem functions. We conclude by presenting several open problems in this area.
Read more

Why do large language models align with human brains: insights, opportunities, and challenges

Mariya Toneva Max Planck Institute for Software Systems
07 Jun 2023, 12:15 pm - 1:15 pm
Saarbrücken building E1 5, room 002
Joint Lecture Series
Language models that have been trained to predict the next word over billions of text documents have been shown to also significantly predict brain recordings of people comprehending language. Understanding the reasons behind the observed similarities between language in machines and language in the brain can lead to more insight into both systems. In this talk, we will discuss a series of recent works that make progress towards this question along different dimensions. The unifying principle among these works that allows us to make scientific claims about why one black box (language model) aligns with another black box (the human brain) is our ability to make specific perturbations in the language model and observe their effect on the alignment with the brain. ...
Language models that have been trained to predict the next word over billions of text documents have been shown to also significantly predict brain recordings of people comprehending language. Understanding the reasons behind the observed similarities between language in machines and language in the brain can lead to more insight into both systems. In this talk, we will discuss a series of recent works that make progress towards this question along different dimensions. The unifying principle among these works that allows us to make scientific claims about why one black box (language model) aligns with another black box (the human brain) is our ability to make specific perturbations in the language model and observe their effect on the alignment with the brain. Building on this approach, these works reveal that the observed alignment is due to more than next-word prediction and word-level semantics and is partially related to joint processing of select linguistic information in both systems. Furthermore, we find that the brain alignment can be improved by training a language model to summarize narratives. Taken together, these works make progress towards determining the sufficient and necessary conditions under which language in machines aligns with language in the brain.
Read more

Scalable and Sustainable Data-Intensive Systems

Bo Zhao Lecturer Assistant Professor in Computer Science at Queen Mary University of London and an Honorary Research Fellow at Imperial College London
25 May 2023, 4:00 pm - 5:00 pm
Saarbrücken building E1 5, room 002
simultaneous videocast to Kaiserslautern building G26, room 207
SWS Colloquium
Efficient data-intensive systems translate data into value for decision making. As data is collected at unprecedented rates for timely analysis, the model-centric paradigm of machine learning (ML) is shifting towards a data-centric and system-centric paradigm. Recent breakthroughs in large ML models (e.g., GPT 4 and ChatGPT) and the remarkable outcomes of reinforcement learning (e.g., AlphaFold and AlphaCode) have shown that scalable data management and its optimizations are critical to obtain state-of-the-art performance. This talk aims to answer the question "how to co-design multiple layers of the software/system stack to improve scalability, ...
Efficient data-intensive systems translate data into value for decision making. As data is collected at unprecedented rates for timely analysis, the model-centric paradigm of machine learning (ML) is shifting towards a data-centric and system-centric paradigm. Recent breakthroughs in large ML models (e.g., GPT 4 and ChatGPT) and the remarkable outcomes of reinforcement learning (e.g., AlphaFold and AlphaCode) have shown that scalable data management and its optimizations are critical to obtain state-of-the-art performance. This talk aims to answer the question "how to co-design multiple layers of the software/system stack to improve scalability, performance, and energy efficiency of ML and data-intensive systems". It addresses the challenges to build fully automated data-intensive systems that integrate the ML layer, the data management layer, and the compilation-based optimization layer. Finally, this talk will sketch and explore the vision to leverage the computational advantage of quantum computing on hybrid classic/quantum systems in the post-Moore era.

Please contact office for zoom link information.
Read more

A Generic Solution to Register-bounded Synthesis for Systems over Data words

Léo Exibard Icelandic Centre of Excellence in Theoretical Computer Science at Reykjavik University,
12 May 2023, 1:30 pm - 2:45 pm
Kaiserslautern building G26, room 111
SWS Colloquium
In this talk, we consider synthesis of reactive systems interacting with environments using an infinite data domain. A popular formalism for specifying and modelling those systems is register automata and transducers. They extend finite-state automata by adding registers to store data values and to compare the incoming data values against stored ones. Synthesis from nondeterministic or universal register automata is undecidable in general. However, its register-bounded variant, where additionally a bound on the number of registers in a sought transducer is given, ...
In this talk, we consider synthesis of reactive systems interacting with environments using an infinite data domain. A popular formalism for specifying and modelling those systems is register automata and transducers. They extend finite-state automata by adding registers to store data values and to compare the incoming data values against stored ones. Synthesis from nondeterministic or universal register automata is undecidable in general. However, its register-bounded variant, where additionally a bound on the number of registers in a sought transducer is given, is known to be decidable for universal register automata which can compare data for equality, i.e., for data domain (N,=).

After briefly reviewing this result, we extend it to the domain (N,<) of natural numbers with linear order. Our solution is generic: we define a sufficient condition on data domains (regular approximability) for decidability of register-bounded synthesis. It allows one to use simple language-theoretic arguments and avoid technical game-theoretic reasoning. Further, by defining a generic notion of reducibility between data domains, we show the decidability of synthesis in the domain (N^d,<^d) of tuples of numbers equipped with the component-wise partial order and in the domain (Σ*, ≺) of finite strings with the prefix relation.
Read more

Making monkeys and ducks behave with Crystal Lang

Beta Ziliani Manas.Tech and FAMAF, Universidad Nacional de CórdobaArgentina
04 May 2023, 3:00 pm - 4:00 pm
Saarbrücken building E1 5, room 002
simultaneous videocast to Kaiserslautern building G26, room 111
SWS Colloquium
In the zoo of programming languages there are two cute yet rather misbehaved animals, typically found in the Dynamic Languages section: Duck Typing and Monkey Patching.

Duck Typing is hardly seen. You hear a "quack!", but you can’t easily tell if it’s coming form an actual duck, a parrot, or a recording. Monkey Patching, like the name suggests, patches any existing creature to change their behavior. It can even make a dog quack!

While these two animals bring lots of joy, ...
In the zoo of programming languages there are two cute yet rather misbehaved animals, typically found in the Dynamic Languages section: Duck Typing and Monkey Patching.

Duck Typing is hardly seen. You hear a "quack!", but you can’t easily tell if it’s coming form an actual duck, a parrot, or a recording. Monkey Patching, like the name suggests, patches any existing creature to change their behavior. It can even make a dog quack!

While these two animals bring lots of joy, they are also quite dangerous when used in the wild, as they can bring unexpected behavior to the rest of the creatures.

Crystal is a rarity among Static Languages in that it has Duck Typing and Monkey Patching. Given the strong —yet barely visible— fences of types, it manages to properly contain these beasts. In this talk I will present Crystal and provide a glimpse at how it manages to feel so dynamic.

Please contact office for zoom link information.
Read more

Digital Humans: From Sensor Measurements to Deeper Understanding and Synthesis

Marc Habermann MPI-INF - D6
03 May 2023, 12:15 pm - 1:15 pm
Saarbrücken building E1 5, room 002
Joint Lecture Series
The earliest paintings depicting a human date back to the Stone Age. Since that, sensing devices have been improved and nowadays, digital sensing can be found in everyone’s home. Nonetheless, the main element in many image compositions still is the human, i.e. most of the images one finds in the media, such as on the Internet or in textbooks and magazines, contain humans as the main point of attention. Since sensing of humans became more accurate, ...
The earliest paintings depicting a human date back to the Stone Age. Since that, sensing devices have been improved and nowadays, digital sensing can be found in everyone’s home. Nonetheless, the main element in many image compositions still is the human, i.e. most of the images one finds in the media, such as on the Internet or in textbooks and magazines, contain humans as the main point of attention. Since sensing of humans became more accurate, automated, affordable, and most importantly digital, it comes along with many opportunities in downstream applications such as telepresence, AR/VR, and health care, to only name a few. At the same time, this development also introduces major challenges, since raw sensing measurements cannot be immediately processed by those applications. Instead, it requires algorithms, which are capable of automatically analyzing human-related information and even synthesizing new content. In this talk, I will present some of our recent methods on the analysis and synthesis (rendering) of humans from digital measurements on the basis of Graphics, Vision, and Machine Learning concepts.
Read more

2vyper: Contracts for Smart Contracts

Alexander J. Summers University of British Columbia
27 Apr 2023, 4:00 pm - 5:00 pm
Saarbrücken building E1 5, room 002
simultaneous videocast to Kaiserslautern building G26, room 111
SWS Colloquium
Smart contract languages are increasingly popular and numerous, and their programming models and challenges are somewhat unusual. The ubiquitous presence of untrusted external code in such a system makes classical contracts unsuitable for safety verification, while the intentional presence of (potentially-mutating) callbacks via unknown code makes standard static analysis techniques imprecise in general. On the other hand, smart contract languages such as Vyper (for Ethereum) tightly encapsulate direct access to the program's state. In this talk I'll present a methodology for expressing contracts for this language, ...
Smart contract languages are increasingly popular and numerous, and their programming models and challenges are somewhat unusual. The ubiquitous presence of untrusted external code in such a system makes classical contracts unsuitable for safety verification, while the intentional presence of (potentially-mutating) callbacks via unknown code makes standard static analysis techniques imprecise in general. On the other hand, smart contract languages such as Vyper (for Ethereum) tightly encapsulate direct access to the program's state. In this talk I'll present a methodology for expressing contracts for this language, in a way that supports sound verification of safety properties, with deductive verification tooling (converting Vyper to Viper) to automate the corresponding proofs.

Based on joint work with Christian Bräm, Marco Eilers, Peter Müller and Robin Sierra; see also the accompanying paper at OOPSLA 2021. --- Please contact Office for Zoom link information.
Read more

Automating cryptographic code generation

Yuval Yarom Ruhr University Bochum
24 Apr 2023, 10:00 am - 11:00 am
Saarbrücken building E1 5, room 029
simultaneous videocast to Kaiserslautern building G26, room 111
SWS Colloquium
Cryptography provides the data protection mechanisms that underlie security and privacy in the modern connected world. Given this pivotal role, implementations of cryptographic code must not only be correct, but also meet stringent performance and security requirements. Achieving these aims is often difficult and requires significant investment in software development and manual tuning.

This talk presents two approaches for automating the task of generating correct, secure, and efficient cryptographic code. The first, Rosita, uses a power consumption emulator to detect unintended leaky interactions between values in the microarchitecture. ...
Cryptography provides the data protection mechanisms that underlie security and privacy in the modern connected world. Given this pivotal role, implementations of cryptographic code must not only be correct, but also meet stringent performance and security requirements. Achieving these aims is often difficult and requires significant investment in software development and manual tuning.

This talk presents two approaches for automating the task of generating correct, secure, and efficient cryptographic code. The first, Rosita, uses a power consumption emulator to detect unintended leaky interactions between values in the microarchitecture. It then rewrites the code to eliminate these interactions and produce code that is resistant to power analysis. The second, CryptOpt, uses evolutionary computation to search for the most efficient constant-time implementation of a cryptographic function. It then formally verifies that the produced implementation is semantically equivalent to the original code.

Rosita is a joint work with Lejla Batina, Łukasz Chmielewski, Francesco Regazzoni, Niels Samwel, Madura A. Shelton, and Markus Wagner.CryptOpt is a joint work with Adam Chlipala, Chitchanok Chuengsatiansup, Owen Conoly, Andres Erbsen, Daniel Genkin, Jason Gross, Joel Kuepper, Chuyue Sun, Samuel Tian, Markus Wagner, and David Wu.

Please contact office for zoom link information.
Read more

Faster Approximate String Matching: Now with up to 500 Errors

Philip Wellnitz MPI-INF - D1
05 Apr 2023, 12:15 pm - 1:15 pm
Saarbrücken building E1 5, room 002
Joint Lecture Series
String matching algorithms are everywhere. You use them every day, perhaps even without noticing. Whenever your mail client filters out spam, whenever your spell checker flags a spelling mistake, whenever your search engine of choice presents you with results it deems useful, to list just some examples. For many applications, we need more than finding exact occurrences of a pattern string in a text. Hence, in this talk, I consider the classical problem of finding parts of a text that are close to a given pattern string. ...
String matching algorithms are everywhere. You use them every day, perhaps even without noticing. Whenever your mail client filters out spam, whenever your spell checker flags a spelling mistake, whenever your search engine of choice presents you with results it deems useful, to list just some examples. For many applications, we need more than finding exact occurrences of a pattern string in a text. Hence, in this talk, I consider the classical problem of finding parts of a text that are close to a given pattern string.

In particular, I consider the problem of finding all parts of a given text that can be transformed to a given pattern string by inserting, deleting, or substituting at most k characters—the Approximate String Matching problem. The first algorithms for this problem were developed in the 1970s, some of them are relevant even today. After a long line of research, around the year 2000, the progress in obtaining faster algorithms for Approximate String Matching came to a halt—even though the interest did not fade.

In a recent series of papers, I and coauthors improved the more than 20-year-old state-of-the-art algorithms. In this talk, I highlight the new tools and techniques that we developed for Approximate String Matching. Further, I give an overview of additional applications of our techniques to settings where text and pattern are given in a compressed setting or where we allow text and pattern to change dynamically.
Read more

Automatically Detecting and Mitigating Issues in Program Analyzers

Numair Mansur Max Planck Institute for Software Systems
30 Mar 2023, 3:30 pm - 4:30 pm
Saarbrücken building G26, room 111
SWS Student Defense Talks - Thesis Defense
In recent years, the formal methods community has made significant progress towards the development of industrial-strength static analysis tools that can check properties of real-world production code. Such tools can help developers detect potential bugs and security vulnerabilities in critical software before deployment. While the potential benefits of static analysis tools are clear, their usability and effectiveness in mainstream software development workflows often comes into question and can prevent software developers from using these tools to their full potential. ...
In recent years, the formal methods community has made significant progress towards the development of industrial-strength static analysis tools that can check properties of real-world production code. Such tools can help developers detect potential bugs and security vulnerabilities in critical software before deployment. While the potential benefits of static analysis tools are clear, their usability and effectiveness in mainstream software development workflows often comes into question and can prevent software developers from using these tools to their full potential. In this dissertation, we focus on two major challenges that can prevent them from being incorporated into software development workflows. The first challenge is unintentional unsoundness. Static program analyzers are complicated tools, implementing sophisticated algorithms and performance heuristics. This complexity makes them highly susceptible to undetected unintentional soundness issues. Soundness issues in program analyzers can cause false negatives and have disastrous consequences, e.g. when analysing  safety- critical software. In this dissertation, we present novel techniques to detect unintentional unsoundness bugs in two foundational program analysis tools, namely SMT solvers and Datalog engines. These tools are used extensively by the formal methods community, for instance, in software verification, systematic testing, and program synthesis. We implemented these techniques as easy-to-use open source tools that are publicly available on Github. With the proposed techniques, we were able to detect more than 55 unique and confirmed critical soundness bugs in popular and widely used SMT solvers and Datalog engines in only a few months of testing. The second challenge is finding the right balance between soundness, precision, and performance. In an ideal world, a static analyzer should be as precise as possible while maintaining soundness and being sufficiently fast. However, to overcome undecidability and performance issues, these tools have to employ a variety of techniques to be practical, for example, compromising on the soundness of the analysis or approximating code behaviour.  Moreover, these tools typically don’t scale to large industrial code bases containing millions of lines of code. All of these factors make it extremely challenging to get the most out of these analyzers and integrate them into everyday development activities, especially for average software development teams with little to no knowledge or understanding of advanced static analysis techniques. In this dissertation we present an approach to automatically tailor a static analyzer to the code under analysis and any given resource constraints. We implemented our technique as an open source framework, which is publicly available on Github. The second contribution of this dissertation in this challenge area is a technique to horizontally scale analysis tools in cloud-based static analysis platforms by splitting the input to the analyzer into partitions and analyzing the partitions independently. The technique was developed in collaboration with Amazon Web Services and is now being used in production in their CodeGuru service.
Read more

Statistical inference with privacy and computational constraints

Maryam Aliakbarpour Boston University and Northeastern University
02 Mar 2023, 9:30 am - 10:30 am
Kaiserslautern building G26, room 111
simultaneous videocast to Saarbrücken building E1 5, room 002
SWS Colloquium
The vast amount of digital data we create and collect has revolutionized many scientific fields and industrial sectors. Yet, despite our success in harnessing this transformative power of data, computational and societal trends emerging from the current practices of data science necessitate upgrading our toolkit for data analysis. In this talk, we discuss how practical considerations such as privacy and memory limits affect statistical inference tasks. In particular, we focus on two examples: First, we consider hypothesis testing with privacy constraints. ...
The vast amount of digital data we create and collect has revolutionized many scientific fields and industrial sectors. Yet, despite our success in harnessing this transformative power of data, computational and societal trends emerging from the current practices of data science necessitate upgrading our toolkit for data analysis. In this talk, we discuss how practical considerations such as privacy and memory limits affect statistical inference tasks. In particular, we focus on two examples: First, we consider hypothesis testing with privacy constraints. More specifically, how one can design an algorithm that tests whether two data features are independent or correlated with a nearly-optimal number of data points while preserving the privacy of the individuals participating in the data set. Second, we study the problem of entropy estimation of a distribution by streaming over i.i.d. samples from it. We determine how bounded memory affects the number of samples we need to solve this problem. Please contact office for zoom link information
Read more

Societal Computing

Ingmar Weber Fachrichtung Informatik - Saarbrücken
01 Mar 2023, 12:15 pm - 1:15 pm
Saarbrücken building E1 5, room 002
Joint Lecture Series
Technology and new communication tools have permeated all aspects of our lives. From finding a romantic partner to booking the next holiday - our choices and transactions are increasingly mediated by digital tools and online platforms, leaving behind digital traces. Simultaneously, earth observation satellites provide a real-time window into global changes, letting us observe anything from urbanization to garden parties during covid-19 lockdowns. This "digitization of everything" creates a wide range of opportunities as well as challenges by letting us monitor and interact with society at scales and speeds never seen before. ...
Technology and new communication tools have permeated all aspects of our lives. From finding a romantic partner to booking the next holiday - our choices and transactions are increasingly mediated by digital tools and online platforms, leaving behind digital traces. Simultaneously, earth observation satellites provide a real-time window into global changes, letting us observe anything from urbanization to garden parties during covid-19 lockdowns. This "digitization of everything" creates a wide range of opportunities as well as challenges by letting us monitor and interact with society at scales and speeds never seen before. Societal Computing tries to make use of these afforded opportunities by advancing computational research on both (i) computing _of_  society, i.e., measuring, describing, and understanding societal phenomena, and (ii) computing _for_ society, i.e., working with partners on digitally assisted interventions to address societal challenges.   In this talk, I’ll present examples of Societal Computing work done by myself and interdisciplinary collaborators in the domain of international development and humanitarian response. Apart from highlighting advances in data innovation, I’ll describe real-world ethical challenges related to fairness, such as "leaving behind" the most vulnerable during crisis response, and risks of abuse, such as mapping undocumented migrants. I’ll close by outlining plans for an Interdisciplinary Institute for Societal Computing (IDISC) with the goal of strengthening research collaborations between computer science and both the social sciences and humanities.
Read more

The Power of Feedback in a Cyber-Physical World

Dr. Anne-Kathrin Schmuck Max Planck Institute for Software Systems
28 Feb 2023, 9:30 am - 10:30 am
Kaiserslautern building G26, room 111
simultaneous videocast to Saarbrücken building E1 5, room 029
SWS Colloquium
Feedback allows systems to seamlessly and instantaneously adapt their behavior to their environment and is thereby the fundamental principle of life and technology -- it lets animals breathe, it stabilizes the climate, it allows airplanes to fly, and the energy grid to operate. During the last century, control technology excelled at using this power of feedback to engineer extremely stable, robust, and reliable technological systems.

With the ubiquity of computing devices in modern technological systems, ...
Feedback allows systems to seamlessly and instantaneously adapt their behavior to their environment and is thereby the fundamental principle of life and technology -- it lets animals breathe, it stabilizes the climate, it allows airplanes to fly, and the energy grid to operate. During the last century, control technology excelled at using this power of feedback to engineer extremely stable, robust, and reliable technological systems.

With the ubiquity of computing devices in modern technological systems, feedback loops become cyber-physical -- the laws of physics governing technological, social or biological processes interact with (cyber) computing systems in a highly nontrivial manner, pushing towards higher and higher levels of autonomy and self-regulation. While reliability of these systems remains of utmost importance, a fundamental understanding of cyber-physical feedback loops for large-scale CPS is lacking far behind.

In this talk I will discuss how a control-inspired view on formal methods for reliable software design enables us to utilize the power of feedback for robust and reliable self-adaptation in cyber-physical system design.

Please contact office team for link information.
Read more

Fusing AI and Formal Methods for Automated Synthesis

Priyanka Golia NUS, Singapore and IIT Kanpur
23 Feb 2023, 9:30 am - 10:30 am
Kaiserslautern building G26, room 111
SWS Colloquium
We entrust large parts of our daily lives to computer systems, which are becoming increasingly more complex. Developing scalable yet trustworthy techniques for designing and verifying such systems is an important problem. In this talk, our focus will be on automated synthesis, a technique that uses formal specifications to automatically generate systems (such as functions, programs, or circuits) that provably satisfy the requirements of the specification. I will introduce a state-of-the-art synthesis algorithm that leverages artificial intelligence to provide an initial guess for the system, ...
We entrust large parts of our daily lives to computer systems, which are becoming increasingly more complex. Developing scalable yet trustworthy techniques for designing and verifying such systems is an important problem. In this talk, our focus will be on automated synthesis, a technique that uses formal specifications to automatically generate systems (such as functions, programs, or circuits) that provably satisfy the requirements of the specification. I will introduce a state-of-the-art synthesis algorithm that leverages artificial intelligence to provide an initial guess for the system, and then uses formal methods to repair and verify the guess to synthesize probably correct system. I will conclude by exploring the potential for combining AI and formal methods to address real-world scenarios. Please contact the office team for link information.
Read more

Learning for Decision Making: A Tale of Complex Human Preferences

Leqi Liu Carnegie Mellon University
14 Feb 2023, 2:00 pm - 3:00 pm
Virtual talk
SWS Colloquium
Machine learning systems are deployed in diverse decision-making settings in service of stakeholders characterized by complex preferences. For example, in healthcare and finance, we ought to account for various levels of risk tolerance; and in personalized recommender systems, we face users whose preferences evolve dynamically over time. Building systems better aligned with stakeholder needs requires that we take the rich nature of human preferences into account. In this talk, I will give an overview of my research on the statistical and algorithmic foundations for building such human-centered machine learning systems. ...
Machine learning systems are deployed in diverse decision-making settings in service of stakeholders characterized by complex preferences. For example, in healthcare and finance, we ought to account for various levels of risk tolerance; and in personalized recommender systems, we face users whose preferences evolve dynamically over time. Building systems better aligned with stakeholder needs requires that we take the rich nature of human preferences into account. In this talk, I will give an overview of my research on the statistical and algorithmic foundations for building such human-centered machine learning systems. First, I will present a line of work that draws inspiration from the economics literature to develop learning algorithms that account for the risk preferences of stakeholders. Subsequently, I will discuss a line of work that draws insights from the psychology literature to develop online learning algorithms for personalized recommender systems that account for users’ evolving preferences. Please contact the office team for link information
Read more

Programmatic Interfaces for Design and Simulation

Aman Mathur Max Planck Institute for Software Systems
09 Feb 2023, 1:30 pm - 2:30 pm
Kaiserslautern building G26, room 111
SWS Student Defense Talks - Thesis Defense
Although Computer Aided Design (CAD) and simulation tools have been around for quite some time, modern design and prototyping pipelines are challenging the limits of these tools. Advances in 3D printing have brought manufacturing capability to the general public. Moreover, advancements in machine learning and sensor technology are enabling enthusiasts and small companies to develop their own autonomous vehicles and machines. This means that many more users are designing (or customizing) 3D objects in CAD, and many are testing autonomous machines in simulation. ...
Although Computer Aided Design (CAD) and simulation tools have been around for quite some time, modern design and prototyping pipelines are challenging the limits of these tools. Advances in 3D printing have brought manufacturing capability to the general public. Moreover, advancements in machine learning and sensor technology are enabling enthusiasts and small companies to develop their own autonomous vehicles and machines. This means that many more users are designing (or customizing) 3D objects in CAD, and many are testing autonomous machines in simulation. Though Graphical User Interfaces (GUIs) are the de-facto standard for these tools, we find that these interfaces are not robust and flexible. For example, designs made using GUIs often break when customized, and setting up large simulations is quite tedious in GUI. Though programmatic interfaces do not suffer from these limitations, they are generally quite difficult to use, and often do not provide appropriate abstractions and language constructs.

In this thesis, we present our work on combining the ease of use of GUI, with the robustness and flexibility of programming. For CAD, we propose an interactive framework that automatically synthesizes robust programs from GUI-based design operations. Additionally, we apply program analysis to ensure customizations do not lead to invalid objects. Finally, for simulation, we propose a novel programmatic framework that simplifies the building of complex test environments, and a test generation mechanism that guarantees good coverage over test parameters.
Read more

Toward Deep Semantic Understanding: Event-Centric Multimodal Knowledge Acquisition

Manling Li University of Illinois Urbana Champaign
01 Feb 2023, 3:00 pm - 4:00 pm
Saarbrücken building E1 5, room 002
simultaneous videocast to Kaiserslautern building G26, room 111
SWS Colloquium
Please note that this is a virtual talk which will be video casted to Saarbrücken and Kaiserslautern.

Traditionally, multimodal information consumption has been entity-centric with a focus on concrete concepts (such as objects, object types, physical relations, e.g., a person in a car), but lacks ability to understand abstract semantics (such as events and semantic roles of objects, e.g., driver, passenger, mechanic). However, such event-centric semantics are the core knowledge communicated, regardless whether in the form of text, ...
Please note that this is a virtual talk which will be video casted to Saarbrücken and Kaiserslautern.

Traditionally, multimodal information consumption has been entity-centric with a focus on concrete concepts (such as objects, object types, physical relations, e.g., a person in a car), but lacks ability to understand abstract semantics (such as events and semantic roles of objects, e.g., driver, passenger, mechanic). However, such event-centric semantics are the core knowledge communicated, regardless whether in the form of text, images, videos, or other data modalities.

At the core of my research in Multimodal Information Extraction (IE) is to bring such deep semantic understanding ability to the multimodal world. My work opens up a new research direction Event-Centric Multimodal Knowledge Acquisition to transform traditional entity-centric single-modal knowledge into event-centric multi-modal knowledge. Such a transformation poses two significant challenges: (1) understanding multimodal semantic structures that are abstract (such as events and semantic roles of objects): I will present my solution of zero-shot cross-modal transfer (CLIP-Event), which is the first to model event semantic structures for vision-language pretraining, and supports zero-shot multimodal event extraction for the first time; (2) understanding long-horizon temporal dynamics: I will introduce Event Graph Model, which empowers machines to capture complex timelines, intertwined relations and multiple alternative outcomes. I will also show its positive results on long-standing open problems, such as timeline generation, meeting summarization, and question answering. Such Event-Centric Multimodal Knowledge starts the next generation of information access, which allows us to effectively access historical scenarios and reason about the future. I will lay out how I plan to grow a deep semantic understanding of language world and vision world, moving from concrete to abstract, from static to dynamic, and ultimately from perception to cognition. Please contact the Office team for Zoom link information.
Read more

Terabyte-Scale Genome Analysis for Underfunded Labs

Sven Rahmann MMCI; CISPA Helmholtz Center for Information Security;
01 Feb 2023, 12:15 pm - 1:15 pm
Saarbrücken building E1 5, room 002
Joint Lecture Series
In 2023, you can get your personal genome sequenced for under 1000 Euros. If you do, you will obtain the data (50G to 100G) on a USB stick or hard drive. The data consists of many short strings (length 100 or 150) over the famous DNA alphabet {A,C,G,T}, for a total of 50 to 100 billion letters.

With this personal data, you may want to learn about your ancestry, or judge your personal risk of getting various conditions, ...
In 2023, you can get your personal genome sequenced for under 1000 Euros. If you do, you will obtain the data (50G to 100G) on a USB stick or hard drive. The data consists of many short strings (length 100 or 150) over the famous DNA alphabet {A,C,G,T}, for a total of 50 to 100 billion letters.

With this personal data, you may want to learn about your ancestry, or judge your personal risk of getting various conditions, such as myocardial infarction, stroke, breast cancer, and others. So you must look for certain (known) DNA variations in your individual genome that are known to be associated to certain population groups or diseases. As the raw data ("reads") consists of essentially random sub-strings of the genome, it is necessary to find the place of origin of each read in the genome, an error-tolerant pattern search task. In a medium-scale research study (say, for heart disease), we have similar tasks for a few hundred individual patients and healthy control persons, for a total of roughly 30-50 TB of data, delivered on a few USB hard drives. After primary analysis, the task is to find genetic variants (or, more coarsely, genes) related to the disease, i.e., we have a pattern mining problem with millions of features and a few hundred samples. The full workflow for such a study consists of more than 100_000 single steps, including simple per-sample steps (e.g., removing low-quality reads), and complex ones, involving statistical models across all samples for variant calling. Particularly in a medical setting, each step needs to be fully reproducible, we need to trace data provenance and maintain a chain of accountability. In the past ten years, we have worked and contributed to many aspects of variant-calling workflows and realized that the strategy to attack the ever-growing data with ever-growing compute clusters and storage systems will not scale well in the near future. Thus, our current work focuses on so-called alignment-free methods, which have the potential to yield the same answers as current state-of-the-art methods with 10 to 100 times less CPU work. I will present our recent advances in laying better foundations for alignment-free methods: engineered and optimized parallel hash tables for short DNA pieces (k-mers), and the design masks for gapped k-mers with optimal error tolerance. These new methods will enable even small labs to analyze large genomics datasets on a "good gaming PC", while investing less than 5000 Euros into computational hardware. I will also advertise our workflow language and execution engine "Snakemake", a combination of Make and Python that is now one of the most frequently used Bioinformatics workflow management tools, but actually not restricted to Bioinformatics research.
Read more