Events 2010 – MPI SWS

The Hadoop++ Project

Jens Dittrich Fachrichtung Informatik - Saarbruecken

10 Dec 2010, 11:00 am - 12:00 pm

Saarbrücken building E1 5, room 5th floor
simultaneous videocast to Kaiserslautern building G26, room 206

SWS Colloquium

The Hadoop++ Project

MapReduce is a computing paradigm that has gained a lot of attention in recent years from industry and research. Unlike parallel DBMSs, MapReduce allows non-expert users to run complex analytical tasks over very large data sets on very large clusters and clouds. However, this comes at a price: MapReduce processes tasks in a scan-oriented fashion. Hence, the performance of Hadoop --- an open-source implementation of MapReduce --- often does not match the one of a well-configured parallel DBMS. ...

The Hadoop++ Project

MapReduce is a computing paradigm that has gained a lot of attention in recent years from industry and research. Unlike parallel DBMSs, MapReduce allows non-expert users to run complex analytical tasks over very large data sets on very large clusters and clouds. However, this comes at a price: MapReduce processes tasks in a scan-oriented fashion. Hence, the performance of Hadoop --- an open-source implementation of MapReduce --- often does not match the one of a well-configured parallel DBMS. We propose a new type of system named Hadoop++: it boosts task performance without changing the Hadoop framework at all. To reach this goal, rather than changing a working system (Hadoop), we inject our technology at the right places through UDFs only and affect Hadoop from inside. This has three important consequences: First, Hadoop++ significantly outperforms Hadoop. Second, any future changes of Hadoop may directly be used with Hadoop++ without rewriting any glue code. Third, Hadoop++ does not need to change the Hadoop interface. Our experiments show the superiority of Hadoop++ over both Hadoop and HadoopDB for tasks related to indexing and join processing. In this talk I will present results from a VLDB 2010 paper as well as more recent work.

Link: http://infosys.cs.uni-saarland.de/hadoop++.php

Logic, Policy, and Federation in the Cloud

Yuri Gurevich Microsoft Research, Redmond

29 Nov 2010, 11:00 am - 12:00 pm

Saarbrücken building E1 4, room 024
simultaneous videocast to Kaiserslautern building G26, room 206

SWS Distinguished Lecture Series

Imagine that you manage a public cloud. You want to attract lucrative customers but they worry that their data will not be secure in your cloud. Of course they can encode their data before putting it in the cloud and decode it upon removal but that doesn't buy much for them (or for you because your cloud is used just as a glorified blob store). How can you add value? Cryptographers have many tricks but few of them are feasible at this point; ...

Imagine that you manage a public cloud. You want to attract lucrative customers but they worry that their data will not be secure in your cloud. Of course they can encode their data before putting it in the cloud and decode it upon removal but that doesn't buy much for them (or for you because your cloud is used just as a glorified blob store). How can you add value? Cryptographers have many tricks but few of them are feasible at this point; most notably, searching on encrypted data with a single keyword is being considered. But maybe we shouldn't reinvent the wheel. How do enterprises interact in real world? Consider commerce for example. Buyers and sellers from very different (in geography, culture, political system) countries succeed in making mutually beneficial deals. The sellers get paid, and the buyers get their goods. How does it work? Well, there is an involved support system that developed from centuries of experience: banks issuing letters of credit, insurance companies that underwrite the transactions and transportation, etc. And numerous policies are enforced. Similarly, there is an involved support system that allows Big Pharma to conduct clinical trials that straddle multiple countries. And so on. Can we lift such support systems to the cloud scale and make them more efficient in the process? We believe that the answer is YES. An important ingredient of the desired solution is a high-level language for writing policies. As we mentioned above, numerous policies need to be enforced. They also need to be stated formally to allow automation, and they need to be high-level to allow comprehension and reasoning. Cryptography is indispensible in enforcing policies but first we need a language to formulate policies succinctly and to exchange them among autonomous parties. The Distributed Knowledge Authorization Language (DKAL) was created for such purposes. It required foundational logic investigation, and it is in the process of tech transfer. This lecture is a popular introduction to DKAL and its applications to doing business via public clouds.

SMOOTHIE: Scalable transaction processing in the cloud

Nitin Gupta Cornell University

23 Nov 2010, 11:00 am - 12:00 pm

Saarbrücken building E1 5, room 5th floor
simultaneous videocast to Kaiserslautern building G26, room 206

SWS Colloquium

In this talk, I will introduce SMOOTHIE, a new framework for transaction processing. SMOOTHIE is based on optimistic concurrency control separated into a control flow of scalable components that can each scale. In addition, each of these components is nearly stateless and thus can be easily scaled up or down on-demand. I will present the architecture of SMOOTHIE, explain how it achieves scalability, present thoughts on how to avoid bottlenecks through heat-based data placement, and discuss tradeoffs between different implementations of the components of SMOOTHIE. ...

In this talk, I will introduce SMOOTHIE, a new framework for transaction processing. SMOOTHIE is based on optimistic concurrency control separated into a control flow of scalable components that can each scale. In addition, each of these components is nearly stateless and thus can be easily scaled up or down on-demand. I will present the architecture of SMOOTHIE, explain how it achieves scalability, present thoughts on how to avoid bottlenecks through heat-based data placement, and discuss tradeoffs between different implementations of the components of SMOOTHIE.

Entangled queries: an abstraction for declarative data-driven coordination.

Lucja Kot Cornell University

12 Nov 2010, 11:00 am - 12:30 pm

Kaiserslautern building G26, room 206
simultaneous videocast to Saarbrücken building E1 5, room 5th floor

SWS Colloquium

Today's Web 2.0 users increasingly want to perform data-driven tasks that involve collaboration and coordination. Friends want to make joint travel plans, students want to enroll in courses together, and busy professionals want to coordinate their schedules. These tasks are surprisingly difficult to support using existing database system abstractions, as the coordination required is in direct conflict with isolation -- a key property of ACID transactions.

In the Youtopia project at Cornell, we believe that data-driven coordination is so pervasive that it deserves dedicated support through a clean, ...

Today's Web 2.0 users increasingly want to perform data-driven tasks that involve collaboration and coordination. Friends want to make joint travel plans, students want to enroll in courses together, and busy professionals want to coordinate their schedules. These tasks are surprisingly difficult to support using existing database system abstractions, as the coordination required is in direct conflict with isolation -- a key property of ACID transactions.

In the Youtopia project at Cornell, we believe that data-driven coordination is so pervasive that it deserves dedicated support through a clean, declarative abstraction. It is time to move beyond isolation and support declarative, data-driven coordination (D3C) as a fundamental mode of data management. In this talk, I will introduce entangled queries - a simple yet powerful abstraction and mechanism to enable D3C. These queries allow users to specify the coordination required, such as "I want to travel on the same flight as my friend". At runtime, the system performs the coordination ensuring the specifications are met. I will discuss the syntax, semantics and evaluation of these queries, as well as introducing a range of broader research challenges associated with designing for D3C.

Making Broadband Access Networks Transparent to Researchers, Developers, and Users

Marcel Dischinger Max Planck Institute for Software Systems

27 Oct 2010, 4:30 pm - 5:30 pm

Saarbrücken building E1 4, room 024

SWS Student Defense Talks - Thesis Defense

Broadband networks are used by hundreds of millions of users to connect to the Internet today. However, most ISPs are hesitant to reveal details about their network deployments, and as a result the characteristics of broadband networks are often not known to users, developers, and researchers. In this thesis, we make progress towards mitigating this lack of transparency in broadband access networks in two ways.

First, using novel measurement tools we performed the first large-scale study of the characteristics of broadband networks. ...

Broadband networks are used by hundreds of millions of users to connect to the Internet today. However, most ISPs are hesitant to reveal details about their network deployments, and as a result the characteristics of broadband networks are often not known to users, developers, and researchers. In this thesis, we make progress towards mitigating this lack of transparency in broadband access networks in two ways.

First, using novel measurement tools we performed the first large-scale study of the characteristics of broadband networks. We found that broadband networks have very different characteristics than academic networks. We also developed Glasnost, a system that enables users to test their Internet access links for traffic differentiation. Glasnost has been used by more than 350,000 users worldwide and allowed us to study ISPs' traffic management practices. We found that ISPs increasingly throttle or even block traffic from popular applications such as BitTorrent.

Second, we developed two new approaches to enable realistic evaluation of networked systems in broadband networks. We developed Monarch, a tool that enables researchers to study and compare the performance of new and existing transport protocols at large scale in broadband environments. Furthermore, we designed SatelliteLab, a novel testbed that can easily add arbitrary end nodes, including broadband nodes and even smartphones, to existing testbeds like PlanetLab.

Practical memory safety for C

Periklis Akritidis University of Cambridge

06 Sep 2010, 11:00 am - 12:00 pm

Kaiserslautern building G26, room 206
simultaneous videocast to Saarbrücken building E1 5, room 5th floor

SWS Colloquium

C facilitates high performance execution and low-level systems programming, but the lack of memory safety undermines security and reliability; for example, memory corruption bugs can breach security, and faults in kernel extensions can bring down the entire operating system. Memory safe languages, however, are unlikely to displace C in the near future, and solutions currently in use offer inadequate protection. Comprehensive proposals, on the other hand, are either too slow for practical use, or break backwards compatibility by requiring source code porting or generating incompatible binary code. ...

C facilitates high performance execution and low-level systems programming, but the lack of memory safety undermines security and reliability; for example, memory corruption bugs can breach security, and faults in kernel extensions can bring down the entire operating system. Memory safe languages, however, are unlikely to displace C in the near future, and solutions currently in use offer inadequate protection. Comprehensive proposals, on the other hand, are either too slow for practical use, or break backwards compatibility by requiring source code porting or generating incompatible binary code. My talk will present backwards-compatible solutions to prevent dangerous memory corruption in C programs at a low cost.

Effective scheduling techniques for high-level parallel-programming languages

Mike Rainey University of Chicago

17 Aug 2010, 11:00 am - 12:00 pm

Kaiserslautern building G26, room 206
simultaneous videocast to Saarbrücken building E1 5, room 5th floor

SWS Colloquium

In the not-so-distant past, parallel programming was mostly the concern of programmers specializing in high-performance computing. Nowadays, on the other hand, many of today's desktop and laptop computers come equipped with a species of shared-memory multiprocessor called a multicore processor, making parallel programming a concern for a much broader range of programmers. High-level parallel languages, such as Parallel ML (PML) and Haskell, seek to reduce the complexity of programming multicore processors by giving programmers abstract execution models, ...

In the not-so-distant past, parallel programming was mostly the concern of programmers specializing in high-performance computing. Nowadays, on the other hand, many of today's desktop and laptop computers come equipped with a species of shared-memory multiprocessor called a multicore processor, making parallel programming a concern for a much broader range of programmers. High-level parallel languages, such as Parallel ML (PML) and Haskell, seek to reduce the complexity of programming multicore processors by giving programmers abstract execution models, such as implicit threading, where programmers annotate their programs to suggest the parallel decomposition. Implicitly-threaded programs, however, do not specify the actual decomposition of computations or mapping from computations to processors. The annotations act simply as hints that can be ignored and safely replaced with sequential counterparts. The parallel decomposition itself is the responsibility of the language implementation and, more specifically, of the scheduling system.

Threads can take arbitrarily different amounts of time to execute, and these times are difficult to predict. Implicit threading encourages the programmer to divide the program into threads that are as small as possible because doing so increases the flexibility the scheduler in its duty to distribute work evenly across processors. The downside of such fine-grain parallelism is that if the total scheduling cost is too large, then parallelism is not worthwhile. This problem is the focus of this talk.

The starting point of this talk is work stealing, a scheduling policy well known for its scalable parallel performance, and the work-first principle, which serves as a guide for building efficient implementations of work stealing. In this talk, I present two techniques, Lazy Promotion and Lazy Tree Splitting, for implementing work stealing. Both techniques derive their efficiency from adhering to the work-first principle. Lazy Promotion is a strategy that improves the performance, in terms of execution time, of a work-stealing scheduler by reducing the amount of load the scheduler places on the garbage collector. Lazy Tree Splitting is a technique for automatically scheduling the execution of parallel operations over trees to yield scalable performance and eliminate the need for per-application tuning. I use Manticore, PML's compiler and runtime system, and a sixteen-core NUMA machine as a testbed for these techniques.

In addition, I present two empirical studies. In the first study, I evaluate Lazy Promotion over six PML benchmarks. The results demonstrate that Lazy Promotion either outperforms or performs the same as an alternative scheme based on Eager Promotion. This study also evaluates the design of the Manticore runtime system, in particular, the split-heap memory manager, by comparing the system to an alternative system based on a unified-heap memory manager, and showing that the unified version has limited scalability due to poor locality. In the second study, I evaluate Lazy Tree Splitting over seven PML benchmarks by comparing Lazy Tree Splitting to its alternative, Eager Tree Splitting. The results show that, although the two techniques offer similar scalability, only Lazy Tree Splitting is suitable for building an effective language implementation.

Solvers for Software Reliability and Security

Vijay Ganesh MIT

22 Jul 2010, 1:00 pm - 2:30 pm

Kaiserslautern building G26, room 206
simultaneous videocast to Saarbrücken building E1 5, room 5th floor

SWS Colloquium

The task of building reliable and secure software remains one of the most important and challenging issues in computer science. In recent years, there has been rapid progress in the scalability and effectiveness of software reliability tools. A key reason for this success is the dramatic improvement in the speed of constraint solvers over the last decade. Constraint solvers are essential components of most software reliability tools, whether they are based on formal methods, program analysis, ...

The task of building reliable and secure software remains one of the most important and challenging issues in computer science. In recent years, there has been rapid progress in the scalability and effectiveness of software reliability tools. A key reason for this success is the dramatic improvement in the speed of constraint solvers over the last decade. Constraint solvers are essential components of most software reliability tools, whether they are based on formal methods, program analysis, testing or synthesis. My research on constraint solvers has directly contributed to this trend of increasing solver efficiency and expressive power, thus advancing the state-of-the-art in software reliability research.

In this talk, I will present two solvers that I have designed and implemented, namely, STP and HAMPI. I will talk about the techniques that enable STP and HAMPI to scale, and also some theoretical results. I will also talk about the contexts and applications where each solver is best suited.

STP is a solver for the theory of bit-vectors and arrays. STP was one of the first constraint solvers to enable an exciting new testing technique called Dynamic Systematic Testing (aka Concolic Testing). STP-enabled concolic testers have been used to find hundreds of previously unknown bugs in Unix utilities, OS kernels, media players, and commercial software, some with approximately a million lines of code.

Next, I will describe HAMPI, a solver for a rich theory of equality over bounded string variables, bounded regular expressions, and context-free grammars. Constraints in this theory are generated by analysis of string-manipulating programs. HAMPI has been used to find many unknown SQL injection vulnerabilities in applications with more than 100,000 lines of PHP code using static and dynamic analysis.

Finally, I will conclude my talk with two future research programs. First, I will discuss how faster solvers can enable qualitatively novel approaches to software reliability. Second, I will discuss how we can go from specific solver techniques to solver design paradigms for rich logics.

Timing Analysis of Mixed Time / Event-Triggered Multi-Mode Systems

Linh Phan University of Pensylvania

19 Jul 2010, 1:00 pm - 2:30 pm

Kaiserslautern building G26, room 206
simultaneous videocast to Saarbrücken building E1 5, room 5th floor

SWS Colloquium

Many embedded systems operate in multiple modes, where mode switches can be both time- as well as event-triggered. While timing and schedulability analysis of the system while it is operating in a single mode has been well studied, it is always difficult to piece together the results from different modes in order to deduce the timing properties of a multi-mode system. In this talk, I will present a model and associated analysis techniques to describe embedded systems that process multiple bursty/complex event/data streams and in which mode changes are both time- and event-triggered. ...

Many embedded systems operate in multiple modes, where mode switches can be both time- as well as event-triggered. While timing and schedulability analysis of the system while it is operating in a single mode has been well studied, it is always difficult to piece together the results from different modes in order to deduce the timing properties of a multi-mode system. In this talk, I will present a model and associated analysis techniques to describe embedded systems that process multiple bursty/complex event/data streams and in which mode changes are both time- and event-triggered. Compared to previous studies, our model is very general and can capture a wide variety of real-life systems. Our analysis techniques can be used to determine different performance metrics, such as the maximum fill-levels of different buffers and the delays suffered by the streams being processed by the system. The main novelty in our analysis lies in how we piece together results from the different modes in order to obtain performance metrics for the full system.

Approximate Inference Algorithms in Markov Random Field with Their Applications

Kyomin Jung KAIST

19 Jul 2010, 11:00 am - 12:30 pm

Saarbrücken building E1 5, room 5th floor
simultaneous videocast to Kaiserslautern building G26, room 206

SWS Colloquium

Markov Random Field (MRF) provides an elegant and succinct abstraction to capture inter-dependency between a large number of random variables - applications abound in communication networks, signal processing, statistical physics, combinatorics, biology, etc. In most of these applications, the key task pertains inferring the most likely assignment (aka MAP). This problem is computationally hard in general. Therefore, various approximations have been developed that utilize graph structure of the MRF for efficient computation. Popular approaches like Belief Propagation (and its variants) work well when the underlying graph has large girth, ...

Markov Random Field (MRF) provides an elegant and succinct abstraction to capture inter-dependency between a large number of random variables - applications abound in communication networks, signal processing, statistical physics, combinatorics, biology, etc. In most of these applications, the key task pertains inferring the most likely assignment (aka MAP). This problem is computationally hard in general. Therefore, various approximations have been developed that utilize graph structure of the MRF for efficient computation. Popular approaches like Belief Propagation (and its variants) work well when the underlying graph has large girth, eg. sparse graphical codes like LDPC codes.

In many applications of interest, graphs do have lots of short cycles, but they naturally posses some ``geometry''. We develop a new class of approximation algorithms that utilize geometry, which is called polynomially growing, of the underlying graph to obtain efficient approximation with provable guarantee for the inference problem.

In this talk, I will describe the main idea of the Belief Propagation, and our new algorithm based on simple local updates. I will describe their applications to wireless network and image processing.

Spoken Networks: Analyzing face-to-face conversations and how they shape our social connections

Tanzeem Choudhury Dartmouth College

25 Jun 2010, 1:30 pm - 2:30 pm

Kaiserslautern building G26, room 206

SWS Colloquium

With the proliferation of sensor-rich mobile devices, it is now possible to collect data that continuously capture the real-world social interactions of entire groups of people. These new data sets provide opportunities to study the social networks of people as they are observed "in the wild." However, traditional methods often model social networks as static and binary, which are inadequate for continuous behavioral data. Networks derived from behavioral data are almost always temporal, are often non-stationary, ...

With the proliferation of sensor-rich mobile devices, it is now possible to collect data that continuously capture the real-world social interactions of entire groups of people. These new data sets provide opportunities to study the social networks of people as they are observed "in the wild." However, traditional methods often model social networks as static and binary, which are inadequate for continuous behavioral data. Networks derived from behavioral data are almost always temporal, are often non-stationary, and have finer grained observations about interactions as opposed to simple binary indicators. Thus, new techniques are needed that can take into account the variable tie intensities and the dynamics of a network as it evolves in time. In this talk, I will provide an overview of the computational framework we have developed for modeling the micro-level dynamics of human interactions as well as the macro-level network structure and its dynamics from local, noisy sensor observations. Furthermore, by studying the micro and macro levels simultaneously we are able to link dyad-level interaction dynamics (local behavior) to network-level prominence (a global property). I will conclude by providing some specific examples of how the methods we have developed can be applied more broadly to better understand and enhance the lives and health of people.

Based on joint work with Danny Wyatt (University of Washington), James Kitts (Columbia), Jeff Bilmes (University of Washington), Andrew Campbell (Dartmouth), and Ethan Berke (Dartmouth Medical School)

Privacy and forensics in federated distributed systems

Andreas Haeberlen University of Pennsylvania

15 Jun 2010, 11:00 am - 12:00 pm

Saarbrücken building E1 5, room 5th floor
simultaneous videocast to Kaiserslautern building G26, room 206

SWS Colloquium

In this talk, I will describe two of my ongoing projects, one on collaborative security and the other on secure network provenance.

Collaborative security is an approach to fighting security threats that span many different administrative domains. For example, it is often difficult for Internet service providers to detect botnets because each domain can locally observe only a small part of the botnet; however, exchanging information across domains is challenging because of privacy concerns. ...

In this talk, I will describe two of my ongoing projects, one on collaborative security and the other on secure network provenance.

Collaborative security is an approach to fighting security threats that span many different administrative domains. For example, it is often difficult for Internet service providers to detect botnets because each domain can locally observe only a small part of the botnet; however, exchanging information across domains is challenging because of privacy concerns. We are working on a system that enables domains to share information in a controlled fashion. Each domain establishes a 'privacy budget' and allows other domains to ask queries in a special language; the 'privacy cost' of each query is inferred automatically and deducted from the budget. This is accomplished through a combination of type systems and differential privacy.

Secure network provenance (SNP) is a technique for answering forensic questions about distributed systems in which some nodes have been compromised by an adversary. Each node maintains a record of past states and the (local or remote) causes of state transitions. Thus, if the symptoms of an intrusion are observed, it is possible to trace them back to their root causes or, conversely, to determine how an intrusion on one node has affected other nodes.

Reverse Traceroute

Arvind Krishnamurthy University of Washington

14 Jun 2010, 10:15 am - 11:15 am

Saarbrücken building E1 5, room 5th floor
simultaneous videocast to Kaiserslautern building G26, room 206

SWS Colloquium

Traceroute is the most widely used Internet diagnostic tool today. Network operators use it to help identify routing failures, poor performance, and router misconfigurations. Researchers use it to map the Internet, predict performance, geolocate routers, and classify the performance of ISPs. However, traceroute has a fundamental limitation that affects all these applications: it does not provide reverse path information. Although various public traceroute servers across the Internet provide some visibility, no general method exists for determining a reverse path from an arbitrary destination. ...

Traceroute is the most widely used Internet diagnostic tool today. Network operators use it to help identify routing failures, poor performance, and router misconfigurations. Researchers use it to map the Internet, predict performance, geolocate routers, and classify the performance of ISPs. However, traceroute has a fundamental limitation that affects all these applications: it does not provide reverse path information. Although various public traceroute servers across the Internet provide some visibility, no general method exists for determining a reverse path from an arbitrary destination.

We address this longstanding limitation by building a reverse traceroute tool. Our tool provides the same information as traceroute, but for the reverse path, and it works in the same case as traceroute, when the user may lack control of the destination. Our approach combines a number of ideas: source spoofing, IP timestamp and record route options, and multiple vantage points. In the median case our tool finds 87% of the hops seen in a directly measured traceroute along the same path, versus only 38% if one simply assumes the path is symmetric, a common fallback given the lack of available tools. We then use our reverse traceroute tool to study previously unmeasurable aspects of the Internet: we uncover more than a thousand peer-to-peer AS links invisible to current topology mapping efforts, we examine a case study of how a content provider could use our tool to troubleshoot poor path performance, and we measure the latency of individual backbone links with, on average, sub-millisecond precision

Exploiting Language Abstraction to Optimize Memory Efficiency

Jennifer B. Sartor University of Texas, Austin

12 May 2010, 2:00 pm

Saarbrücken building E1 5, room 5th floor
simultaneous videocast to Kaiserslautern building G26, room 206

SWS Colloquium

Memory continues to be a bottleneck in modern systems as application complexity and the number of cores have increased, generating more traffic. These complex applications are written largely in high-level managed languages, which offer an opportunity to dynamically optimize memory performance because they virtualize memory management. We explored memory inefficiencies in a Java virtual machine, performing a study of data compression techniques. I will present the results of that study, showing arrays are a dominant source of heap bloat. ...

Memory continues to be a bottleneck in modern systems as application complexity and the number of cores have increased, generating more traffic. These complex applications are written largely in high-level managed languages, which offer an opportunity to dynamically optimize memory performance because they virtualize memory management. We explored memory inefficiencies in a Java virtual machine, performing a study of data compression techniques. I will present the results of that study, showing arrays are a dominant source of heap bloat. Focusing on arrays, we found the traditional contiguous layout precludes space optimizations and does not offer memory management time and space bounds. I show how to exploit the opportunities afforded by managed languages to implement an efficient discontiguous array layout with tunable optimization parameters that improve space usage. Having attacked memory performance on the software side, I will then describe new work that takes a cooperative software-hardware approach. I show how a memory manager can communicate regions of dead data to the architecture, allowing it to eliminate useless writes, substantially reducing memory write traffic. My research combines the flexibility and productivity of high-level managed languages with improved memory efficiency that is critical to current and future hardware.

Programmable Self-Adjusting Computation

Ruy Ley-Wild Carnegie Mellon University

27 Apr 2010, 2:00 pm - 3:00 pm

Saarbrücken building E1 5, room 5th floor
simultaneous videocast to Kaiserslautern building G26, room 206

SWS Colloquium

In the self-adjusting computation model, programs can respond automatically and efficiently to input changes by tracking the dynamic data dependencies of the computation and incrementally updating the output as needed. After a run from scratch, the input can be changed and the output can be updated via change-propagation, a mechanism for re-executing the portions of the computation affected by the changes while reusing the unaffected parts. Previous research shows that self-adjusting computation can be effective at achieving near-optimal update bounds for various applications. ...

In the self-adjusting computation model, programs can respond automatically and efficiently to input changes by tracking the dynamic data dependencies of the computation and incrementally updating the output as needed. After a run from scratch, the input can be changed and the output can be updated via change-propagation, a mechanism for re-executing the portions of the computation affected by the changes while reusing the unaffected parts. Previous research shows that self-adjusting computation can be effective at achieving near-optimal update bounds for various applications. We address the question of how to write and reason about self-adjusting programs.

We propose a language-based technique for annotating ordinary programs and compiling them into equivalent self-adjusting versions. We also provide a cost semantics and a concept of trace distance that enables reasoning about the effectiveness of self-adjusting computation at the source level. To facilitate asymptotic analysis, we propose techniques for composing and generalizing concrete distances via trace contexts (traces with holes). The translation preserves the extensional semantics of the source programs, the intensional cost of from-scratch runs, and ensures that change-propagation between two evaluations takes time bounded by their relative distance.

Accurate Analysis of Large Private Datasets

Vibhor Rastogi University of Washington

15 Apr 2010, 10:00 am - 11:30 am

Saarbrücken building E1 5, room 5th floor
simultaneous videocast to Kaiserslautern building G26, room 206

SWS Colloquium

Today, no individual has full control over access to his personal information. Private data collected by various hospitals and universities, and also by websites like Google and Facebook, contain valuable statistical facts that can be mined for research and analysis, e.g., analyze outbreak of diseases, detect traffic patterns on the road, or understand browsing trends on the web, but concerns about individual privacy severely restricts its use, e.g., privacy attacks led AOL to recently pull-off its published search-log data. ...

Today, no individual has full control over access to his personal information. Private data collected by various hospitals and universities, and also by websites like Google and Facebook, contain valuable statistical facts that can be mined for research and analysis, e.g., analyze outbreak of diseases, detect traffic patterns on the road, or understand browsing trends on the web, but concerns about individual privacy severely restricts its use, e.g., privacy attacks led AOL to recently pull-off its published search-log data.

To remedy this, much recent work focuses on data analysis with formal privacy guarantees. This has given rise to differential privacy considered by many as the golden standard of privacy. However, few practical techniques satisfying differential privacy exist for complex analysis tasks (e.g., analysis involving complex query operators), or new data models (e.g., data having temporal correlations or uncertainty). In this talk, I will discuss techniques that fill this void. I will first discuss a query answering algorithm that can handle joins (previously, no private technique could accurately answer join queries arising in many analysis tasks). This algorithm makes several privacy-preserving analyses over social network graphs possible for the first time. Then I will discuss a query-answering technique over time-series data, which enables private analysis of GPS traces and other temporally-correlated data. Third, I will discuss an access control mechanism for uncertain data, which enables enforcing security policies on RFID-based location data. Finally, I will conclude by discussing some privacy and security problems in building next-generation computing systems based on new models for data (e.g., uncertain data), computing (e.g., cloud computing), and human computer interaction (e.g., ubiquitous systems).

Binary Program Analysis and Model Extraction for Security Applications

Juan Caballero Carnegie Mellon University

01 Apr 2010, 11:00 am - 12:30 pm

Saarbrücken building E1 5, room 5th floor
simultaneous videocast to Kaiserslautern building G26, room 206

SWS Colloquium

In this talk I present a platform to extract models of security-relevant functionality from program binaries, enabling multiple security applications such as active botnet infiltration, finding deviations between implementations of the same functionality, vulnerability signature generation, and finding content-sniffing cross-site scripting (XSS) attacks. In this talk, I present two applications: active botnet infiltration and finding content-sniffing XSS attacks.

Botnets, large networks of infected computers under control of an attacker, are one of the dominant threats in the Internet, ...

In this talk I present a platform to extract models of security-relevant functionality from program binaries, enabling multiple security applications such as active botnet infiltration, finding deviations between implementations of the same functionality, vulnerability signature generation, and finding content-sniffing cross-site scripting (XSS) attacks. In this talk, I present two applications: active botnet infiltration and finding content-sniffing XSS attacks.

Botnets, large networks of infected computers under control of an attacker, are one of the dominant threats in the Internet, enabling fraudulent activities such as spamming, phishing, and distributed denial-of-service attacks. To build strong botnet defenses, defenders need information about the botnet's capabilities and the attacker's actions. One effective way to obtain that information is through active botnet infiltration, but such infiltration is challenging due to the encrypted and proprietary protocols that botnets use to communicate. In this talk, I describe techniques for reverse-engineering such protocols and present how we use this information to infiltrate a prevalent, previously not analyzed, spam botnet.

Cross-site scripting attacks are the most prevalent class of attacks nowadays. One subtle class of overlooked XSS attacks are content-sniffing XSS attacks. In this talk, I present model extraction techniques and how they enable finding content-sniffing XSS attacks. We use those models to find attacks against popular web sites and browsers such as Wikipedia when accessed using Internet Explorer 7. I describe our defenses for these attacks and how our proposals have been adopted by widely used browsers such as Google Chrome and IE8, as well as standardization groups.

Taming the Malicious Web: Avoiding and Detecting Web-based Attacks

Marco Cova University of California, Santa Barbara

29 Mar 2010, 11:00 am - 12:30 pm

Saarbrücken building E1 5, room 5th floor
simultaneous videocast to Kaiserslautern building G26, room 206

SWS Colloquium

The world wide web is an essential part of our infrastructure and a predominant mean for people to interact, do business, and participate to democratic processes. Unfortunately, in recent years, the web has also become a more dangerous place. In fact, web-based attacks are now a prevalent and serious threat. These attacks target both web applications, which store sensitive data (such as financial and personal records) and are trusted by large user bases, and web clients, ...

The world wide web is an essential part of our infrastructure and a predominant mean for people to interact, do business, and participate to democratic processes. Unfortunately, in recent years, the web has also become a more dangerous place. In fact, web-based attacks are now a prevalent and serious threat. These attacks target both web applications, which store sensitive data (such as financial and personal records) and are trusted by large user bases, and web clients, which, after a compromise, can be mined for private data or used as drones of a botnet.

In this talk, we will present an overview of our techniques to detect, analyze, and mitigate malicious activity on the web. In particular, I will present a system, called Wepawet, which targets the problem of detecting web pages that launch drive-by-download attacks against their visitors. Wepawet visits web pages with an instrumented browser and records events that occur during the interpretation of their HTML and JavaScript code. This observed activity is analyzed using anomaly detection techniques to classify web pages as benign or malicious. We made our tool available as an online service, which is currently used by several thousands of users every month.

We will also discuss techniques to automatically detect vulnerabilities and attacks against web applications. In particular, we will focus on static analysis techniques to identify ineffective sanitization routines and to detect vulnerabilities stemming from the interaction of multiple modules of a web application. These techniques found tens of vulnerabilities in several real-world web applications.

Formal Program Verification Through Characteristic Formulae

Arthur Chargueraud INRIA

25 Mar 2010, 11:00 am - 12:30 pm

Saarbrücken building E1 5, room 5th floor
simultaneous videocast to Kaiserslautern building G26, room 206

SWS Colloquium

The characteristic formula of a program is a logical formula that implies any valid post-condition for that program. In this talk, I will explain how to build, in a systematic manner, the characteristic formula of a given purely functional Caml program, and then explain how one can use this formula to verify the program interactively, using the Coq proof assistant. This new, sound and complete approach to the verification of total correctness properties has been applied to the formalization of a number of data structures taken from Chris Okasaki's reference book. ...

The characteristic formula of a program is a logical formula that implies any valid post-condition for that program. In this talk, I will explain how to build, in a systematic manner, the characteristic formula of a given purely functional Caml program, and then explain how one can use this formula to verify the program interactively, using the Coq proof assistant. This new, sound and complete approach to the verification of total correctness properties has been applied to the formalization of a number of data structures taken from Chris Okasaki's reference book. My presentation will include demos based on those case studies.

Program equivalence and compositional compiler correctness

Chung-Kil Hur Laboratoire PPS

22 Mar 2010, 11:00 am - 12:00 pm

Saarbrücken building E1 5, room 5th floor
simultaneous videocast to Kaiserslautern building G26, room 206

SWS Colloquium

Marco Shapiro and Pierre Sutra LIP6

11 Jan 2010, 3:00 pm - 4:30 pm

Saarbrücken building E1 5, room 5th floor
simultaneous videocast to Kaiserslautern building G26, room 206

SWS Colloquium

Experience with many applications (e.g., web caching, P2P sharing, and grid and cloud computing) shows that data replication is a fundamental feature of large systems. Replication improves performance, availability and dependability.

Replication algorithms based on state machine replication are attractive because they maintain a simple sequential semantics. However, we believe they will not scale to massive cloud and peer-to-peer systems. To be successful, future algorithms must: (1) support multi-object transactions across distant data centres, ...

Experience with many applications (e.g., web caching, P2P sharing, and grid and cloud computing) shows that data replication is a fundamental feature of large systems. Replication improves performance, availability and dependability.

Replication algorithms based on state machine replication are attractive because they maintain a simple sequential semantics. However, we believe they will not scale to massive cloud and peer-to-peer systems. To be successful, future algorithms must: (1) support multi-object transactions across distant data centres, (2) leverage the semantics of data accesses, and (3) support partial replication.

In the first part of this talk, we describe two variants of Generalized Paxos, a solution to consensus that leverages the commutativity semantics. Our algorithms reduce message delay when a collision occurs between non-commuting operations. In the second part, we present a new approach to partial replication of database systems at large scale. Previous protocols either reexecute transactions entirely and/or compute a total order of transactions. In contrast, ours applies update values, and generate a partial order between mutually conflicting transactions only.

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Need help?