Report on ACM Eurosys 2013 Conference

Technical University Library

This year Eurosys took place in Prague, Czech Republic. Eurosys is one of the main systems conferences. This year there were 28 accepted papers out of 143 (19% acceptance rate), a bit higher than previous years, but statistics did not consider submissions that did not fulfill the requirements, which has been counted in the past. Out of them only 6 european, and plenty of MSR papers.

Monday 15th April

Session 1: Large scale distributed computation I

TimeStream: Reliable Stream Computation in the Cloud

Chengping Qian (Microsoft Research Asia), Yong He (South China University of Technology), Chunzhi Su, Zhuojie Wu, and Hongyu Zhu (Shanghai Jiaotong University), Taizhi Zhang (Peking University), Lidong Zhou (Microsoft Research Asia), Yuan Yu (Microsoft Research Silicon Valley), and Zheng Zhang (Microsoft Research Asia)
Good motivation from MSR about why use stream processing: real-time heat map of latency pairwise in datacenter for network/ Infrastructure monitoring, real-time advertising, map queries, both current and previous ones, to the presented adverts.

Adverts must be reliable! more than monitoring (makes sense :) ).

Contribution focusses on resilience and fault tolerance. They build a DAG, rewritten dynamically, replacing link with hashed ones with several nodes. Not losing info, any subgraph can be substituted by an equivalent one, and reloads/recomputes missing pieces.

Several optimizations added, such as message batch aggregation, and lightweight dependency tracking from input/output, to estimate impact

Interesting work, and was well presented.

Optimus: A Dynamic Rewriting Framework for Execution Plans of Data-Parallel Computation

Qifa Ke, Michael Isard, and Yuan Yu (Microsoft Research Silicon Valley)

Motivation: there are many problems in large scale computations which cannot be known in advance. How to handle partition skew, what is the right number of tasks (e.g. Reducers?). Also, large scale matrix multiplication is argued it can cause problems with intermediate steps. Iterative co, or providing fault tolerance capabilities? The paper proposes to optimize the EPG (Execution Plan Graph) at runtime to solve these issues.

The solution for reliability is interesting,having a ‘cache’ for intermediate data, and choosing either data for next step.

BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data


Sameer Agarwal (University of California, Berkeley), Barzan Mozafari (Massachusetts Institute of Technology), Aurojit Panda (University of California, Berkeley), Henry Milner (University of California, Berkeley), Samuel Madden (Massachusetts Institute of Technology), and Ion Stoica (University of California, Berkeley)

Motivation: compute aggregate statistics on huge amounts of data.

Select WITHIN 2 seconds. Results with error, can be refined . The idea is very elegant and original, and was brilliantly presented.

Estimation time is obtained by querying over small samples and extrapolating (should be linear). So, how well does a query cover the original one? Depends on what elements it has. Computes how complete the data will be, don’t get it too well. A sample has cost and coverage. ILP problem determines what samples to get.

Session 2: Security and Privacy

IFDB: Decentralized Information Flow Control for Databases

David Schultz and Barbara Liskov (MIT CSAIL)

Information flow control, by tagging database rows with labels, stored in an extra column. It’s far from my area, but I am surprised this has not been done already. Paper is well explained, and results show that the incurred overhead is small. A problem pointed in the questions is that it requires manual tagging, which is in general extremely hard to do right.

Process Firewalls: Protecting Processes During Resource Access

 Hayawardh Vijayakumar (The Pennsylvania State University), Joshua Schiffman (Advanced Micro Devices), and Trent Jaeger (The Pennsylvania State University)

There are many threats to file access, resource access cntrol is very hard. Programmer are not going to get it right. System call level protection carries a huge overhead.  Idea: reverse threat protection approach. With introspection you protect vulnerable processes, instead of sand boxing dangerous attackers. So, declare unsafe resources for a specific process context.

It is interesting that processing firewall rules is much more efficient that implementing checks manually (because it is prone to errors). Declarative wins apparently by a huge margin.

Resolving the conflict between generality and plausibility in verified computation

Srinath Setty, Benjamin Braun, Victor Vu, and Andrew J. Blumberg (UT Austin), Bryan Parno (Microsoft Research Redmond), and Michael Walfish (UT Austin)

They propose a Cryptographic technique, Probabilistically Checkable Proof (PCP), for checking whether a server indeed performed some computation. It is not yet usable because of the huge computation cost.

Session 3: Replication

ChainReaction: A Causal+ Consistent Datastore based on Chain Replication

Sergio Almeida, Joao Leitao, and Luıs Rodrigues (INESC-ID, Instituto Superior Tecnico, Universidade Tecnica de Lisboa)

ChainReaction is a Geo-distributed K/V store that implements the existing Causal+ model for improved read performance over existing causal replication-based systems. The work extends over FAWN, and adds capabilities for a geo-distributed setup. It was well presented, and design decisions and results are clearly reflected in the paper.

Augustus: Scalable and Robust Storage for Cloud Applications

Ricardo Padilha and Fernando Pedone (University of Lugano, Switzerland)

Bizantine Failure Tolerance (BFT) would be convenient in a cloud environment, but it heavily penalizes latency. The paper proposes single-partition transactions, and multi-partition read only transactions. The restrictions in applicability look a bit severe, but it was nicely validated across different workloads. Not sure that the social network workload they generated was representative, with them choosing an arbitrary 50% chance of a connection being close, with no justification.

MDCC: Multi-Data Center Consistency

Tim Kraska, Gene Pang, and Michael Franklin (UC Berkeley), Samuel Madden (MIT), and Alan Fekete (University of Sydney)

The authors present MDCC, a replication technique that attempts to exploit two main observations of geo distributed DBs: conflicting operations are commutative, and they are actually rare, as each client often updates their own data. With these, they implement a modified version of Paxos Multi + Fast, which attempts to lessen latency by reducing in several cases the number of phases. Results point in an extensive set of experiments to a significant performance improvement over other transactional databases.

Session 4: Concurrency and Parallelism

Conversion: Multi-Version Concurrency Control for Main Memory Segments 
Timothy Merrifield and Jakob Eriksson (University of Illinois at Chicago)

Cache control has become a main bottleneck in multi-core systems. Proposal: each process handles its own working copy for concurrent memory access. If processes can afford working with a slightly out of date copy, performance can be significantly improved.

Whose Cache Line Is It Anyway? Operating System Support for Live Detection and Repair of False Sharing 
Mihir Nanavati, Mark Spear, Nathan Taylor, Shriram Rajagopalan, Dutch T. Meyer, William Aiello, and Andrew Warfield (University of British Columbia)

Writes to the same cache line from multiple processes force to write everyone to main memory. Can have a huge impact on performance in many cases. Idea: split pages in an isolated page where conflicts are, and an underlay page with no conflicts.

Adaptive Parallelism for Web Search 
Myeongjae Jeon (Rice University), Yuxiong He (Microsoft Research), Sameh Elnikety (Microsoft Research), Alan L. Cox and Scott Rixner (Rice University)

In web search services (e.g. Bing), parallelism involves querying multiple index servers for results, and aggregating them with techniques such as PageRank. However, index server queries are sequential. The paper discusses the challenges of parallelizing in-server search. I don’t think it is novel, but it is well explained.

Tuesday 16th April

Session 1: Large scale distributed computation II

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

Zuhair Khayyat, Karim Awara, and Amani Alonazi (King Abdullah University of Science and Technology), Hani Jamjoom and Dan Williams (IBM T. J. Watson Research Center, Yorktown Heights), and Panos Kalnis (King Abdullah University of Science and Technology)

MIzan is a Pregel-based system, implemented in C++, which optimizes execution time by dynamically migrating vertices every iteration. Each node superstep execution is profiled, so that they statistically see what nodes perform slower, and if over threshold migrate dynamically. Every worker has a match worker with less load, where all migrations are headed too. Node locations are handled through a DHT. The technique is interesting in the sense that it completely ignores graph topology. It might incidentally reduce the number of cut edges, but does it by only looking at runtime statistics. However, the destination is chosen to load balance CPU, so if the network is the bottleneck it might not be enough.

The solution works, although validation has some problems, they only looked at graphs in the 2M range, which lend to questions about why not do it in a single machine. Comparisons where against a specific version of Giraph, which Greg Malewicz pointed out it had been vastly improved in the last six months.

MeT: Workload aware elasticity for NoSQL

Francisco Cruz, Francisco Maia, Miguel Matos, Rui Oliveira, Joao Paulo, Jose Pereira, and Ricardo Vilaca (HASLab / INESC TEC and U. Minho)

MeT is an HBase extension that performs elastic configuration of slaves. Depending on the observed access patterns, it provides replication and load balancing, Monitoring runtime statistics feeds a decision algorithm, identifying suboptimal config from CPU Usage. If a problem is detected, a distribution algorithm is run to spread the replicas.

Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices

Shivaram Venkataraman (UC Berkeley), Erik Bodzsar (University of Chicago), and Indrajit Roy, Alvin AuYoung, and Robert S. Schreiber (HP Labs)

Presto is an R library that parallelizes matrix computations using distributed machines (darray construct, supporting for each). The API is lightweight, and requires minimal modification of R code. When paralellizing sparse matrices, Presto attempts to avoid the impact of skew in sparse matrices partitions.

Online repartitioning scheme, profiling each partition for optimizing further iteratons, if ratio higher than threshold, split problematic partition. It is integrated to R without modifying it by  hacking memory allocation including object headers. The contribution is not major, but is is well thought and described.

Session 2: Operating Systems Implementation

RadixVM: Scalable address spaces for multithreaded applications

Austin T. Clements, Frans Kaashoek, and Nickolai Zeldovich (MIT CSAIL)

Failure-Atomic msync(): A Simple and Efficient Mechanism for Preserving the Integrity of Durable Data

Stan Park (University of Rochester), Terence Kelly (HP Labs), and Kai Shen (University of Rochester)

Composing OS extensions safely and efficiently with Bascule

Andrew Baumann (Microsoft Research), Dongyoon Lee (University of Michigan), Pedro Fonseca (MPI Software Systems), and Jacob R. Lorch, Barry Bond, Reuben Olinsky, and Galen C. Hunt (Microsoft Research)

Session 3: Miscellaneous

Hypnos: Understanding and Treating Sleep Conflicts in Smartphone

Abhilash Jindal, Abhinav Pathak, Y. Charlie Hu, and Samuel Midkiff (Purdue University)

They analyze several sleep conflicts when the state machine of the smartphone fails, and the device is not effectively suspended. Tested on Nexus One and Galaxy S devices (3+ year old). Looks a bit weak.

Prefetching Mobile Ads: Can advertising systems afford it?

Prashanth Mohan (UC Berkeley) and Suman Nath and Oriana Riva (Microsoft Research)

It is MS data of course, but… just go to Breaking for Commercials: Characterizing Mobile Advertising

Maygh: Building a CDN from client web browsers

Liang Zhang, Fangfei Zhou, Alan Mislove, and Ravi Sundaram (Northeastern University)

Maygh is a web-based CDN implemented with HTML 5.  Content is cached through HTML 5 LocalStorage, 5 MB with programmatic control. . Novelty, no client modification at all (browser, not plugins such as firecoral). Implemented with RTMFP (Flash), WebRTC, key is NAT traversal via STUN.

Architecture is based on a proxy, the Maigh coordinator, maintaning a directory for content, via hashing. The idea works in principle, it has not been developed beyond a proof of concept (scalabiilty, security are not addressed properly). Interesting read.

Wednesday 17th April

Session 1: Virtualization

hClock: Hierarchical QoS for Packet Scheduling in a Hypervisor

Jean-Pascal Billaud and Ajay Gulati (VMware, Inc.)

RapiLog: Reducing System Complexity Through Verification

Gernot Heiser, Etienne Le Sueur, Adrian Danis, and Aleksander Budzynowski (NICTA and UNSW) and Tudor-Ioan Salomie and Gustavo Alonso (ETH Zurich)

Application Level Ballooning for Efficient Server Consolidation

Tudor-Ioan Salomie, Gustavo Alonso, and Timothy Roscoe (ETH Zurich) and Kevin Elphinstone (UNSW and NICTA)

Currently many applications (e.g. databases) are not designed to work fairly in a virtualized environment, where resources such as memory are shared and dynamically assigned. instead, they grab hold of the resources, whereas in many cases they could be working with much less memory and perform similarly.

The paper proposes a technique for ‘ballooning’ applications, so that the amount of memory assigned to them can be expanded, or squeezed. It requires modification of the applications and was developed for MySQL and the OpenJDK JVM, over Xen. Very interesting paper.

Session 2: Scheduling and performance isolation

Omega: flexible, scalable schedulers for large compute clusters


Malte Schwarzkopf (University of Cambridge Computer Laboratory), Andy Konwinski (University of California Berkeley), and Michael Abd-el-Malek and John Wilkes (Google Inc.)

Omega is the upcoming scheduler for Google datacenters. It performs heterogeneous scheduling, segregating types of jobs, batch and services, with priority for batch (as they are orders of magnitude more). Solution: multiple schedulers, with shared state, and optimistic concurrency . They had to add some optimizations because constraints cause an enormous number of conflicts up when simulating realistic scenarios. The approach allows having  custom schedulers per application type, and they show an example that improves MR scheduling playing around the pattern in user preferences. Very good paper, well-deserved award.

Choosy: Max-Min Fair Sharing for Datacenter Jobs with Constraints

Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica (UC Berkeley)

Same problem, resource allocation in multitenant datacenters. They apply a constrained max min algorithm to guarantee fairness. The algo recursively maximizes allocation for the user with fewest machines. Nice flow filling model. Offline optimum, online approximation. Very interesting contrast with the previous paper, one very practical, based on the real Google workload, and this one much more academic in the resource allocation problem.

CPI2: CPU performance isolation for shared compute clusters

Xiao Zhang, Eric Tune, Robert Hagmann, Rohit Jnagal, vrigo Gokhale, and John Wilkes (Google, Inc.)

Performance isolation does not have perfectly in practice because of all the contending resources such as cache memory. They have implemented detection of problematic processes by monitoring hw performance counters, and ideally throttle the culprits.