Summary Reviews

In most conferences, the reviews, PC discussions and motivations for accepting a paper remain hidden from the audience, and attendees have to figure out by themselves what motivated the PC to accept a certain paper. As we know from experience that such information is interesting, this year EuroSys decided to make the process more transparent and publish a write-up of the reviews online.

STM in the small: trading generality for performance in software transactional memory

Aleksandar Dragojevic (EPFL) and Tim Harris (Microsoft Research)

The program committee felt that SpecTM represents a technically sound, interesting, and potentially promising point in the design space of optimistic concurrency control schemes and APIs. The PC appreciated that the paper was consistently clear and up-front about both the strengths and weaknesses of this approach, with respect to both traditional STM and CAS-based algorithms. The constraints SpecTM places on the developer, such as adding explicit sequence numbers to transactional memory accesses, seem reasonable at least for some interesting subset of concurrent algorithms and data structures, and the performance benefits are carefully evaluated and compelling at least at the "micro-benchmark" level. The scheme also has the advantage of being compatible with STM, thereby allowing for selective or incremental adoption in applications or libraries. Further, it seems likely that future compilers could generate and emit SpecTM operations automatically at least in certain cases.

On the other hand, the PC was also skeptical about how beneficial SpecTM's tradeoff will be in practice, or how widely applicable this approach will turn out to be. The paper makes the key assumption that writing SpecTM code will be "easier" than writing equivalent CAS-based code, which seemed intuitively or anecdotally plausible to some reviewers but is never actually evaluated in the paper. (The PC recognizes that evaluating "ease of programming" properties such as this would require careful user studies, which are not very common in the systems community, though perhaps they should be.) Even if SpecTM code is easier to write, how much easier? Will "ordinary parallel programmers" be able to understand SpecTM's rules and constraints well enough to write fast and correct SpecTM code, or will only "STM experts" ultimately be able to write good algorithms using SpecTM? Given the range of specializations and optimizes SpecTM provides, will the "correct" choices in writing well-optimized SpecTM code often depend not only on the algorithm in question but also on the workload characteristics and target environment? Will SpecTM's constraints preclude common programming idioms, such as transactions that need to compose or cross function boundaries? Will SpecTM prove applicable to a wider range of concurrent algorithms and data structures than the paper evaluates, and will its performance benefits extend beyond the paper's algorithm-focused microbenchmarks and prove effective in large, realistic applications?

Despite the many questions the paper leaves unanswered, the PC felt that SpecTM represents a well-motivated and carefully-considered point in design space between general STM and hand-coded concurrency schemes. Further, the PC felt that the paper presents a convincing case that SpecTM's design point is feasible and useful - at least for certain algorithms, under certain conditions, and in the hands of developers with the appropriate understanding. Finally, the PC felt that the paper could represent a springboard for future steps that may be able to generalize SpecTM further, or to preserve SpecTM's performance benefits while hiding its complexity, via compiler techniques for example.

Improving Server Applications with System Transactions

Sangman Kim, Michael Lee, Alan Dunn, and Owen S. Hofmann (The University of Texas at Austin), Xuan Wang (Stony Brook University), Emmett Witchel (The University of Texas at Austin), and Donald E. Porter (Stony Brook University)

The reviewers appreciated the fact that this paper provides an interesting experience of using OS transactions, including lessons that were learned and non-obvious issues that the authors came across when adapting real-world applications to use this paradigm. Furthermore, the paper also identifies functionality that needed to be added to an earlier OS transaction proposal, which turned out to be important for these applications to work well under this model. Given the importance of the topic, the amount of work that is involved, and since actual experience using memory transactional models is still news, the reviewers felt very positive about this paper. A potential problem that more than one reviewer raised was whether the additional features would erode the advantage of simplicity afforded by transactions. Finally, the reviewers mentioned that the evaluation in the submitted version left some room for improvement, and pointed out minor issues regarding novelty in comparison to the original TxOS and that the approach to the problem could have been more systematic.

Where is the energy spent inside my app? Fine Grained Energy Accounting on Smartphones with Eprof

Abhinav Pathak and Y. Charlie Hu (Purdue University) and Ming Zhang (Microsoft Research)

This paper makes use of eprof (Eurosys 2011) to carry out a detailed study of energy use in several Android smartphones by a range of popular applications. Some of the findings were surprising, such as significant battery life being sacrificed to tracking location or advertising during the use of an application whose main purpose is ostensibly something else (game, newspaper, etc). These problems were narrowed down to highly localized pieces of code ("bundles") with bursty activity that is power-hungry. Such observations helped the authors restructure applications to gain significant reductions in power consumption. The program committee appreciated a number of aspects of this work. First, the paper nails the evidence that location trackers and advertisers are using up valuable smartphone energy and invading our privacy -- we felt this to be an important results for the community. Second, the notion of "bundles" goes beyond previously proposed ways of measuring energy, and turns out to be a useful construct in energy profiling that paves the way to practical ways of understanding and debugging energy consumption in applications, whether smartphone-based or not. Third, the depth of evaluation and the realism of the prototype were universally liked. No paper is perfect. The program committee observed that establishing ground truth in these measurements is not easy, and so the measurements must be taken with a grain of salt; there is no independent measurement to confirm the accuracy of the accounting method. Due to limited space, it was also challenging to describe all the details of the implementation, which does not enable a foolproof interpretation of the results. Finally, the focus on profiling and optimization makes the scientific contribution appear incremental over the authors' work published in last year's EuroSys, but the strength and importance of the results are the ultimate criterion for judgment, which made this paper a strong submission.

Energy Efficiency for Large-Scale MapReduce Workloads with Significant Interactive Analysis

Yanpei Chen and Sara Alspaugh (UC Berkeley), Dhruba Borthakur (Facebook), and Randy Katz (UC Berkeley)

This paper starts out by analyzing a Facebook workload that includes interactive data analysis jobs. The authors refer to such workloads as MapReduce with Interactive Analysis or simply MIA. A key characteristic of these workloads is that most jobs run for short periods and access a relatively small amount of data. Based on these observations from this workload, the paper then proposes BEEMR a MapReduce workload manager that segregates interactive and batch workloads into separate sub-clusters to improve energy efficiency. The relatively few servers in the interactive sub-cluster are always active. BEEMR manages the much larger batch sub-cluster by creating “batching intervals” and classifying batch jobs with long tasks as “interruptible”. During a batching interval, BEEMR transitions all the servers to a low-power state (e.g., off) when all the batch jobs started in the interval complete; the interruptible jobs are interrupted at that point. The servers stay in the low-power state until it is time for the next batch interval. When that time comes, the interruptible jobs are re-started and the waiting batch jobs are started. The program committee felt that the paper deals with an important problem, namely improving the energy efficiency of data-processing frameworks. The paper includes an extensive analysis of an interesting real workload, and an effort to validate the BEEMR simulations. Still on the positive side, the BEEMR evaluation shows significant energy savings. The paper also opens up many avenues for future research. For example, further research is needed to create approaches for setting the BEEMR parameters automatically, and to investigate the relative benefits of completely segregating interactive and batch workloads into two separate clusters against those from BEEMR. The authors discuss these and other future research avenues in the paper.

GreenHadoop: Leveraging Green Energy in Data-Processing Frameworks

Inigo Goiri, Kien Le, and Thu D. Nguyen (Rutgers University), Jordi Guitart and Jordi Torres (UPC), and Ricardo Bianchini (Rutgers University)

The paper presents GreenHadoop, a modified version of the map/reduce framework Hadoop that is aware of green and brown energy availability. GreenHadoop aims to maximize the exploitation of green energy (from solar in the paper's case), while minimizing the cost of brown energy given brown energy price tariffs. The basic approach is to defer processing to match the processing load with green energy availability, or brown energy pricing.

The program committee thought the paper clearly presented a good discussion of the trade-offs involved in exploiting green energy, and a solid evaluation of the proposed system showed positive results.

It was noted that the approach taken was more broadly applicable than presented as the approach could respond to pricing indicators independent of the ultimate source of the energy. Also noted was the "plug-able" nature of energy predictors enabling incorporation of alternative green energy sources. The success and robustness of predicting workload energy requirements without user input was commented on favorably.

The PC thought the paper could be further improved by taking a more formal and mathematically rigorous approach to the presentation of the scheduling algorithm.

The PC also expressed concern about assertions regarding the economic merit of self-generation. The economics of self-generation are complex and depend on many factors (e.g. subsidies, data-center lifetime, etc...), however the PC thought the approach had merit independent of the energy source due to the ability to respond to price indicators.

Frugal Storage for Cloud File Systems

Krishna Puttaswamy, Thyaga Nandagopal, and Murali Kodialam (Bell Labs, Alcatel-Lucent)

This paper describes a FCFS (frugal file system service), a system that optimizes the overall storage cost among different Cloud storage services and types of different pricing. The main idea is to treat the underlying storage services as a dual storage system, one with low latency (e.g., Amazon ElastiCache) and the other with high latency (e.g., Amazon S3). The low latency system has low data transfer costs but the storage-per-byte costs are high (i.e., cache), while the high latency system has low storage-per-byte costs but high data transfer costs (i.e., disk). These properties are similar to caching systems and so caching solutions can be reused (although the difference is that the available storage size of each service can be varied).

The consensus at the PC meeting was that this paper might be useful for some cloud storage users and the idea is simple. The only major concern is that this paper did not consider the performance tradeoffs: how much degradation to the performance (latency and throughput)? There are also some concern about the trace-based evaluation, but it was still acceptable as a timely topic and potential usefulness.

Kineograph: Taking the Pulse of a Fast-Changing and Connected World

Raymond Cheng (University of Washington), Ji Hong (Fudan University), Aapo Kyrola (Carnegie Mellon University), Youshan Miao (University of Science and Technology of China), Xuetian Weng (Peking University), Ming Wu, Fan Yang, and Lidong Zhou (Microsoft Research Asia), Feng Zhao (Microsoft Research Asia), and Enhong Chen (University of Science and Technology of China)

The program committee felt that the paper presented a useful system for carrying out large-scale graph computation on rapidly changing graphs. Decoupling graph mining from graph updates to handle continuously changing graphs is very relevant in the context of the current popularity of social networking services. The performance evaluation based on a 49 machine cluster demonstrates that the system can cope with a graph update rate up to an order of magnitude higher than today's Twitter firehose. However, the program committee was concerned about several potential weaknesses exhibited by Kineograph. The submitted paper was lacking an evaluation of the system's fault tolerance properties which has been addressed in the revised version of the paper. The handling and decay of outdated data is not solved in the current system and the paper offers only preliminary solutions. The program committee also found the system itself to be fairly straight forward and that the paper did not sufficiently discuss its design choices and difficulties. Nonetheless, the program committee quite liked the system and thought that Kineograph makes significant contributions in an space of growing importance.

Jockey: Guaranteed Job Latency in Data Parallel Clusters

Andrew D. Ferguson (Brown University), Peter Bodik and Srikanth Kandula (Microsoft Research), Eric Boutin (Microsoft), and Rodrigo Fonseca (Brown University)

This paper proposes Jockey, a system that provides execution time service-level agreements (SLAs) for data-processing jobs in frameworks like MapReduce and Dryad. Jockey uses offline execution of the jobs in a simulator to estimate their remaining time, as a function of the progress they have already made and the amount of resources dedicated to them. Jockey's dynamic control policy monitors job progress and adjusts the resource allocation to ultimately meet the SLA.

The program committee felt that the paper deals with an important problem, namely providing quality of service guarantees for data-processing frameworks. The paper also includes a strong and well-executed evaluation of Jockey using real workloads on a real production cluster. One challenge, of course, is that Jockey depends on the accuracy of its simulations or analytic models to meet the SLAs. In addition, Jockey depends on parameters that need to be set by hand. Further experience with the system may allow some of them to be set automatically.

The Xen-Blanket: Virtualize Once, Run Everywhere

Dan Williams (Cornell/IBM), Hani Jamjoom (IBM T. J. Watson Research Lab), and Hakim Weatherspoon (Cornell University)

The paper describes Xen-Blanket, a system that uses nested virtualization to provide a homogenous VM abstraction across different clouds. An extensive evaluation shows that the overhead is low enough to be practical at least for certain applications. The paper also demonstrates some additional benefits that are enabled by nested virtualization, e.g., the ability to seamlessly migrate a VM from EC2 to another cloud, and a way to save money on EC2 by multiplexing several 'small' instances into one of the larger ones.

The program committee thought that, although nested virtualization technology has been explored before (e.g., in the Turtles paper, which was presented at OSDI 2010), this paper describes a nice use case with interesting practical implications: since the Xen Blanket does not require support from the underlying cloud and enables almost seamless migration between clouds, it can be used to create competition among cloud providers. The Xen Blanket comes at a nontrivial performance cost, but it also enables some other attractive features, such as user-defined oversubscription.

Isolating Commodity Hosted Hypervisors with HyperLock

Zhi Wang, Chiachih Wu, and Michael Grace (North Carolina State University) and Xuxian Jiang (North Carolina State Univeristy)

The paper describes an approach to reduce the TCB and thus the attack surface of hosted hypervisors by moving a part of the hypervisor into a per-VM protection domain. The approach is demonstrated by sandboxing a part of Linux/KVM using compiler-based methods (NaCl-like) such that it can continue running as kernel module and in kernel mode with few modifications.

The reviewers felt on the positive side, that the paper reports a novel point in the design space for VMM implementations that is demonstrated by a complete implementation and some evaluation. On the critical side, the reviewers were not fully convinced to which extent vulnerabilities are removed. They also were critical regarding the comparison with approaches that run per-VMM protection domains in user space.

Even so, the PC liked the paper and was quite happy to accept for the Eurosys program.

Delusional Boot: Securing Cloud Hypervisors without Massive Re-engineering

Anh M. Nguyen (UIUC), Himanshu Raj (Microsoft Research), Shravan Rayanchu (U. of Wisconsin), and Stefan Saroiu and Alec Wolman (Microsoft Research)

The security of "Infrastructure as a Service" providers depends to a large part on the sandboxing capability of the hypervisor used to securely multiplex physical machines. Unfortunately, the "attack surface" of an x86 hypervisor is quite large, because an existing OS running as a guest expects to see a wide variety of legacy PC hardware during the boot process, even though it barely touches this hardware once it is up and running.

Delusional Boot reduces this attack surface by removing support for such devices from the hypervisor code base (Hyper-V in this case). An untrusted guest OS is first booted on a hypervisor which supports full legacy hardware, but is itself sandboxed on a separate physical machine (and uses attestation). This is then snapshotted and migrated to a restricted hypervisor for production use.

Reviewers liked many of the details of the approach, in particular the analysis of device dependencies. The basic idea is sound, and the description of what was required is nice and clear. While it's unlikely that the system here could be reproduced outside of Microsoft, it's clear from the paper how to apply the idea to other hypervisors, and the device dependency analysis (which is most of the hard work) carries across.

The paper (perhaps inadvertently) raises interesting questions about what it means to have "simple", "clean", "narrow" interface to a secure system, and how such things can be measured.

However, there were concerns about the wider significance of the result. While one certainly expects that having less code in the hypervisor (and a narrower interface to virtual hardware) will lead to a more secure hypervisor, there is no quantitative argument in the paper (and it's unclear that one could be formulated), so we have to take it on trust that all this work is worthwhile. Indeed, information about sandboxing vulnerabilities on Hyper-V is not even public. As a result, this is not a "principles" paper, more a potentially useful (if disciplined, and interesting) hack.

A Critique of Snapshot Isolation

Daniel Gómez Ferro and Maysam Yabandeh (Yahoo! Research)

The committee felt this submission did a nice job of explaining a surprising result, that snapshot isolation can be tweaked to yield serializable executions with little or no performance cost. There was a concern that that the principal mechanism used was similar to that used by the TL2 transactional memory system, and that there was a need for a more comprehensive discussion of other techniques for ensuirng that snapshot-consistent executions are serializable. Both of these concerns were address in the final version.

LazyBase: Trading freshness for performance in a scalable database

Jim Cipar and Greg Ganger (CMU) and Kimberly Keeton, Charles B. Morrey III, Craig Soules, and Alistair Veitch (HP Labs)

This paper introduces LazyBase, a database which offers improved performance for applications which can tolerate reading from stale data. LazyBase seeks to simultaneously improve data ingest rate and give applications the ability to trade freshness for performance. LazyBase has a pipelined architecture with different stages for accepting updates, preprocessing updates, and applying updates in batches. Ordinary queries access only the data in the last stage; queries with higher freshness demands can look at the data in additional stages. The evaluation shows that LazyBase scales better than Cassandra and outperforms it by a factor of two to five.

The program committee very much enjoyed this paper which addresses a new and relevant problem. The idea of decomposing the ingest into a pipelined architecture allows the implementation of a number of interesting optimizations. The evaluation is very thorough and demonstrates that this architecture features very good performance properties compared to Cassandra.

This paper advocates separating the traditional fuzzy notion of eventual consistency into more well-defined properties, including what subset of updates are visible to readers (e.g., updates across the entire dataset or to just a subset), in what order those updates appear, and whether an update remains visible to a reader once the reader has seen it. The authors distinguish these properties from freshness (how long it takes to see the updates), which is the whole point of the paper: in LazyBase, read queries always return consistent results but they have the ability to trade freshness for performance.

Cache Craftiness for Fast Multicore Key-Value Storage

Yandong Mao (MIT), Eddie Kohler (Harvard), and Robert Morris (MIT)

The paper presents a tree based key value store optimized for large numbers of small key/small data sets with lots of updates, something common in emerging internet applications. They then go into their design, drawing on related work and incorporating key pieces from various sources to give themselves various performance and efficiency advantages, particularly on multicore. Their evaluation shows distinct advantages over another contemporary key/value stores and a contemporary database (MongoDB and VoltDB). More specifically, in terms of ops per second it is at least an order of magnitude faster than state-of-the-art in-memory databases. When compared with memcached they achieve relative performance parity for reads and advantages over memcached for writes.

The initial consensus of the program committee was that the paper was interesting, well written and addressed an important problem with a heavily used component in today's distributed systems. The treatment of the related work was very thorough, and the author's implementation relied heavily on the careful integration of prior art to achieve their performance. The authors did a good job outlining what their particular contributions were in relation to the prior art they integrated. The evaluation methodology and analysis were clearly written and thorough.

There was much initial concern over the evaluation, in particular that it focused on evaluation against heavier weight databases. In their rebuttal, the authors promised a more detailed comparison against Redis which seemed to be a better comparison point which they delivered on in the final version of the paper. The other major concern was that there wasn't sufficiently new ideas in the paper since it was predominately based on existing scalable data structures -- however, I believe the general feeling was the combination and optimization of these elements made the paper sufficiently novel on its own.

MadLINQ: Large-Scale Distributed Matrix Computation for the Cloud

Zhengping Qian and Xiuwei Chen (Microsoft Research Asia), Nanxi Kang and Mingcheng Chen (Shanghai Jiaotong University), Yuan Yu (Microsoft Research Silicon Valley), and Thomas Moscibroda and Zheng Zhang (Microsoft Research Asia)

The program committee appreciated MadLINQ's numerous advantages: MadLINQ unifies in a familiar imperative language both relational and matrix operations (the latter include graph analyses as a special case). Such unification is valuable because many complex real applications include some parts that are well suited to relational manipulation, other parts better suited to matrix operations, and still other parts that demand ordinary imperative programming. MadLINQ furthermore provides a unified approach to frugal re-computation via a novel lightweight pipelined dataflow DAG execution protocol. Frugal re-computation is a desirable way to cope with failures in cloud deployments. Furthermore, minimalist re-computation is also extremely useful for dealing with minor perturbations of input data (e.g., adding a single new entry to a matrix, adding a new edge to a graph). MadLINQ provides a unified solution to re-computation regardless of motive. The PC also appreciated the thoroughness of the contribution as a top-to-bottom systems paper, spanning language-level issues to cluster deployment, and the diligence of the implementation as exemplified by the use of TLA verification.

The PC felt that the initial submission of this paper was unclear in places and emphasized examples that did not showcase MadLINQ's advantages in the most compelling possible way. These clarity and presentation issues have been addressed in the final version. One concern that was partially but not completely addressed was that the full generality of the paper's contributions aren't articulated as thoroughly as possible. For example, there's good reason to believe that the fine-grained pipelining execution model is applicable beyond tile-based matrix computations. Similarly the full generality and usefulness of the re-computation mechanism could be emphasized and stated more clearly. Overall the PC felt that the paper teaches powerful, principled, general new methods and will be enjoyed by EuroSys attendees.

Jettison: Efficient Idle Desktop Consolidation with Partial VM Migration

Nilton Bila and Eyal de Lara (University of Toronto), Kaustubh Joshi, H. Andres Lagar-Cavilla, and Matti Hiltunen (AT&T Labs Research), and Mahadev Satyanarayanan (Carnegie Mellon University)

This paper introduces Partial VM Migration, a technique that transparently migrates only the working set of an idle VM running on an idle desktop computer to a consolidation server. The desktop can then be sent to a low-power state. If the partially migrated VM then accesses memory or disk blocks that are not available locally, the desktop may have to be awakened (via wake-on-LAN) to serve those accesses. The technique was implemented in Jettison, a prototype system for consolidating idle VMs.

The results show that Jettison achieves high energy savings and low network traffic, while providing reasonable responsiveness to users. The PC felt that the paper's idea was reasonably novel, and the analysis was good.

There was some feeling in the PC that this was only an incremental contribution, since full migration for power savings with desktops has been used before, and on-demand migration of VMs is used in other contexts. There was also a number of concerns, addressed through shepherding on use of paravirtualization, migration over the WAN, lack of discussion on failure semantics, and analysis with workloads when "idle".

Practical TDMA for Datacenter Ethernet

Bhanu C. Vattikonda, George Porter, Amin Vahdat, and Alex C. Snoeren (University of California, San Diego)

The reviewers all agreed that this paper was interesting, due to the practical approach taken. The basic idea is to use a rarely employed feature of the Ethernet MAC layer protocol that allows a switch to introduce very short time frame pauses at each host - this can be re-purposed and used by a centralised scheduler to then allocate slots for senders, to mitigate the well-known TCP incast behaviour of map/reduce style workloads.

In contrast to specialised hardware used for Real Time Ethernet, this paper just involved a software shim on top of existing NIC drivers, and works (empirically) with existing widely used commodity switches.

Several of the reviewers questioned the way the software was done, and asked whether this couldn't have been integrated more with the network virtualisation on a hypervisor, rather than relying on the Myricom user space packet processing. There was also a discussion about the impact on long-lived flows, which the authors responded to, and the overheads of collecting the data that feeds into the aforesaid scheduler. These questions could be re-posed at the conference presentation and generate useful discussion of where the work can lead next.

Scalable Testing of File System Checkers

Joao Carreira and Rodrigo Rodrigues (MPI-SWS), George Candea (EPFL), and Rupak Majumdar (MPI-SWS)

In this paper the authors describe a testing tool called SWIFT that is used to test file system checkers like fsck. The authors describe their testing methodology, tools, and experiments. Their methodology consists of taking the checking algorithm of the existing fsck tool as a golden model and validate it against the repair algorithm. You would think that this relationship always holds but as the authors show it turns out not to be the case for a variety of file systems that they explored. Including the often used Linux e2fs and Reiser file systems.

The program committee felt that this paper was a good example of test generation which is a critical aspect of system software development. That said, we had an extensive debate about whether the assumed golden model was sufficient and whether a formal model specification would form a better basis for generating these tests. The reviewers were also concerned with the limitation of the detection methodology, especially the single field corruption model which they felt was overly restrictive.

Delta FTL: Improving SSD Lifetime via Exploiting Content Locality

Guanying Wu and Xubin He (Virginia Commonwealth University)

This paper proposes an interesting flash translation design for solid-state drives whose goal is to reduce wear. The design works by delta-encoding modifications arising from repeated writes to the same block; their system identifies blocks that are repeatedly written but rarely read, and for these, the extra read overhead necessitated by combining the deltas to reproduce the original data is more than made up for by the savings from writes. Their simuation-based results demonstrate that garbage collection writes are reduced by a factor of 2, write latency is reduced by more than 33%, and read latency increases by only 5 or 6%.

The reviewers were enthusiastic about the idea: trading off reduced wear and improved write latency for slightly reduced read latency might be problematic, but the the paper was convincing in demonstrating that the system could automatically identify the write-hot / read-cold blocks that benefit. Some reviewers had concerns that important metrics (such as the lifetime of the SSD, or overall benchmark run-time rather than average I/O latency) were not provided, but overall the feeling was that the evaluation that was provided was robust. Other reviewers were concerned about potential interactions between delta encoding and the ability to deduplicate, or between file-system level compression or encryption and the ability of a FTL to compute effective deltas at the block level; this seems more like a more fundamental issue.

Overall, then the reviewers' consensus was that the novelty of the idea and the promise demonstrated by the evaluation outweighed the concerns about any specific methodological inadequacies, and therefore the PC decided to accept the paper.

FlashTier: A Lightweight, Consistent and Durable Storage Cache

Mohit Saxena, Michael M. Swift, and Yiying Zhang (University of Wisconsin-Madison)

FlashTier takes the intuitive idea that flash-based storage devices present a great opportunity as a performance-enhancing tier in front of conventional magnetic disks, and explores how both operating system and device implementations might usefully change in order to better support it. In particular, the authors observe that solid-state disks might do a lot better at implementing caches if they stopped trying so hard to be disks: the system eschews complexity surrounding things like FTL-based garbage collection in favor of relatively simple interfaces that allow the device to provide efficient caching, and quickly recover to a useful state after reboots.

The reviewers gave this paper a lot of credit for its whole-system consideration of the problem and for exploring a solution that didn't limit itself to a specific software layer. There was also a degree of enthusiasm for a position that argued for specific hardware changes in order to make software more effective -- basically that caching would be better served by a different sort of solid state device. Some concerns were raised, in particular around an evaluation that was entirely based on simulation, and a failure to compare the benefits of this approach to existing, deployed systems with commodity SSDs. However, there was also broad agreement that the system had been carefully and convincingly evaluated.

Fast Black-Box Testing of System Recovery Code

Radu Banabic and George Candea (EPFL)

The paper argues in favor of automated injection of faults for testing applications and describes a method that can somewhat efficiently search the space of possible faults, so that testing can be done more efficiently. If a bug is discovered the system looks for related manifestations of the same bug. The authors applied the method to MySQL, Apache, and some UNIX utilities and, in doing so, found some new bugs.

All reviewers greatly appreciated the usefulness of an automated tool to search the (enormous) fault-space. Moreover, the method seems to work, as witnessed by the bugs it found in mature code.

The PC had some concerns about how well the technique will work on immature code. Unlike Apache, say, which has well-honed and systematic error-handling mechanisms and thus may be relatively easy to find, code that is not mature may not have such a useful regularity. The question raised in the PC was whether this could lead to false positives.

There was also a question whether the -- intellectually interesting -- contribution will really turn out to be that significant in practice. Especially, given the already existing systems that auto-detect the kinds of faults injected by the method proposed in the paper.

Overall, however, the PC liked the paper. The practical results on real software are nice, the explanation is very clearly, and given the importance of research on testing, there was an easy consensus to accept this paper.

CheapBFT: Resource-efficient Byzantine Fault Tolerance

Rüdiger Kapitza (TU Braunschweig), Johannes Behl (Friedrich-Alexander University Erlangen-Nuremberg), Christian Cachin (IBM Research - Zurich), Tobias Distler and Simon Kuhnle (Friedrich-Alexander University Erlangen-Nuremberg), Seyed Vahid Mohammadi (KTH – Royal Institute of Technology), and Wolfgang Schröder-Preikschat and Klaus Stengel (Friedrich-Alexander University Erlangen-Nuremberg)

The paper address one of the main weaknesses of Byzantine Fault Tolerance: the large number of replicas required to achieve fault tolerance. A novel mechanism is presented that reduces the number of replicas by relying on trusted hardware, a custom FPGA, that in the presence of f faults reduces the number of replicas required to 2f+1, of which only f+1 are actively processing requests in the fault-free case. In addition, f additional replicas are supposed to witness the process and become active when faults are detected, in which case a change protocol is used to transition between the fast and the slow protocol. Thus the system is a hybrid between a fast-but-optimistic and a slow-but-safe Byzantine Fault Tolerance protocol, and a transition protocol to switch between the two.

The fundamental assumption is that the machines running the protocols are safe from physical tampering and that they run an FPGA-based trusted sub-system that supports message certificates. The resulting system performs much better than other protocols in scenarios where no fault occurs. During a fault, the system becomes briefly unavailable while the protocol switch takes place.

The paper is well-written, attacks an important problem, and presents a cool piece of custom hardware for BFT protocols.

The reliance on special purpose hardware, despite being an interesting idea, could be supported by stronger arguments and ideally with some evidence that it's a good choice for supporting the trusted components. In particular, it would be good to have numbers about the failure rate of FPGAs and about the reliability of the software infrastructure involved to program the FGPAs.

A main concern of the reviewers is that the protocols do not seem to be proven correct---such a proof would strengthen the paper considerably.

Another concern is the applicability of the approach: what applications will benefit from this type of system? If the motivation hinges on the premise that reducing the cost of using many replicas is the main problem for adoption of BFT The trouble is with the way the paper positions itself. The key argument that is being made is that BFT protocols are not being adopted more widely because they require too many resources. We can debate whether this is the only (or indeed the main) reason, but even if we were to agree on this, the paper doesn't really appear to do anything to solve this problem. Basically, the protocol is optimized for the (hopefully common) fault-free case, in which it only uses f+1 active replicas. But we still have to provision 2f+1 replicas overall, so we still have to buy the same number of machines as with, say, A2M-PBFT-EA (which also relies on trusted hardware). True, f of the 2f+1 machines are relatively lightly loaded, but how can we benefit from this; for instance, we can't really use the machines to do other work, because they may need to be activated at any time when a fault appears.

All in all it is, despite its limitations, a good paper.

Canal: Scaling social network-based Sybil tolerance schemes

Bimal Viswanath, Mainack Mondal, and Krishna Gummadi (MPI-SWS), Alan Mislove (Northeastern University), and Ansley Post (MPI-SWS)

The program committee felt that this paper had two core contributions; namely, its modeling of past work on Sybil-tolerance into a single unified framework based on max-flow payments in credit networks and its approach to make the credit flow computation more scalable through the use of landmark routing. The committee's main concerns revolved around the significance of Sybil attacks and whether it made sense to attack the problem based solely on data about the social network graph and user interactions.. Specifically, is the significance of the Sybil problem eclipsed by the problem posed by legitimate identities hijacked by botted hosts? Are there other techniques for identifying Sybils, such as online behavioral profiling, that could potentially be far more effective than schemes that rely on the structure of the social networking graph coupled with data on user interactions? Does the approach make any assumptions about the mixing time of social networks, and do empirical measurements support this assumption? The committee felt that the paper's core contributions outweigh the first two concerns, and the last concern was addressed in the revised draft of the paper.

Improving Interrupt Response Time in a Verifiable Protected Microkernel

Bernard Blackham, Yao Shi, and Gernot Heiser (NICTA and University of New South Wales)

The committee held that the paper makes a strong contribution to the design and development of operating-system kernels (in particular microkernels) that are suited to domains at the intersection of security and real-time processing. Although the re-engineering aspects presented in the paper are discussed in a seL4 context, the PC considered them extremely valuable for other systems as well.

After a lengthy discussion in the PC about how much of the contribution was already in closely related work by the same group in RTSS'11 ('Timing analysis of a protected operating system kernel') , the PC concluded that the contributions were indeed different and both were interesting in their own right. This issue was extensively addressed during the shepherding process and resolved convincingly in the final version of the paper. Especially the kernel design parts of the paper as well as the evaluation section have improved considerably. In summary, the committee feels that the paper is especially strong at describing challenging facts in the balancing act of applying formal methods to time-predictable operating-system kernels.

Improving network connection locality on multicore systems

Aleksey Pesterev (MIT), Jacob Strauss (Quanta Research Cambridge), and Nickolai Zeldovich and Robert T. Morris (MIT)

There are basically two ways to get scalable operating system performance on many-core processors: redesign the kernel from scratch, as in the Barrelfish multikernel (Baumann et al., SOSP 2009), or fix the scaling bugs in a traditional kernel, as in prior work on Linux scalability from MIT (Boyd-Wickizer et al., OSDI 2010).

This paper takes the second approach, focussing on improving the scalability of TCP connection processing, for applications such as Web servers. In particular, the paper describes Affinity-Accept, a redesign of the accept() system call so that each core has its own lock, and so that all processing for a connection normally happens on one core. Such connection affinity leads to greatly improved CPU-cache affinity, and so to better performance.

The paper also addresses the need to handle the load imbalances that would result from rigidly enforcing affinity, and shows that this requires both fine-grained and coarse-grained load balancing. Their benchmark tests showed significant performance improvements.

The reviewers liked the simple, elegant, and thorough design, the highly readable presentation, and the detailed performance analysis. In that respect, this is a model of a well-written systems paper.

However, the paper's focus is narrow, and there is no attempt to generalize the lessons beyond the specific problem of network connection locality. The authors agree that the paper yields no broader insights into API design, although they do provide some suggestions for designers of future network interfaces. The PC would be happy to find that this paper inspires future work that does lead to broader insights.

TM2C: a Software Transactional Memory for Many-Cores

Vincent Gramoli, Rachid Guerraoui, and Vasileios Trigonakis (EPFL)

The comittee felt that this paper made valuable contributions to understanding how software transactional memory systems (and similar systems) could be implemented on non-cache-coherent architectures. There was some concern that it was unclear how to compare these results to similar systems on other architectures, an issue that was addressed in the final version of the paper by describing an implementation on a cache-coherent architecture.

BWS: Balanced Work Stealing for Time-Sharing Multicores

Xiaoning Ding (Intel ISTC for Cloud Computing), Kaibo Wang (The Ohio State University), Phillip B. Gibbons (Intel ISTC for Cloud Computing), and Xiaodong Zhang (The Ohio State University)

The paper presents a new work-stealing task scheduler, focusing on multiprogrammed environments; that is, environments in which multiple applications are running concurrently and each one using work-stealing. The paper demonstrates problems with the state of the art in multiprogrammed environments, presents an improved efficient Work-stealing Scheduler (BWS), and evaluates the implementation of BWS in cilk++.

The PC liked this paper because BWS solves what could be a real problem, BWS is implemented and evaluated for a real system (cilk++), the BWS algorithm is relatively simple, and the evaluation is thorough and demonstrates a real improvement. Finally, the paper is well written.


[30/04/12] Added some photos of the conference.

[19/04/12] With almost 300 participants and a great program, EuroSys 2012 was a success. Thank you for coming and see you next year!

[3/04/12] Conference proceedings.

[3/04/12] Program booklet.

[30/03/12] List of posters online.

[21/03/12] Detailed conference program online.

Follow us



In cooperation with