Verus: A Practical Foundation for Systems Verification.
Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), Awarded "Distinguished Artifact Award", November 2024. abstract code
Formal verification is a promising approach to eliminate bugs at compile time, before they ship. Indeed, our community has verified a wide variety of system software. However, much of this success has required heroic developer effort, relied on bespoke logics for individual domains, or sacrificed expressiveness for powerful proof automation. Building on prior work on Verus, we aim to enable faster, cheaper verification of rich properties for realistic systems. We do so by integrating and optimizing the best choices from prior systems, tuning our design to overcome barriers encountered in those systems, and introducing novel techniques. We evaluate Verus's effectiveness with a wide variety of case-study systems including distributed systems, an OS page table, a library for NUMA-aware concurrent data structure replication, a crash-safe storage system, and a concurrent memory allocator, together comprising 6.1K lines of implementation and 31K lines of proof. Verus verifies code 3--61x faster and with less effort than the state of the art. Our results suggest that Verus offers a platform for exploring the next frontiers in system verification research. Because Verus builds on Rust, Verus is also positioned for wider use in production by developers who have already adopted Rust in the pursuit of more robust systems.
|
Armada: Automated Verification of Concurrent Code with Sound Semantic Extensibility.
ACM Transactions on Programming Languages and Systems (TOPLAS) 44(2), June 2022. abstract code
Safely writing high-performance concurrent programs is notoriously difficult. To aid developers, we introduce Armada, a language and tool designed to formally verify such programs with relatively little effort. Via a C-like language and a small-step, state-machine-based semantics, Armada gives developers the flexibility to choose arbitrary memory layout and synchronization primitives so that they are never constrained in their pursuit of performance. To reduce developer effort, Armada leverages SMT-powered automation and a library of powerful reasoning techniques, including rely-guarantee, TSO elimination, reduction, and pointer analysis. All of these techniques are proven sound, and Armada can be soundly extended with additional strategies over time. Using Armada, we verify five concurrent case studies and show that we can achieve performance equivalent to that of unverified code.
|
Armada: Low-Effort Verification of High-Performance Concurrent Programs.
Proceedings of the ACM International Conference on Programming Language Design and Implementation (PLDI), Awarded "Distinguished Paper Award", June 2020. abstract video slides code artifact
Safely writing high-performance concurrent programs is notoriously difficult. To aid developers, we introduce Armada, a language and tool designed to formally verify such programs with relatively little effort. Via a C-like language and a small-step, state-machine-based semantics, Armada gives developers the flexibility to choose arbitrary memory layout and synchronization primitives so they are never constrained in their pursuit of performance. To reduce developer effort, Armada leverages SMT-powered automation and a library of powerful reasoning techniques, including rely-guarantee, TSO elimination, reduction, and alias analysis. All these techniques are proven sound, and Armada can be soundly extended with additional strategies over time. Using Armada, we verify four concurrent case studies and show that we can achieve performance equivalent to that of unverified code.
|
Capturing and Enhancing In Situ System Observability for Failure Detection.
Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2018. abstract code
Real-world distributed systems suffer unavailability due to various types of failure. But despite enormous effort, many failures, especially gray failures, still escape detection. In this paper, we argue that the missing piece in failure detection is detecting what the requesters of a failing component see. This insight leads us to the design and implementation of Panorama, a system designed to enhance system observability by taking advantage of the interactions between a system's components. By providing a systematic channel and analysis tool, Panorama turns a component into a logical observer so that it not only handles errors, but also reports them. Furthermore, Panorama incorporates techniques for making such observations even when indirection exists between components. Panorama can easily integrate with popular distributed systems and detect all 15 real-world gray failures that we reproduced in less than 7 s, whereas existing approaches detect only one of them in under 300 s.
|
Vale: Verifying High-Performance Cryptographic Assembly Code.
Proceedings of the USENIX Security Symposium, Awarded "Distinguished Paper Award", August 2017. abstract code
High-performance cryptographic code often relies on complex hand-tuned assembly language that is customized for individual hardware platforms. Such code is difficult to understand or analyze. We introduce a new programming language and tool called Vale that supports flexible, automated verification of high-performance assembly code. The Vale tool transforms annotated assembly language into an abstract syntax tree (AST), while also generating proofs about the AST that are verified via an SMT solver. Since the AST is a first-class proof term, it can be further analyzed and manipulated by proven-correct code before being extracted into standard assembly. For example, we have developed a novel, proven-correct taint-analysis engine that verifies the code's freedom from digital side channels. Using these tools, we verify the correctness, safety, and security of implementations of SHA-256 on x86 and ARM, Poly1305 on x64, and hardware-accelerated AES-CBC on x86. Several implementations meet or beat the performance of unverified, state-of-the-art cryptographic libraries.
|
IronFleet: Proving Safety and Liveness of Practical Distributed Systems.
Communications of the ACM (CACM) 60(7), July 2017. abstract video code
Distributed systems are notorious for harboring subtle bugs. Verification can, in principle, eliminate these bugs, but it has historically been difficult to apply at full-program scale, much less distributed system scale. We describe a methodology for building practical and provably correct distributed systems based on a unique blend of temporal logic of actions-style state-machine refinement and Hoare-logic verification. We demonstrate the methodology on a complex implementation of a Paxos-based replicated state machine library and a lease-based sharded key-value store. We prove that each obeys a concise safety specification as well as desirable liveness requirements. Each implementation achieves performance competitive with a reference system. With our methodology and lessons learned, we aim to raise the standard for distributed systems from "tested" to "correct."
|
Gray Failure: The Achilles' Heel of Cloud-Scale Systems.
Proceedings of the ACM Workshop on Hot Topics in Operating Systems (HotOS), May 2017. abstract
Cloud scale provides the vast resources necessary to replace failed components, but this is useful only if those failures can be detected. For this reason, the major availability breakdowns and performance anomalies we see in cloud environments tend to be caused by subtle underlying faults, i.e., gray failure rather than fail-stop failure. In this paper, we discuss our experiences with gray failure in production cloud-scale systems to show its broad scope and consequences. We also argue that a key feature of gray failure is differential observability: that the system's failure detectors may not notice problems even when applications are afflicted by them. This realization leads us to believe that, to best deal with them, we should focus on bridging the gap between different components' perceptions of what constitutes failure.
|
Realizing the Fault-Tolerance Promise of Cloud Storage Using Locks with Intent.
Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), November 2016. abstract
Cloud computing promises easy development and deployment of large-scale, fault-tolerant, and highly available applications. Cloud storage services are a key enabler of this, because they provide reliability, availability, and fault tolerance via internal mechanisms that developers need not reason about. Despite this, challenges remain for developers of distributed cloud applications. They still need to make their code robust against failures of the machines running the code, and to reason about concurrent access to cloud storage by multiple machines. We address this problem with a new abstraction, called locks with intent, which we implement in a client library called Olive. Olive makes minimal assumptions about the underlying cloud storage, enabling it to operate on a variety of platforms including Amazon DynamoDB and Microsoft Azure Storage. Leveraging the underlying cloud storage, Olive's locks with intent offer strong exactly-once semantics for a snippet of code despite failures and concurrent duplicate executions. To ensure exactly-once semantics, Olive incurs the unavoidable overhead of additional logging writes. However, by decoupling isolation from atomicity, it supports consistency levels ranging from eventual to transactional. This flexibility allows applications to avoid costly transactional mechanisms when weaker semantics suffice. We apply Olive's locks with intent to build several advanced storage functionalities, including snapshots, transactions via optimistic concurrency control, secondary indices, and live table re-partitioning. Our experience demonstrates that Olive eases the burden of creating correct, fault-tolerant distributed cloud applications.
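For intuition, here is a minimal sketch of how a lock-with-intent might be layered over a store that offers conditional (ETag-style) writes; the store interface, field names, and lease logic below are illustrative assumptions, not Olive's actual API.

```python
import time
import uuid

class MemoryStore:
    """Tiny in-memory stand-in for a cloud table that supports conditional
    writes (compare-and-swap on an ETag), the only primitive this sketch
    assumes from DynamoDB- or Azure-style storage."""
    def __init__(self):
        self.data = {}                        # key -> (value, etag)
    def get(self, key):
        return self.data.get(key, (None, None))
    def put_if_match(self, key, value, etag):
        if self.data.get(key, (None, None))[1] != etag:
            return False                      # someone else wrote first
        self.data[key] = (value, str(uuid.uuid4()))
        return True

def run_once(store, key, intent, execute, lease_s=5.0):
    """Exactly-once execution of a snippet under a hypothetical
    lock-with-intent; illustrative, not Olive's interface."""
    while True:
        rec, etag = store.get(key)
        if rec and rec["done"]:
            return rec["result"]              # already completed by someone
        if rec and time.time() - rec["ts"] < lease_s:
            time.sleep(0.05)                  # a live holder owns the lock
            continue
        # Acquire (or take over) the lock, persisting the intent first so
        # that a successor finishes the *same* snippet if we crash mid-way.
        rec = {"intent": rec["intent"] if rec else intent,
               "ts": time.time(), "done": False, "result": None}
        if not store.put_if_match(key, rec, etag):
            continue                          # lost the race; re-read
        _, etag = store.get(key)
        result = execute(rec["intent"])       # the protected snippet
        if store.put_if_match(key, dict(rec, done=True, result=result), etag):
            return result                     # committed exactly once

print(run_once(MemoryStore(), "acct-42", {"op": "debit", "amt": 10},
               lambda i: "applied %s x%d" % (i["op"], i["amt"])))
```

The key point, per the abstract, is that the intent is persisted before the snippet runs, so a machine that takes over an expired lock completes the recorded intent rather than starting a conflicting one.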
|
IronFleet: Proving Practical Distributed Systems Correct.
Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), October 2015. abstract video slides code
Distributed systems are notorious for harboring subtle bugs. Verification can, in principle, eliminate these bugs a priori, but verification has historically been difficult to apply at full-program scale, much less distributed-system scale. We describe a methodology for building practical and provably correct distributed systems based on a unique blend of TLA-style state-machine refinement and Hoare-logic verification. We demonstrate the methodology on a complex implementation of a Paxos-based replicated state machine library and a lease-based sharded key-value store. We prove that each obeys a concise safety specification, as well as desirable liveness requirements. Each implementation achieves performance competitive with a reference system. With our methodology and lessons learned, we aim to raise the standard for distributed systems from "tested" to "correct."
|
Tardigrade: Leveraging Lightweight Virtual Machines to Easily and Efficiently Construct Fault-Tolerant Services.
Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), May 2015. abstract slides
Many services need to survive machine failures, but designing and deploying fault-tolerant services can be difficult and error-prone. In this work, we present Tardigrade, a system that deploys an existing, unmodified binary as a fault-tolerant service. Tardigrade replicates the service on several machines so that it continues running even when some of them fail. Yet, it keeps the service states synchronized so clients see strongly consistent results. To achieve this efficiently, we use lightweight virtual machine replication. A lightweight virtual machine is a process sandboxed so that its external dependencies are completely encapsulated, enabling it to be migrated across machines. To let unmodified binaries run within such a sandbox, the sandbox also contains a library OS providing the expected API. We evaluate Tardigrade's performance and demonstrate its applicability to a variety of services, showing that it can convert these services into fault-tolerant ones transparently and efficiently.
|
Ironclad Apps: End-to-End Security via Automated Full-System Verification.
Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2014. abstract code
An Ironclad App lets a user securely transmit her data to a remote machine with the guarantee that every instruction executed on that machine adheres to a formal abstract specification of the app's behavior. This does more than eliminate implementation vulnerabilities such as buffer overflows, parsing errors, or data leaks; it tells the user exactly how the app will behave at all times. We provide these guarantees via complete, low-level software verification. We then use cryptography and secure hardware to enable secure channels from the verified software to remote users. To achieve such complete verification, we developed a set of new and modified tools, a collection of techniques and engineering disciplines, and a methodology focused on rapid development of verified systems software. We describe our methodology, formal results, and lessons we learned from building a full stack of verified software. That software includes a verified kernel; verified drivers; verified system and crypto libraries including SHA, HMAC, and RSA; and four Ironclad Apps.
|
Zero-Effort Payments: Design, Deployment, and Lessons.
Proceedings of the ACM International Joint Conference on Pervasive and Ubiquitous Computing (Ubicomp), September 2014. abstract
This paper presents Zero-Effort Payments (ZEP), a seamless mobile computing system designed to accept payments with no effort on the customer's part beyond a one-time opt-in. With ZEP, customers need not present cards or operate smartphones to convey their identities. ZEP uses three complementary identification technologies: face recognition, proximate device detection, and human assistance. We demonstrate that the combination of these technologies enables ZEP to scale to the level needed by our deployments. We designed and built ZEP, and demonstrated its usefulness across two real-world deployments that together lasted five months of continuous operation and served 274 customers. The different nature of our deployments stressed different aspects of our system. These challenges led to several system design changes to improve scalability and fault-tolerance.
|
Composing OS Extensions Safely and Efficiently with Bascule.
Proceedings of the European Conference on Computer Systems (Eurosys), April 2013. abstract
Library OS (LibOS) architectures implement the OS personality as a user-mode library, giving each application the flexibility to choose its LibOS. This approach is appealing for many reasons, not least the ability to extend or customise the LibOS. Recent work with Drawbridge showed that an existing commodity OS (Windows 7) could be refactored to produce a LibOS while retaining application compatibility. This paper presents Bascule, an architecture for LibOS extensions based on Drawbridge. Rather than relying on the application developer to customise a LibOS, Bascule allows OS-independent extensions to be attached at runtime. Extensions interpose on a narrow binary interface of primitive OS abstractions, such as files and virtual memory. Thus, they are independent of both guest and host OS, and composable at runtime. Since an extension runs in the same process as an application and its LibOS, it is safe and efficient. Bascule demonstrates extension reuse across diverse guest LibOSes (Windows and Linux) and host OSes (Windows and Barrelfish). Current extensions include file system translation, checkpointing, and architecture adaptation.
|
Shroud: Enabling Private Access to Large-Scale Data in the Data Center.
Proceedings of the USENIX Conference on File and Storage Technologies (FAST), February 2013. abstract slides
Recent events have shown online service providers the perils of possessing private information about users. Encrypting data mitigates but does not eliminate this threat: The pattern of data accesses still reveals information. Thus, we present Shroud, a general storage system that hides data access patterns from the servers running it, protecting user privacy. Shroud functions as a virtual disk with a new privacy guarantee: the user can look up a block without revealing the block's address. Such a virtual disk can be used for many purposes, including map lookup, microblog search, and social networking. Shroud aggressively targets hiding accesses among hundreds of terabytes of data. We achieve our goals by adapting oblivious RAM algorithms to enable large-scale parallelization. Specifically, we show, via new techniques such as oblivious aggregation, how to securely use many inexpensive secure coprocessors acting in parallel to improve request latency. Our evaluation combines large-scale emulation with an implementation on secure coprocessors and suggests that these adaptations bring private data access closer to practicality.
|
Don't Lose Sleep Over Availability: The GreenUp Decentralized Wakeup Service.
Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), April 2012. abstract
Large enterprises can save significant energy and money by putting idle desktop machines to sleep. Many systems that let desktops sleep and wake them on demand have been proposed, but enterprise IT departments refuse to deploy them because they require special hardware, disruptive virtualization technology, or dedicated per-subnet proxies, none of which are cost-effective. In response, we devised GreenUp, a minimal software-only system that allows any machine to act as a proxy for other sleeping machines in its subnet. To achieve this, GreenUp uses novel distributed techniques that spread load through randomization, efficiently synchronize state within a subnet, and maintain a minimum number of proxies despite the potential for correlated sleep times. In this paper, we present the details of GreenUp's design as well as a theoretical analysis demonstrating its correctness and efficiency, using empirically-derived models where appropriate. We also present results and lessons from a seven-month live deployment on over 100 machines; a larger deployment on ~1,100 machines is currently ongoing.
|
Enabling Security in Cloud Storage SLAs with CloudProof.
Proceedings of the USENIX Annual Technical Conference (ATC), June 2011. abstract
Several cloud storage systems exist today, but none of them provide security guarantees in their Service Level Agreements (SLAs). This lack of security support has been a major hurdle for the adoption of cloud services, especially for enterprises and cautious consumers. To fix this issue, we present CloudProof, a secure storage system specifically designed for the cloud. In CloudProof, customers can not only detect violations of integrity, write-serializability, and freshness, they can also prove the occurrence of these violations to a third party. This proof-based system is critical to enabling security guarantees in SLAs, wherein clients pay for a desired level of security and are assured they will receive a certain compensation in the event of cloud misbehavior. Furthermore, since CloudProof aims to scale to the size of large enterprises, we delegate as much work as possible to the cloud and use cryptographic tools to allow customers to detect and prove cloud misbehavior. Our evaluation of CloudProof indicates that its security mechanisms have a reasonable cost: they incur a latency overhead of only 15% on reads and writes, and reduce throughput by around 10%. We also achieve highly scalable access control, with membership management (addition and removal of members' permissions) for a large proprietary software project with more than 5000 developers taking only a few seconds per month.
|
Memoir: Practical State Continuity for Protected Modules.
Proceedings of the IEEE Symposium on Security and Privacy (S&P), May 2011. abstract slides
To protect computation, a security architecture must safeguard not only the software that performs it but also the state on which the software operates. This requires more than just preserving state confidentiality and integrity, since, e.g., software may err if its state is rolled back to a correct but stale version. For this reason, we present Memoir, the first system that fully ensures the continuity of a protected software module's state. In other words, it ensures that a module's state remains persistently and completely inviolate. A key contribution of Memoir is a technique to ensure rollback resistance without making the system vulnerable to system crashes. It does this by using a deterministic module, storing a concise summary of the module's request history in protected NVRAM, and allowing only safe request replays after crashes. Since frequent NVRAM writes are impractical on modern hardware, we present a novel way to leverage limited trusted hardware to minimize such writes. To ensure the correctness of our design, we develop formal, machine-verified proofs of safety. To demonstrate Memoir's practicality, we have built it and conducted evaluations demonstrating that it achieves reasonable performance on real hardware. Furthermore, by building three useful Memoir-protected modules that rely critically on state continuity, we demonstrate Memoir's versatility.
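A minimal sketch of the hash-chained history summary the abstract describes, with illustrative names; the real system also tolerates a crash that lands between the NVRAM update and request completion, which this sketch glosses over.

```python
import hashlib

class MemoirLikeModule:
    """Sketch of the abstract's recipe: protected NVRAM holds only a hash
    chain summarizing the request history, while the state and log may live
    in untrusted storage. Names are illustrative."""
    def __init__(self, step):
        self.step = step            # deterministic: (state, request) -> state
        self.nvram = b"\x00" * 32   # stands in for protected NVRAM
        self.state = None
        self.log = []               # untrusted, replayable request log

    def handle(self, request: bytes):
        # Fold the request into the NVRAM summary before acting on it.
        self.nvram = hashlib.sha256(self.nvram + request).digest()
        self.log.append(request)
        self.state = self.step(self.state, request)

    def recover(self, logged_requests):
        """Accept a post-crash replay only if it matches the NVRAM summary;
        determinism guarantees the replay rebuilds exactly the same state."""
        h, state = b"\x00" * 32, None
        for req in logged_requests:
            h = hashlib.sha256(h + req).digest()
            state = self.step(state, req)
        if h != self.nvram:
            raise ValueError("log disagrees with NVRAM summary; replay refused")
        self.state, self.log = state, list(logged_requests)

m = MemoirLikeModule(step=lambda st, req: (st or b"") + b"|" + req)
m.handle(b"deposit:5"); m.handle(b"deposit:7")
m.recover(m.log)   # after a simulated crash: replay verifies and restores state
```

Because only the 32-byte summary must be rollback-protected, writes to scarce NVRAM stay small, which is the property the paper's trusted-hardware optimization then builds on.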
|
The Utility Coprocessor: Massively Parallel Computation from the Coffee Shop.
Proceedings of the USENIX Annual Technical Conference (ATC), June 2010. abstract
UCop, the "utility coprocessor," is middleware that makes it cheap and easy to achieve dramatic speedups of parallelizable, CPU-bound desktop applications using utility computing clusters in the cloud. To make UCop performant, we introduced techniques to overcome the low available bandwidth and high latency typical of the networks that separate users' desktops from a utility computing service. To make UCop economical and easy to use, we devised a scheme that hides the heterogeneity of client configurations, allowing a single cluster to serve virtually everyone: in our Linux-based prototype, the only requirement is that users and the cluster are using the same major kernel version. This paper presents the design, implementation and evaluation of UCop, employing 32-64 nodes in Amazon EC2, a popular utility computing service. It achieves 6-11x speedups on CPU-bound desktop applications ranging from video editing and photorealistic rendering to strategy games, with only minor modifications to the original applications. These speedups improve performance from the coffee-break timescale of minutes to the 15-20 second timescale of interactive performance.
|
Crom: Faster Web Browsing Using Speculative Execution.
Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), April 2010. abstract
Early web content was expressed statically, making it amenable to straightforward prefetching to reduce user-perceived network delay. In contrast, today's rich web applications often hide content behind JavaScript event handlers, confounding static prefetching techniques. Sophisticated applications use custom code to prefetch data and do other anticipatory processing, but these custom solutions are costly to develop and application-specific. This paper introduces Crom, a generic JavaScript speculation engine that greatly simplifies the task of writing low-latency, rich web applications. Crom takes preexisting, non-speculative event handlers and creates speculative versions, running them in a cloned browser context. If the user generates a speculated-upon event, Crom commits the precomputed result to the real browser context. Since Crom is written in JavaScript, it runs on unmodified client browsers. Using experiments with speculative versions of real applications, we show that pre-commit speculation overhead easily fits within user think time. We also show that speculatively fetching page data and precomputing its layout can make subsequent page loads an order of magnitude faster.
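A rough transliteration of the clone-then-commit idea into Python (Crom itself operates on JavaScript event handlers and browser contexts; the handler and context below are hypothetical):

```python
import copy

def make_speculative(handler):
    """Crom-style speculation: run the handler on a deep copy of the
    application context, committing the clone only if the user really
    generates the event."""
    def speculate(context, event):
        shadow = copy.deepcopy(context)   # the cloned "browser context"
        handler(shadow, event)            # precompute on the clone
        return shadow                     # kept aside until commit time
    return speculate

def on_click(ctx, event):                 # an ordinary, non-speculative handler
    ctx["page"] = "rendered " + event["target"]

context = {"page": "home"}
shadow = make_speculative(on_click)(context, {"target": "settings"})
# ...later, if the user actually clicks "settings", commit the precomputation:
context = shadow
print(context["page"])                    # -> rendered settings
```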
|
Matchmaking for Online Games and Other Latency-Sensitive P2P Systems.
Proceedings of the Conference on Computer Communications (SIGCOMM), August 2009. abstract slides
The latency between machines on the Internet can dramatically affect users' experience for many distributed applications. In multiplayer online games in particular, players seek to cluster themselves so that those in the same session have low latency to each other. A system that predicts latencies between machine pairs allows such matchmaking to consider many more machine pairs than can be probed in a scalable fashion while users are waiting. Using a far-reaching trace of latencies between players on over 3.5 million game consoles, we designed Htrae, a latency prediction system for game matchmaking scenarios. One novel feature of Htrae is its synthesis of geolocation with a network coordinate system. It uses geolocation to select reasonable initial network coordinates for new machines joining the system, allowing it to converge more quickly than standard network coordinate systems and produce substantially lower prediction error than state-of-the-art latency prediction systems. For instance, it produces 90th percentile errors less than half those of iPlane and Pyxida. Our design is general enough to make it a good fit for other latency-sensitive peer-to-peer applications besides game matchmaking.
|
TrInc: Small Trusted Hardware for Large Distributed Systems.
Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), Awarded "Best Paper", April 2009. abstract slides
A simple yet remarkably powerful tool of selfish and malicious participants in a distributed system is "equivocation": making conflicting statements to others. We present TrInc, a small, trusted component that combats equivocation in large, distributed systems. Consisting fundamentally of only a non-decreasing counter and a key, TrInc provides a new primitive: unique, once-in-a-lifetime attestations. We show that TrInc is practical, versatile, and easily applicable to a wide range of distributed systems. Its deployment is viable because it is simple and because its fundamental components, a trusted counter and a key, are already deployed in many new personal computers today. We demonstrate TrInc's versatility with three detailed case studies: attested append-only memory (A2M), PeerReview, and BitTorrent. We have implemented TrInc and our three case studies using real, currently available trusted hardware. Our evaluation shows that TrInc eliminates most of the trusted storage needed to implement A2M, significantly reduces communication overhead in PeerReview, and solves an open incentives issue in BitTorrent. Microbenchmarks of our TrInc implementation indicate directions for the design of future trusted hardware.
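A sketch of the attestation primitive as the abstract describes it, a non-decreasing counter plus a key; the message layout, and the use of an HMAC in place of the signature a real trinket would produce, are illustrative assumptions.

```python
import hashlib
import hmac

class Trinket:
    """Sketch of a TrInc-style trusted component: just a non-decreasing
    counter and a key."""
    def __init__(self, secret_key: bytes):
        self._key = secret_key      # never leaves the trusted hardware
        self._counter = 0

    def attest(self, new_counter: int, message: bytes) -> bytes:
        # Refusing to move backwards is what makes equivocation detectable:
        # no two attestations can ever claim the same counter interval.
        if new_counter < self._counter:
            raise ValueError("counter must be non-decreasing")
        old, self._counter = self._counter, new_counter
        blob = b"%d:%d:" % (old, new_counter) + hashlib.sha256(message).digest()
        return hmac.new(self._key, blob, hashlib.sha256).digest()

# Binding a statement to counter interval (0, 1] makes it once-in-a-lifetime:
t = Trinket(b"sealed-in-hardware")
proof = t.attest(1, b"block #7 has hash abc...")
# A conflicting statement cannot reuse that interval: a second attest(1, ...)
# call would cover (1, 1], so peers can tell the two statements apart.
```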
|
Leveraging Legacy Code to Deploy Desktop Applications on the Web.
Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), December 2008. abstract
Xax is a browser plugin model that enables developers to leverage existing tools, libraries, and entire programs to deliver feature-rich applications on the web. Xax employs a novel combination of mechanisms that collectively provide security, OS-independence, performance, and support for legacy code. These mechanisms include memory-isolated native code execution behind a narrow syscall interface, an abstraction layer that provides a consistent binary interface across operating systems, system services via hooks to existing browser mechanisms, and lightweight modifications to existing tool chains and code bases. We demonstrate a variety of applications and libraries from existing code bases, in several languages, produced with various tool chains, running in multiple browsers on multiple operating systems. With roughly two person-weeks of effort, we ported 3.3 million lines of code to Xax, including a PDF viewer, a Python interpreter, a speech synthesizer, and an OpenGL pipeline.
|
Donnybrook: Enabling Large-Scale, High-Speed, Peer-to-Peer Games.
Proceedings of the Conference on Computer Communications (SIGCOMM), August 2008. abstract
Without well-provisioned dedicated servers, modern fast-paced action games limit the number of players who can interact simultaneously to 16-32. This is because interacting players must frequently exchange state updates, and high player counts would exceed the bandwidth available to participating machines. In this paper, we describe Donnybrook, a system that enables epic-scale battles without dedicated server resources, even in a fast-paced game with tight latency bounds. It achieves this scalability through two novel components. First, it reduces bandwidth demand by estimating what players are paying attention to, thereby enabling it to reduce the frequency of sending less important state updates. Second, it overcomes resource and interest heterogeneity by disseminating updates via a multicast system designed for the special requirements of games: that they have multiple sources, are latency-sensitive, and have frequent group membership changes. We present user study results using a prototype implementation based on Quake III that show our approach provides a desirable user experience. We also present simulation results that demonstrate Donnybrook's efficacy in enabling battles of up to 900 players.
|
A Five-Year Study of File-System Metadata.
ACM Transactions on Storage 3(3), October 2007. abstract
For five years, we collected annual snapshots of file-system metadata from over 60,000 Windows PC file systems in a large corporation. In this article, we use these snapshots to study temporal changes in file size, file age, file-type frequency, directory size, namespace structure, file-system population, storage capacity and consumption, and degree of file modification. We present a generative model that explains the namespace structure and the distribution of directory sizes. We find significant temporal trends relating to the popularity of certain file types, the origin of file content, the way the namespace is used, and the degree of variation among file systems, as well as more pedestrian changes in sizes and capacities. We give examples of consequent lessons for designers of file systems and related software.
|
Enhancing Game-Server AI with Distributed Client Computation.
Proceedings of the International Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV), June 2007. abstract
In the context of online role-playing games, we evaluate offloading AI computation from game servers to game clients. In this way, the aggregate resources of thousands of participating client machines can enhance game realism in a way that would be prohibitively expensive on a central server. Because offloading can add significant latency to a computation normally executing within a game server's main loop, we introduce the mechanism of AI partitioning: splitting an AI into a high-frequency but computationally simple component on the server, and a low-frequency but computationally intensive component offloaded to a client. By designing the client-side component to be stateless and deterministic, this approach also facilitates rapid handoff, preemptive migration, and replication, which can address the problems of client failure and exploitation. To explore this approach, we develop an improved AI for tactical navigation, a challenging task to offload because it is highly sensitive to latency. Our improvement is based on calculating influence fields, partitioned into server-side and client-side components by means of a Taylor series approximation. Experiments on a Quake-based prototype demonstrate that this approach can substantially improve the AI's abilities, even with server-client-server latencies up to one second.
|
Maximizing Total Upload in Latency-Sensitive P2P Applications.
Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), June 2007. abstract
Motivated by an application in distributed gaming, we define and study the latency-constrained total upload maximization problem. In this problem, a peer-to-peer overlay network is modeled as a complete graph and each node v_i has an upload bandwidth capacity c_i and a set of receivers R(i). Each sender-receiver pair (v_i, v_j), where v_j is in R(i), is a request that should be satisfied, i.e., v_i should send a data packet to each v_j in R(i). The goal is to find a set of at most n multicast trees T_i of depth at most 2, such that each node can be part of multiple trees, all capacity constraints are met, and the number of satisfied requests is maximized. In this paper, we prove that the problem is NP-complete, and we present an algorithm with approximation ratio 1 - 2/sqrt(c_min), where c_min is the minimum upload capacity. Finally, we also study the impact of network coding on the quality and approximability of the solution.
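Paraphrasing the abstract (not the paper's exact formulation), the optimization can be written as:

```latex
\max \sum_{i}\ \sum_{v_j \in R(i)} x_{ij}
\quad \text{s.t.} \quad
\sum_{i} \deg^{+}_{T_i}(v) \;\le\; c_v \ \ \forall v,
\qquad
x_{ij} = 1 \iff v_j \text{ appears in } T_i,
```

where each T_i is a multicast tree rooted at sender v_i with depth at most 2, and deg^+_{T_i}(v) counts the copies of v_i's packet that node v uploads within T_i.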
|
Scaling Peer-to-Peer Games in Low-Bandwidth Environments.
Proceedings of the International Workshop on Peer-to-Peer Systems (IPTPS), February 2007. abstract
In peer-to-peer multiplayer games, each peer must send periodic updates of its objects to other peers. Since typical broadband users have little upload bandwidth, updates to each player will be infrequent when there are many players in the game. This leads to choppy and unsatisfying gameplay. Therefore, we propose three techniques to compensate for low upload bandwidth in peer-to-peer games: focus sets, pairwise rapid agreement, and guidable AI. To test these techniques, we implement them and conduct a user study that evaluates the resulting game. We find that our techniques make a game played with low bandwidth significantly more fun than existing techniques, and nearly as much fun as one played on a LAN. Thus, they enable an order of magnitude more players than existing techniques.
|
A Five-Year Study of File-System Metadata.
Proceedings of the USENIX Conference on File and Storage Technologies (FAST), Awarded "2021 Test of Time Award", February 2007. abstract
For five years, we collected annual snapshots of file-system metadata from over 60,000 Windows PC file systems in a large corporation. In this paper, we use these snapshots to study temporal changes in file size, file age, file-type frequency, directory size, namespace structure, file-system population, storage capacity and consumption, and degree of file modification. We present a generative model that explains the namespace structure and the distribution of directory sizes. We find significant temporal trends relating to the popularity of certain file types, the origin of file content, the way the namespace is used, and the degree of variation among file systems, as well as more pedestrian changes in sizes and capacities. We give examples of consequent lessons for designers of file systems and related software.
|
SubVirt: Implementing Malware with Virtual Machines.
Proceedings of the IEEE Symposium on Security and Privacy (S&P), May 2006. abstract
Attackers and defenders of computer systems both strive to gain complete control over the system. To maximize their control, both attackers and defenders have migrated to low-level, operating system code. In this paper, we propose a new type of malicious software which gains qualitatively more control over a system. This new type of malware, which we call a hypervirus, installs a virtual-machine monitor underneath an existing operating system and hoists the original operating system into a virtual machine. Hyperviruses are hard to detect and remove because their state cannot be accessed by software running in the target system. Further, hyperviruses support general-purpose malicious services by allowing such services to run in a separate operating system that is protected from the target system. We explore this new threat by implementing two prototype hyperviruses. We use our prototype hyperviruses to subvert Windows XP and Linux target systems, and we implement four example malicious services using the hypervirus platform. Last, we use what we learn from our prototype hyperviruses to explore ways to defend against this new threat.
|
The SMART Way to Migrate Replicated Stateful Services.
Proceedings of the European Conference on Computer Systems (Eurosys), April 2006. abstract slides
Many stateful services use the replicated state machine approach for high availability. In this approach, a service runs on multiple machines to survive machine failures. This paper describes SMART, a new technique for changing the set of machines where such a service runs, i.e., migrating the service. SMART improves upon existing techniques in three important ways. First, SMART allows migrations that replace non-failed machines. Thus, SMART enables load balancing and lets an automated system replace failed machines. Such autonomic migration is an important step toward full autonomic operation, in which administrators play a minor role and need not be available twenty-four hours a day, seven days a week. Second, SMART can pipeline concurrent requests, a useful performance optimization. Third, prior published migration techniques are described in insufficient detail to admit implementation, whereas our description of SMART is complete. In addition to describing SMART, we also demonstrate its practicality by implementing it, evaluating our implementation's performance, and using it to build a consistent, replicated, migratable file system. Our experiments demonstrate the performance advantage of pipelining concurrent requests, and show that migration has only a minor and temporary effect on performance.
|
PACE: A New Approach to Dynamic Voltage Scaling.
IEEE Transactions on Computers 53(7), July 2004. abstract
By dynamically varying CPU speed and voltage, it is possible to save significant amounts of energy while still meeting prespecified soft or hard deadlines for tasks; numerous algorithms have been published with this goal. We show that it is possible to modify any voltage scaling algorithm to minimize energy use without affecting perceived performance and present a formula to do so optimally. Because this formula specifies increased speed as the task progresses, we call this approach PACE (Processor Acceleration to Conserve Energy). This optimal formula depends on the probability distribution of the task's work requirement and requires that the speed be varied continuously. We therefore present methods for estimating the task work distribution and evaluate how effective they are on a variety of real workloads. We also show how to approximate the optimal continuous schedule with one that changes speed a limited number of times. Using these methods, we find we can apply PACE practically and efficiently. Furthermore, PACE is extremely effective: Simulations using real workloads and the standard model for energy consumption as a function of voltage show that PACE can reduce the CPU energy consumption of existing algorithms by up to 49.5 percent, with an average of 20.6 percent, without any effect on perceived performance. The consequent PACE-modified algorithms reduce CPU energy consumption by an average of 65.4 percent relative to no dynamic voltage scaling, as opposed to only 54.3 percent without PACE.
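To see why the optimal schedule accelerates, here is a calculus-of-variations sketch under the common assumptions that energy per cycle grows quadratically with speed (power cubically) and that the schedule must accommodate the maximum possible work W by the deadline; the paper derives the result in greater generality.

```latex
\min_{s(\cdot)} \int_0^{W} s(w)^2 \bigl(1 - F(w)\bigr)\, dw
\quad \text{s.t.} \quad
\int_0^{W} \frac{dw}{s(w)} = D
\quad\Longrightarrow\quad
s(w) \;\propto\; \bigl(1 - F(w)\bigr)^{-1/3}
```

Here F is the CDF of the task's work requirement, s(w) the speed after completing w cycles, and D the deadline; cycle w costs energy only if the task actually needs it, hence the 1 - F(w) weight. Since 1 - F(w) shrinks as the task progresses, the optimal speed grows: that is the acceleration PACE is named for.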
|
Using User Interface Event Information in Dynamic Voltage Scaling Algorithms.
Proceedings of the IEEE/ACM International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems (MASCOTS), October 2003. abstract slides
Increasingly, mobile computers use dynamic voltage scaling (DVS) to reduce CPU voltage and speed and thereby increase battery life. To determine how to change voltage and speed when responding to user interface events, we analyze traces of real user workloads. We evaluate a new heuristic for inferring when user interface tasks complete and find it is more efficient and nearly as effective as other approaches. We compare DVS algorithms and find that for a given performance level, the PACE algorithm uses the least energy and the Stepped algorithm uses the second least. We find that different types of user interface event (mouse movements, mouse clicks, and keystrokes) trigger tasks with significantly different CPU use, suggesting one should use different speeds for different event types. We also find differences in CPU use between categories of the same event type, e.g., between pressing spacebar and pressing enter, and between events of different applications. Thus, it is better to predict task CPU use based solely on tasks of the same category and application. However, energy savings from such improved predictions are small.
|
Operating System Modifications for Task-Based Speed and Voltage Scheduling.
Proceedings of the International Conference on Mobile Systems, Applications, and Services (MobiSys), May 2003. abstract slides
This paper describes RightSpeed, a task-based speed and voltage scheduler for Windows 2000. It takes advantage of the ability of certain processors, such as those from Transmeta and AMD, to dynamically change speed and voltage and thus to save energy while running more slowly. RightSpeed uses PACE, an algorithm that computes the most energy efficient way to meet task deadlines with high probability. Since most applications do not provide enough data about tasks, such as task deadlines, for PACE to work, RightSpeed uses simple and efficient heuristics to automatically detect task characteristics for such applications. RightSpeed has only 1.2% background overhead and its operations take only a few microseconds each. It even performs PACE calculation, which is quite complicated, in only 4.4 µs on average due to our extensive optimizations. RightSpeed is effective at meeting performance targets set by applications to within 1.5%. Although the PACE calculator does not save energy for the current generation of processors due to their limited range of worthwhile speed and voltage settings, we expect future processors to have greater such ranges, enabling PACE to reduce CPU energy consumption by 6.1-8.7% relative to the best standard algorithm. Furthermore, with PACE, giving a processor the ability to run at additional, higher speeds and voltages reduces overall energy consumption.
|
FARSITE: Federated, Available, and Reliable Storage for an Incompletely Trusted Environment.
Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), December 2002. abstract
Farsite is a secure, scalable file system that logically functions as a centralized file server but is physically distributed among a set of untrusted computers. Farsite provides file availability and reliability through randomized replicated storage; it ensures the secrecy of file contents with cryptographic techniques; it maintains the integrity of file and directory data with a Byzantine-fault-tolerant protocol; it is designed to be scalable by using a distributed hint mechanism and delegation certificates for pathname translations; and it achieves good performance by locally caching file data, lazily propagating file updates, and varying the duration and granularity of content leases. We report on the design of Farsite and the lessons we have learned by implementing much of that design.
|
Operating Systems Techniques for Reducing Processor Energy Consumption.
Ph.D. thesis, University of California, Berkeley, December 2001. abstract
In the last decade, limiting computer energy consumption has become a pervasive goal in computer design, largely due to growing use of portable and embedded computers with limited battery capacities. This work concerns ways to reduce processor energy consumption, since the processor consumes much of a computer's energy. Our specific contributions are as follows.
First, we introduce our thesis that operating systems should have a significant role in processor energy management. The operating system knows what threads and applications are running, and can predict their future requirements based on their past usage and their user interaction. We motivate using software to control energy management decisions by describing how software has traditionally been applied to this regime. Next, we describe operating system techniques for increasing processor sleep time. We suggest never running blocked processes, and delaying processes that execute without producing output or otherwise signaling useful activity. These techniques reduce CPU energy by 47--66%.
Next, we address ways to dynamically change a processor's speed and voltage. We suggest considering what tasks the system is working on and their performance needs, then using a speed schedule that just meets those needs. We show that the optimal schedule increases speed as a task progresses according to a formula dependent on the probability distribution of task CPU requirement. Such a schedule can reduce CPU energy consumption by 20.6% on average, with no effect on performance.
Next, we analyze real user workloads to evaluate ways to infer task information from observations of user interface events. We find that observable differences in such events have significant effects on CPU usage. Using such information in estimating the probability distribution of task CPU requirements can reduce energy consumption by a further 0.5--1.5%.
Finally, we implement our methods. We deal with I/O wait time, overlap of multiple simultaneous tasks, limited speed/voltage settings, limited timer granularity, and limited ability to modify an operating system. The resulting task-based scheduler implements our energy-saving methods with 1.2% background overhead. We find that our methods will be more effective on future processors capable of a wider range of speeds than modern processors.
|
Improving Dynamic Voltage Scaling Algorithms with PACE.
Proceedings of the ACM SIGMETRICS Conference, June 2001. abstract slides
This paper addresses algorithms for dynamically varying (scaling) CPU speed and voltage in order to save energy. Such scaling is useful and effective when it is immaterial when a task completes, as long as it meets some deadline. We show how to modify any scaling algorithm to keep performance the same but minimize expected energy consumption. We refer to our approach as PACE (Processor Acceleration to Conserve Energy) since the resulting schedule increases speed as the task progresses. Since PACE depends on the probability distribution of the task's work requirement, we present methods for estimating this distribution and evaluate these methods on a variety of real workloads. We also show how to approximate the optimal schedule with one that changes speed a limited number of times. Using PACE causes very little additional overhead, and yields substantial reductions in CPU energy consumption. Simulations using real workloads show it reduces the CPU energy consumption of previously published algorithms by up to 49.5%, with an average of 20.6%, without any effect on performance.
|
The VTrace Tool: Building a System Tracer for Windows NT and Windows 2000.
MSDN Magazine 15(10), October 2000. abstract code
This article describes the techniques used to construct VTrace, a system tracer for Windows NT and Windows 2000. VTrace collects data about processes, threads, messages, disk operations, network operations, and devices. The technique uses a DLL loaded into the address space of every process to intercept Win32 system calls; establishes hook functions for Windows NT kernel system calls; modifies the context switch code in memory to log context switches; and uses device filters to log accesses to devices.
|
A Comparison of File System Workloads.
Proceedings of the USENIX Annual Technical Conference (ATC), June 2000. abstract
In this paper, we describe the collection and analysis of file system traces from a variety of different environments, including both UNIX and NT systems, clients and servers, and instructional and production systems. Our goal is to understand how modern workloads affect the ability of file systems to provide high performance to users. Because of the increasing gap between processor speed and disk latency, file system performance is largely determined by its disk behavior. Therefore we primarily focus on the disk I/O aspects of the traces. We find that more processes access files via the memory-map interface than through the read interface. However, because many processes memory-map a small set of files, these files are likely to be cached. We also find that file access has a bimodal distribution pattern: some files are written repeatedly without being read; other files are almost exclusively read. We develop a new metric for measuring file lifetime that accounts for files that are never deleted. Using this metric, we find that the average block lifetime for some workloads is significantly longer than the 30-second write delay used by many file systems. However, all workloads show lifetime locality: the same files tend to be overwritten multiple times.
|
Energy Consumption of Apple Macintosh Computers.
IEEE Micro 18(6), November 1998. abstract
The utility of a portable computer is critically dependent on the period it can be used while running off the battery. In this paper, we present a study of power consumption in Apple Macintosh computers. We measure the existing power consumption for each system component using built-in measuring tools. Since total power consumption is a function of user workload, we use eight user workload traces to determine power use as observed in practice. Apple currently implements some simple power-saving features, and the effectiveness of those features is estimated; we find typical power savings of 41-66%. After the use of basic power-saving techniques, we find that the major power users are the backlight (25-26%), the CPU (9-25%), the display (4-17%), the video circuitry (6-10%), and the hard drive (4-9%). We then evaluate possible changes in system hardware and software with regard to the power savings they might offer.
|
Software Strategies for Portable Computer Energy Management.
IEEE Personal Communications Magazine 5(3), June 1998. abstract
Limiting the energy consumption of computers, especially portables, is becoming increasingly important. Thus, new energy-saving computer components and architectures have been and continue to be developed. Many architectural features have both high performance and low power modes, with the mode selection under software control. The problem is to minimize energy consumption while not significantly impacting the effective performance. We group the software control issues as follows: transition, load-change, and adaptation. The transition problem is deciding when to switch to low-power, reduced-functionality modes. The load-change problem is determining how to modify the load on a component so that it can make further use of its low-power modes. The adaptation problem is how to create software that allows components to be used in novel, power-saving ways. We survey implemented and proposed solutions to software energy management issues created by existing and suggested hardware innovations.
|
Scheduling Techniques for Reducing Processor Energy Use in MacOS.
Wireless Networks 3(5), October 1997. abstract
The CPU is one of the major power consumers in a portable computer, and considerable power can be saved by turning off the CPU when it is not doing useful work. In Apple's MacOS, however, idle time is often converted to busy waiting, and generally it is very hard to tell when no useful computation is occurring. In this paper, we suggest several heuristic techniques for identifying this condition, and for temporarily putting the CPU in a low-power state. These techniques include turning off the processor when all processes are blocked, turning off the processor when processes appear to be busy waiting, and extending real-time process sleep periods. We use trace-driven simulation, using processor run interval traces, to evaluate the potential energy savings and performance impact. We find that these techniques save considerable amounts of processor energy (as much as 66%), while having very little performance impact (less than 2% increase in run time). Implementing the proposed strategies should increase battery lifetime by approximately 20% relative to Apple's current CPU power management strategy, since the CPU and associated logic are responsible for about 32% of power use; similar techniques should be applicable to operating systems with similar behavior.
|
Reducing Processor Power Consumption by Improving Processor Time Management in a Single-User Operating System.
Proceedings of the ACM International Conference on Mobile Computing and Networking (MOBICOM), November 1996. abstract slides
The CPU is one of the major power consumers in a portable computer, and considerable power can be saved by turning off the CPU when it is not doing useful work. In Apple's MacOS, however, idle time is often converted to busy waiting, and generally it is very hard to tell when no useful computation is occurring. In this paper, we suggest several heuristic techniques for identifying this condition, and for temporarily putting the CPU in a low-power state. These techniques include turning off the processor when all processes are blocked, turning off the processor when processes appear to be busy waiting, and extending real-time process sleep periods. We use trace-driven simulation, using processor run interval traces, to evaluate the potential energy savings and performance impact. We find that these techniques save considerable amounts of processor energy (as much as 66%), while having very little performance impact (less than 2% increase in run time). Implementing the proposed strategies should increase battery lifetime by approximately 20% relative to Apple's current CPU power management strategy, since the CPU and associated logic are responsible for about 32% of power use; similar techniques should be applicable to operating systems with similar behavior.
|
Making World Wide Web Caching Servers Cooperate.
Proceedings of the International World Wide Web Conference (WWW), December 1995. abstract
Due to its exponential growth, the World Wide Web is increasingly experiencing several problems, such as hot spots, increased network bandwidth usage, and excessive document retrieval latency. The standard solution to these problems is to use a caching proxy. However, a single caching proxy is a bottleneck: there is a limit to the number of clients that can use the same cache, which limits the cache's effectiveness. Also, a caching proxy is a single point of failure. We address these problems by creating a protocol that allows multiple caching proxies to cooperate and share their caches, thus increasing robustness and scalability. This scalability, in turn, gives each client an effectively larger cache with a higher hit rate. This paper describes a prototype implementation of this protocol that uses IP multicast to communicate between the servers.
|
A Complete Picture of the Energy Consumption of a Portable Computer.
Master's thesis, University of California, Berkeley, December 1995. abstract
High battery lifetime is important to the usability and acceptance of portable computers. In order to develop strategies to minimize power consumption, designers need a good picture of the total power consumption of a system. For this purpose, we indicate the power consumptions of a set of portable computers and how they are broken down among the components of these computers. Then, we use user profiles to show how the use of power-saving features currently implemented serves to reduce these power consumptions by 35--56%. We also show how these power-saving features affect the breakdown of overall power consumption, so that we can evaluate how successful certain new software techniques and hardware changes would be at reducing power consumption. The results of this paper point out the most promising avenues for further work in the reduction of power consumption, and indicate some strategies that can provide an immediate power benefit to the class of machines studied.
|
Can the Fractal Dimension of Images be Measured?
Pattern Recognition 27(3), March 1994. abstract
Fractal dimension is a popular parameter for explaining certain phenomena and for describing natural textures. The problem of estimating the fractal dimension of a profile or an image is more difficult and devious than theory suggests. This paper studies the accuracy and robustness of two common estimators of fractal dimension (box counting and the variation method) using two types of data (Brownian and Takagi). Poor results are demonstrated from applying theory directly, called naive estimation. Data is then interpreted in the most optimistic way possible by matching the estimator to the known fractal dimension. Experiments quantify the effects of resolution, or fineness of sampling, and quantization, or rounding of sampled values. Increasing resolution enhances the estimators when true dimension, D, is large, but may, possibly due to quantization effects, degrade estimators when D is small. Quantization simply causes shifts in estimates. The results suggest that one should not place much reliance in the absolute value of a fractal estimate, but that the estimates do vary monotonically with D and might be useful descriptors in tasks such as image segmentation and description.
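For concreteness, a minimal NumPy sketch of the box-counting estimator the paper evaluates, applied directly from theory (the "naive estimation"); the scale choices here are arbitrary.

```python
import numpy as np

def box_counting_dimension(image: np.ndarray, scales=(1, 2, 4, 8, 16, 32)):
    """Naive box-counting estimate of fractal dimension for a binary image:
    count occupied s-by-s boxes N(s) at several scales s, then fit the slope
    of log N(s) against log(1/s)."""
    counts = []
    for s in scales:
        h, w = (image.shape[0] // s) * s, (image.shape[1] // s) * s
        blocks = image[:h, :w].reshape(h // s, s, w // s, s)
        counts.append(blocks.any(axis=(1, 3)).sum())   # occupied boxes
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(scales)), np.log(counts), 1)
    return slope

# Sanity check: a filled square is 2-dimensional, so D should come out near 2.
img = np.zeros((256, 256), dtype=bool)
img[64:192, 64:192] = True
print(round(box_counting_dimension(img), 2))
```

The paper's caution applies directly to this sketch: the fitted slope depends on resolution and quantization, so the absolute estimate should be treated as a relative descriptor rather than a trustworthy measurement of D.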
|