Integrating scale out and fault tolerance in stream processing using operator state management

Castro Fernandez, Raul and Migliavacca, Matteo and Kalyvianaki, Evangelia and Pietzuch, Peter (2013) Integrating scale out and fault tolerance in stream processing using operator state management. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, New York, USA, pp. 725-736. ISBN 978-1-4503-2037-5. (doi:10.1145/2463676.2465282) (Access to this publication is currently restricted. You may be able to access a copy if URLs are provided) (KAR id:36276)

PDF: Author's Accepted Manuscript. Language: English. Restricted to Repository staff only.

Official URL: http://dx.doi.org/10.1145/2463676.2465282

Abstract

As users of "big data" applications expect fresh results, we witness a new breed of stream processing systems (SPS) that are designed to scale to large numbers of cloud-hosted machines. Such systems face new challenges: (i) to benefit from the "pay-as-you-go" model of cloud computing, they must scale out on demand, acquiring additional virtual machines (VMs) and parallelising operators when the workload increases; (ii) failures are common with deployments on hundreds of VMs, so systems must be fault-tolerant with fast recovery times, yet low per-machine overheads. An open question is how to achieve these two goals when stream queries include stateful operators, which must be scaled out and recovered without affecting query results.

Our key idea is to expose internal operator state explicitly to the SPS through a set of state management primitives. Based on them, we describe an integrated approach for dynamic scale out and recovery of stateful operators. Externalised operator state is checkpointed periodically by the SPS and backed up to upstream VMs. The SPS identifies individual operator bottlenecks and automatically scales them out by allocating new VMs and partitioning the checkpointed state. At any point, failed operators are recovered by restoring checkpointed state on a new VM and replaying unprocessed tuples. We evaluate this approach with the Linear Road Benchmark on the Amazon EC2 cloud platform and show that it can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures.
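
The checkpoint, partition and restore-and-replay steps described above can be illustrated with a minimal sketch. It is not the authors' implementation or API: the names `Checkpoint`, `CountingOperator`, `route`, `partition` and `recover` are hypothetical, the operator is a toy per-key counter, tuples are assumed to carry increasing integer timestamps, and the upstream buffer is modelled as a plain list.

```python
import zlib
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Checkpoint:
    ts: int                # timestamp of the last tuple reflected in the state
    state: Dict[str, int]  # copy of the operator's keyed state


class CountingOperator:
    """Hypothetical stateful operator that counts tuples per key."""

    def __init__(self, state=None, last_ts=-1):
        self.state = dict(state or {})
        self.last_ts = last_ts

    def process(self, ts: int, key: str) -> None:
        self.state[key] = self.state.get(key, 0) + 1
        self.last_ts = ts

    def checkpoint(self) -> Checkpoint:
        # Externalise the state as a copy, so it can be backed up elsewhere.
        return Checkpoint(self.last_ts, dict(self.state))


def route(key: str, n: int) -> int:
    # Deterministic hash routing (an assumption of this sketch).
    return zlib.crc32(key.encode("utf-8")) % n


def partition(cp: Checkpoint, n: int) -> List[Checkpoint]:
    # Split checkpointed state across n parallel instances of the operator.
    parts = [Checkpoint(cp.ts, {}) for _ in range(n)]
    for key, value in cp.state.items():
        parts[route(key, n)].state[key] = value
    return parts


def recover(cp: Checkpoint, buffered: List[Tuple[int, str]]) -> CountingOperator:
    # Restore the checkpoint, then replay only tuples it does not yet reflect.
    op = CountingOperator(cp.state, cp.ts)
    for ts, key in buffered:
        if ts > cp.ts:
            op.process(ts, key)
    return op
```

In this sketch, scale out and recovery share one mechanism: both start from a checkpoint, and only the way the state is distributed afterwards differs.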

Item Type:	Book section
DOI/Identification number:	10.1145/2463676.2465282
Additional information:	CW contacted publisher to request permission for FT 20/01/14.
Subjects:	Q Science > QA Mathematics (inc Computing science) > QA 76 Software, computer programming
Institutional Unit:	Schools > School of Computing
Former Institutional Unit:	Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Computing
Depositing User:	Matteo Migliavacca
Date Deposited:	13 Nov 2013 13:10 UTC
Last Modified:	28 Apr 2026 07:59 UTC
Resource URI:	https://kar.kent.ac.uk/id/eprint/36276 (The current URI for this page, for reference purposes)

University of Kent Author Information

Migliavacca, Matteo.

Creator's ORCID:	https://orcid.org/0000-0002-5684-4865
