CIATEQ

2 Projects
  • Funder: UK Research and Innovation
    Project Code: EP/P031617/1
    Funder Contribution: 96,598 GBP

    Distributed systems form the foundation of Internet infrastructure and are critical for fulfilling the technological and societal needs of the digital age. Comprising Cloud datacenters, compute clusters, and the Internet of Things, these systems are responsible for the effective provisioning and execution of a multitude of parallelizable applications. The increased complexity and scale of these systems has resulted in an emergent phenomenon that substantially degrades overall system performance and cannot be solved by simply increasing the number of compute nodes. This phenomenon is known as The Long Tail Problem, whereby task stragglers - a small subset of tasks that execute abnormally slowly - impede overall job completion time. It is systemic to all distributed systems that operate at sufficient scale. While work within this area attempts to address this problem through straggler detection or mitigation, its effectiveness depends on understanding the precise underlying causes of straggler manifestation and, importantly, determining which system conditions influence their occurrence. However, achieving this understanding is extremely challenging given the multitude of possible straggler root-causes, all of which can stem from diverse sub-system operational characteristics and their interactions with other sub-systems. As current understanding of straggler manifestation is restricted to qualitative, high-level detail, it is presently impossible to determine which system operational conditions (e.g. cluster resource contention, temperature, failures) are highly likely to create a "perfect storm" for straggler occurrence.

    Determining the system conditions that influence the probability of straggler occurrence in different operational scenarios is vital for achieving predictable and rapid parallel application execution, given the continued increase in system size and complexity. The vision of this proposed research is to address our limited understanding of straggler manifestation by conducting in-depth analysis and modelling of Internet-based distributed systems to quantify the precise relationship between straggler occurrence and system behaviour. This study will involve analysis and modelling of stragglers within real systems, performed through comprehensive experimentation to identify and extract key system parameters from virtual and physical sub-system operation across the entire distributed system architecture. A framework capable of automated analysis will be constructed to determine straggler root-cause within production systems; it will interface with an event-based simulation engine for determining the optimal system conditions for avoiding stragglers. By working with leading international industrialists in massive-scale distributed systems, this work represents a significant step change towards solving The Long Tail Problem by providing much sought-after knowledge to truly understand straggler manifestation. As this problem is systemic across every type of large-scale distributed system, this work will have far-reaching implications for both academia and industry, and will directly benefit the competitiveness of the UK's digital economy in the short and long term. This grant represents the first step towards realizing the research ambition of scientifically understanding the operation of massive-scale Internet infrastructure, enabling the design of fault-tolerant techniques for future systems at unprecedented scale - a crucial objective towards realizing key emergent technologies for the future.

  • Funder: UK Research and Innovation
    Project Code: EP/V007092/1
    Funder Contribution: 1,167,040 GBP

    ICT now consumes approximately 10% of global electricity. Large-scale ICT systems such as Cloud datacentres, IoT, and HPC systems generate a substantial ICT footprint in terms of energy consumption and GHG emissions, and are growing contributors to climate change. Researchers across Computer Science and various engineering disciplines have predominantly tackled this problem by enhancing the energy-efficiency of individual components (software, servers, networking, cooling) through improvements to scheduling, software optimisation, hardware, and cooling. However, enhancing system component efficiency has still resulted in a growing global ICT footprint - more data, greater compute ability, and more devices. This is due to the rebound effect, whereby technological progress enhances system efficiency but also increases the rate of consumption and end-use demand. This is of increasing concern given the end of Moore's law, growing global digital service consumption, and the rise of Big Data and AI services in society, which together result in a rapidly increasing ICT footprint. It is no longer possible to rely on the conventional perception that 'green' large-scale ICT systems can be achieved solely by improving component energy-efficiency. There needs to be a focused effort to actually reverse the global ICT footprint. We believe that this problem is not insurmountable, but that it requires a radical rethink of how large-scale ICT systems are designed and operated. A system's ICT footprint is a by-product of its operation; we propose to invert this dynamic, so that system operation is instead a by-product of, and directly dictated by, its ICT footprint.

    What is required isn't greater efficiency, but instead precise control over how ICT systems operate and respond to energy levels and footprint targets. This is a significant research challenge given the sheer scale and complexity of understanding the relationship between ICT footprint manifestation, component interactions, and the impact of organisational sustainability practices. The challenge is further compounded by potential resistance from organisations that may champion commercial profits over environmental concerns. However, overcoming this challenge would allow ICT system operation to be directly matched to energy generated from renewable sources, to adhere to specified GHG emission targets defined at organisational or national level, or to dynamically align with an organisation's commercial targets or OpEx restrictions. This fellowship will design a large-scale ICT system capable of self-adapting its operation in response to energy availability and ICT footprint targets. This specifically entails: (1) Studying the causes of ICT footprint manifestation within technology organisations, and understanding the rationale and impact of enacting sustainability practices. (2) Determining and modelling the precise relationship between complex ICT component interactions and the resultant ICT footprint. (3) Designing a self-adaptive framework that coordinates ICT energy-efficient decision making holistically. (4) Creating a holistic resource manager underpinned by energy availability and ICT footprint targets. This fellowship is backed by a consortium of industrial and academic Computer Science and sustainability collaborators in the UK and beyond, and will be underpinned by considerable empirical analysis and experimentation in both production and laboratory CPU/GPU-based datacentre and HPC systems. Findings from this fellowship are potentially groundbreaking for the design of future digital infrastructure in the face of environmental change.

    Our key outcomes include:
    - Reducing ICT system energy use by 25-50% with no software performance penalty.
    - Demonstrating the feasibility of reversing global ICT footprint growth by unshackling system operation from the rebound effect.
    - Releasing the largest in-depth operational and energy data from real-world ICT systems.
