Machine learning (ML) powered technologies are shaping many aspects of our daily lives. These technologies rely on big datasets, which are processed iteratively to build a knowledge model. During training, the model and its associated computational state are stored in CPU-attached or on-board accelerator DRAM. To solve increasingly complex problems in the coming decade, we expect model sizes, and hence DRAM memory requirements, to grow by multiple orders of magnitude, as larger models are typically associated with better accuracy. However, DRAM technology is not improving in capacity, cost, and density at the rate needed to accommodate the large-model training demands of the next decade, due to manufacturing limitations and software overheads. Furthermore, DRAM-based training is energy inefficient and expensive, putting large-scale model training out of reach for many scientists and users. If we are to sustain innovation in and through ML technologies, we urgently need an alternative system design.

In this project, I propose to use an emerging storage technology, Non-Volatile Memory (NVM), to meet the memory requirements of large-model training. NVM technologies such as flash and Optane use physical properties of a material to store data and are significantly more cost- and energy-efficient than DRAM. Hence, I propose a novel paradigm, “ML-from-Storage” (MLS), that combines DRAM with a distributed system of NVM devices interconnected by high-performance networks, storing models and state in NVM to deliver unprecedented efficiency for large-model ML training. The crucial part of this new paradigm is an entirely new software stack, whose three critical challenges I propose to address in this project: (1) efficient storage of ML state on NVM devices through a customized, ML-specific storage stack; I will leverage emerging Open-Channel SSD devices and make critical conceptual contributions in efficient data layout, placement, and serialization; (2) end-to-end timely state access in a distributed setting through a global timeslot-based scheduling framework; I will address the conceptual challenge of unifying network and NVM I/O scheduling for ML, which has never been done in a distributed storage setting; and (3) ML-compiler-driven (Apache TVM) exploration of the optimization and scheduling policy space using an ML-based cost model, which feeds policies to (1) and (2) in practice.

I will ensure that MLS translates its conceptual and technical contributions into impactful, usable software, demonstrated through ML-driven biomedical image analysis by training large models on high-resolution whole-slide images (WSI) of histological lymph node sections, thus helping pathologists deliver accurate and timely diagnoses. In the short term, MLS will enable cost- and energy-efficient large-model training, democratizing ML for all. In the long term, the scientific results of this work would allow us to fundamentally rethink the split memory-storage hierarchy in computing, and to explore a unified data abstraction over emerging storage and networking technologies in domains such as autonomous driving, logistics, and environmental sensing.
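To make the core MLS idea concrete, the following is a minimal, illustrative sketch (not part of the proposed MLS software stack): it shows, under simplifying assumptions, how optimizer state could live on an NVM-backed file rather than in DRAM during training. The mount point, model size, and update rule are all hypothetical.

```python
# Illustrative sketch only: optimizer state kept on NVM-backed storage
# instead of DRAM. Assumes an NVM device is mounted at NVM_PATH.
import numpy as np

NVM_PATH = "/mnt/nvm/optimizer_state.bin"  # hypothetical NVM mount point
NUM_PARAMS = 1_000_000                     # toy model size

# SGD momentum state is memory-mapped onto NVM rather than held in a
# DRAM-resident array.
momentum = np.memmap(NVM_PATH, dtype=np.float32, mode="w+", shape=(NUM_PARAMS,))

params = np.zeros(NUM_PARAMS, dtype=np.float32)  # weights kept in DRAM

def training_step(grads: np.ndarray, lr: float = 0.01, beta: float = 0.9) -> None:
    """One SGD-with-momentum step; the state update streams through NVM."""
    momentum[:] = beta * momentum + grads  # read-modify-write on NVM-backed state
    params[:] -= lr * momentum
    momentum.flush()                       # persist the updated state

# Example step with random gradients.
training_step(np.random.randn(NUM_PARAMS).astype(np.float32))
```

In a real MLS stack the placement, layout, and serialization of this state would be handled by the ML-specific storage layer described in challenge (1), rather than by a generic memory-mapped file.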
To achieve these goals, I will leverage my expertise across multiple disciplines: NVM storage, high-performance networking, end-to-end performance design, and building customized distributed storage services.
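As a second illustration of the proposal, challenge (2) envisions a global timeslot-based framework that schedules network and NVM I/O together. The toy sketch below assigns I/O requests for a training step to fixed-length timeslots using earliest-deadline-first ordering; the slot length, request names, and deadline model are assumptions, not the MLS scheduler itself.

```python
# Illustrative sketch only: a toy timeslot-based scheduler that places network
# transfers and NVM reads/writes into a shared sequence of fixed-length slots,
# so state arrives before the compute step that consumes it.
from dataclasses import dataclass

SLOT_US = 100  # fixed timeslot length in microseconds (assumed)

@dataclass
class IORequest:
    name: str           # e.g., "layer3.weights"
    kind: str           # "nvm_read", "nvm_write", or "net_transfer"
    duration_us: int    # estimated service time
    deadline_slot: int  # slot of the compute step that needs this data

def schedule(requests: list[IORequest]) -> dict[int, list[str]]:
    """Earliest-deadline-first assignment of requests to timeslots."""
    timeline: dict[int, list[str]] = {}
    next_free_slot = 0
    for req in sorted(requests, key=lambda r: r.deadline_slot):
        slots_needed = -(-req.duration_us // SLOT_US)  # ceiling division
        start = next_free_slot
        for s in range(start, start + slots_needed):
            timeline.setdefault(s, []).append(f"{req.kind}:{req.name}")
        next_free_slot = start + slots_needed
        if next_free_slot > req.deadline_slot:
            print(f"warning: {req.name} misses deadline slot {req.deadline_slot}")
    return timeline

# Example: one NVM read and one network transfer feeding the same compute step.
plan = schedule([
    IORequest("layer3.weights", "nvm_read", 250, deadline_slot=5),
    IORequest("layer3.grads", "net_transfer", 120, deadline_slot=5),
])
for slot, ops in sorted(plan.items()):
    print(slot, ops)
```

The actual MLS framework would derive such schedules and their policies from the compiler-driven exploration described in challenge (3), rather than from a hand-written heuristic.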