Presentations

Here are a few key links for this workshop:

Demos

Exploring Data Science with Arkouda: A Practical Introduction to Scalable Data Science

Work-in-progress: CUDA Python object models and parallelism models

Seamlessly scale your Python program from a single CPU core to a multi-GPU, multi-node HPC cluster with cuNumeric

  • Presenters: Wonchan Lee, Manolis Papadakis, Mike Bauer, Bo Dong

  • SC Presentation

Visualizing Workflows with the Dragon Telemetry Service

Accelerating Python Applications with Dask and ProxyStore

PyOMP: Parallel programming for CPUs and GPUs with OpenMP and Python

Lightning Talks

Accelerated massive data analytics for semiconductors

Presenter: Quynh L. Nguyen

X-ray experiments at the Linac Coherent Light Source (LCLS), SLAC National Accelerator Laboratory, enable new scientific discoveries about matter. An accompanying challenge is extracting important insights from the massive amount of data generated at terabytes per hour so that experiments can be steered effectively; this rate will increase to terabytes per second with our newly commissioned LCLS-II/HE facilities. We developed functions in cuNumeric that are relevant for scientific computing and implemented them for live analysis during an experiment. We found a 6x speedup compared to our routine data analytics using NumPy. With this new approach, we extract comprehensive information on material properties with higher efficiency.
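
To illustrate the drop-in nature of this approach, the sketch below shows a hypothetical detector-batch reduction in which the only change from a plain NumPy implementation is the import line. The specific correction and statistics are illustrative assumptions, not the actual LCLS analysis code.

```python
import cunumeric as np  # drop-in replacement for "import numpy as np"


def reduce_detector_batch(frames, dark, gain, threshold):
    """Apply a dark/gain correction to a batch of detector frames and
    return per-frame summary statistics (illustrative, not the LCLS code)."""
    corrected = (frames - dark[None, :, :]) * gain               # broadcast correction
    corrected = np.where(corrected > threshold, corrected, 0.0)  # suppress readout noise
    flat = corrected.reshape(corrected.shape[0], -1)
    return flat.sum(axis=1), flat.max(axis=1)                    # intensity and peak per frame


if __name__ == "__main__":
    # Synthetic stand-in for a slice of the detector stream: 1024 frames of 512x512 pixels.
    frames = np.random.random((1024, 512, 512))
    dark = np.random.random((512, 512)) * 0.01
    sums, peaks = reduce_detector_batch(frames, dark, gain=1.05, threshold=0.02)
    print(float(sums.mean()), float(peaks.mean()))
    # Run with e.g. `legate --gpus 4 script.py`; cuNumeric partitions the arrays
    # and operations across the available GPUs/nodes without code changes.
```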

In-Transit Machine Learning of Plasma Simulations on Exascale systems

Presenter: Vineeth Gutta

Traditional ML workflows use offline training, where the data is stored on disk and subsequently loaded into accelerator (CPU, GPU, etc.) memory during training or inference. We recently devised a novel and scalable in-transit ML workflow for a plasma-physics application (chosen as one of eight compelling codes in the country) on the world's fastest supercomputer, Frontier, with the aim of building a high-energy laser particle accelerator. This in-transit workflow solves the challenge of coupling full-scale particle-in-cell simulations with distributed ML training in PyTorch using DDP, enabling the model to learn correlations between emitted radiation and particle dynamics within the simulation in an unsupervised manner. Simulations on exascale systems create volumes of data that are infeasible to store on HPC file systems, and the high volume and rate of data generation create a mismatch with modern memory hierarchies. The workflow demonstrates the use of data reduction combined with inversion using invertible neural networks to reconstruct the simulation. We use continuous learning, where data is consumed in batches as the simulation produces it and discarded after each batch is trained on. We demonstrate this at scale on Frontier using 400 AMD MI250X GPUs and show the flexibility of such workflows beyond the plasma-simulation science case, opening up the possibility of running in-transit ML with other surrogate models and foundation models.
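
A minimal sketch of such a continuous, in-transit training loop is shown below, assuming a torchrun launch. The simulation coupling is replaced by a placeholder generator and the model is a toy autoencoder standing in for the invertible-network surrogate, so the names and sizes are illustrative only, not the actual Frontier workflow.

```python
# Launch with e.g. `torchrun --nproc_per_node=8 train_in_transit.py`.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def stream_simulation_batches(n_batches, batch_size, n_features, device):
    """Stand-in for the in-transit data feed: yields batches as the simulation
    produces them; in the real workflow this is a memory/network channel,
    not the file system."""
    for _ in range(n_batches):
        yield torch.randn(batch_size, n_features, device=device)


def main():
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    rank = dist.get_rank()
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

    # Toy autoencoder standing in for the invertible-network surrogate.
    model = torch.nn.Sequential(
        torch.nn.Linear(256, 64), torch.nn.ReLU(), torch.nn.Linear(64, 256)
    ).to(device)
    model = DDP(model, device_ids=[local_rank] if device.type == "cuda" else None)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()

    for step, batch in enumerate(
        stream_simulation_batches(n_batches=100, batch_size=4096,
                                  n_features=256, device=device)
    ):
        opt.zero_grad()
        loss = loss_fn(model(batch), batch)  # unsupervised reconstruction objective
        loss.backward()                      # DDP all-reduces gradients across ranks
        opt.step()
        del batch                            # data is discarded, never written to disk
        if rank == 0 and step % 10 == 0:
            print(f"step {step}: loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```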