Domain-specific accelerators (DSAs) such as GPUs are evolving rapidly, pushing data processing efficiency and throughput ever higher. To utilize GPUs efficiently, various tools and frameworks exist (such as tf.data and DALI for deep learning, and Thrust and ArrayFire for database operators), all aiming to feed the GPUs with data at a bandwidth high enough to saturate them during downstream processing.

On top of the software frameworks and tools, the GPU platforms themselves are being extended with technologies such as high-bandwidth interconnects (NVLink) to overcome data transfer bottlenecks.

In this work, we seek to characterize the interplay between data loading/processing frameworks and CPU-GPU cooperation over NVLink: first, to understand how well each component is utilized, and second, to gauge the potential for higher interconnect utilization within the frameworks.

Potential domains that would benefit from this work include deep learning and databases, where growing data sizes pose challenges for GPU acceleration once datasets exceed GPU memory capacity.
The work seeks to understand how large-scale data processing affects co-processors and how these effects are addressed from both a software and a hardware point of view.

Intended Learning Outcomes:
1. Reflect on the different frameworks and identify the most relevant ones for further experimentation.
2. Identify the most relevant workloads in the context of the challenges outlined above.
3. Design an experimental workflow for measuring the relevant metrics.
4. Carry out an experimental analysis of the different frameworks and tie the findings to the intricacies of the different tools.

Survey the different frameworks related to optimizing data loading/processing:
- Machine Learning: for instance, CoorDL, tf.data, PyTorch DataLoader, DALI (see the pipeline sketch after this list)
- Database Operators: Thrust, ArrayFire
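As a concrete illustration of what these loaders optimize, below is a minimal tf.data input pipeline sketch; the file pattern and the parse step are placeholders rather than part of any fixed experimental setup here.

```python
import tensorflow as tf

# Hypothetical file pattern; any TFRecord corpus would do.
files = tf.data.Dataset.list_files("/data/train-*.tfrecord")

def parse(record):
    # Placeholder decode step; a real pipeline would parse features here.
    return tf.io.parse_tensor(record, out_type=tf.float32)

dataset = (
    tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
    .map(parse, num_parallel_calls=tf.data.AUTOTUNE)  # parallel CPU-side decoding
    .batch(256)
    .prefetch(tf.data.AUTOTUNE)  # overlap host-side loading with accelerator compute
)
```

The prefetch and parallel-call knobs are exactly the kind of pipeline parallelism whose effect on GPU saturation the survey should compare across frameworks.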

Set up an environment for testing and comparing the tools/frameworks.
- For ML, this corresponds to an environment in which one or (potentially) multiple models can be trained with the aforementioned data loading frameworks (a loader sketch follows this list).
- For databases, this could be TPC-H with an increasing scale factor, eventually surpassing GPU memory capacity.
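For the ML side, a minimal sketch of such a training environment is shown below, assuming PyTorch; the synthetic dataset and the tiny linear model are stand-ins for a real on-disk corpus and an actual network.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class SyntheticImages(Dataset):
    """Placeholder dataset; a real setup would read an on-disk corpus."""
    def __len__(self):
        return 100_000

    def __getitem__(self, idx):
        # ImageNet-shaped tensor plus a fake label.
        return torch.randn(3, 224, 224), idx % 1000

loader = DataLoader(
    SyntheticImages(),
    batch_size=256,
    num_workers=8,    # CPU-side parallel loading/decoding
    pin_memory=True,  # pinned buffers enable async host-to-device copies
)

# Tiny stand-in model; the point here is the loader, not the network.
model = torch.nn.Linear(3 * 224 * 224, 1000).cuda()

for step, (images, labels) in enumerate(loader):
    # non_blocking copies can overlap with ongoing GPU compute.
    images = images.cuda(non_blocking=True).flatten(1)
    logits = model(images)
    if step == 10:  # smoke test only, not a full training run
        break
```

Swapping the DataLoader for DALI or a CoorDL-style cache while keeping the model fixed would isolate the data loading contribution to end-to-end throughput.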

Obtain an overview of the hardware utilization within the different tools. It is likely that not all of them cover every step of the workload or exploit the latest technology embedded in the hardware.
- Measure metrics such as data processing throughput, GPU idle/stall time, query execution time, and the overlap between I/O and downstream training compute. Measurements are carried out with respect to the utilization of the underlying interconnect. Both types of workloads should handle data sizes larger than GPU memory (a measurement sketch follows below).
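As a starting point for such measurements, the sketch below times a single host-to-device transfer with CUDA events and samples coarse utilization through NVML; it assumes a CUDA-capable machine with PyTorch and the nvidia-ml-py bindings, and per-link NVLink counters (e.g., via nvidia-smi nvlink) would complement it.

```python
import torch
import pynvml  # NVML bindings; installable as nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Time a host-to-device copy of a pinned 1 GiB buffer with CUDA events.
host = torch.empty(1 << 28, dtype=torch.float32, pin_memory=True)  # 2^28 floats = 1 GiB
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
device = host.to("cuda", non_blocking=True)
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1e3  # elapsed_time() reports milliseconds
gib = host.numel() * host.element_size() / 2**30
print(f"H2D throughput: {gib / seconds:.1f} GiB/s")

# Coarse utilization sample; sustained low util.gpu indicates idle/stall time.
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"GPU util: {util.gpu}%  memory controller util: {util.memory}%")
```

Repeating the transfer on PCIe-only versus NVLink-attached devices, and sampling utilization while a loader is running, yields the interconnect-specific breakdown described above.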