Description

Data engineering is the foundation of data science and lays the groundwork for analysis and modeling. For organizations to extract knowledge and insights from structured and unstructured data, fast access to accurate and complete datasets is critical. Working with massive amounts of data from disparate sources requires complex infrastructure and expertise. Minor inefficiencies can result in major costs, in both time and money, when scaled across millions to trillions of data points.

In this workshop, we'll explore how GPUs can improve data pipelines and how advanced data engineering tools and techniques can deliver significant performance acceleration. Faster pipelines produce fresher dashboards and machine learning (ML) models, so users can have the most current information at their fingertips.

Objectives

In this workshop, you will learn:

How data moves within a computer.
How to strike the right balance between CPU, DRAM, disk storage, and GPUs.
How different file formats can be read and manipulated by hardware.
How to scale an ETL pipeline with multiple GPUs using NVTabular.
How to build an interactive Plotly dashboard where users can filter on millions of data points in less than a second.

Prerequisites

Intermediate knowledge of Python (list comprehensions, objects)
Familiarity with pandas a plus
Introductory statistics (mean, median, mode)

Outline

Introduction (15 mins)

Meet the instructor.
Create an account at courses.nvidia.com/join

Data on the Hardware Level (60 mins)

Explore the strengths and weaknesses of different hardware approaches to data and the frameworks that support them:

pandas
cuDF
Dask
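
One idea behind frameworks such as Dask is to split a dataset into partitions that can be processed independently and then combined. The following is a minimal sketch of that pattern in plain Python. It does not use the Dask API, and the function names (`split_into_partitions`, `partial_stats`, `combine_mean`) are hypothetical names chosen for illustration.

```python
from typing import Iterable, List, Tuple


def split_into_partitions(values: List[float], n_partitions: int) -> List[List[float]]:
    """Split a list into contiguous chunks of roughly equal size."""
    size = max(1, -(-len(values) // n_partitions))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_stats(partition: Iterable[float]) -> Tuple[float, int]:
    """Compute the (sum, count) of one partition; each call is independent."""
    total, count = 0.0, 0
    for v in partition:
        total += v
        count += 1
    return total, count


def combine_mean(partials: Iterable[Tuple[float, int]]) -> float:
    """Combine per-partition (sum, count) pairs into a global mean."""
    partials = list(partials)
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else float("nan")


values = [float(x) for x in range(1, 11)]
partitions = split_into_partitions(values, 3)
# In a real framework, each partial_stats call could run on a different worker.
mean = combine_mean(partial_stats(p) for p in partitions)
```

Because each partition only returns a small (sum, count) pair, the combine step stays cheap no matter how large the partitions are.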

ETL with NVTabular (120 mins)

Learn how to scale an ETL pipeline from 1 GPU to many with NVTabular through the perspective of a big data recommender system.

Transform raw JSON into analysis-ready Parquet files
Learn how to quickly add features to a dataset with operators such as Categorify and Lambda
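
To make the two steps above concrete, here is a rough sketch in plain Python of turning row-oriented JSON records into columns and then deriving features from them. It is not NVTabular code and writes no Parquet file. The sample records, the function names, and the choice to reserve id 0 for unseen categories are all assumptions made for illustration.

```python
import json
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical newline-delimited JSON records (illustrative data only).
RAW_JSON_LINES = """\
{"user": "u1", "item": "book", "price": 12.5}
{"user": "u2", "item": "game", "price": 40.0}
{"user": "u1", "item": "book", "price": 11.0}
"""


def rows_to_columns(lines: str) -> Dict[str, List[Any]]:
    """Pivot row-oriented JSON records into a column-oriented dict of lists.

    Assumes every record carries the same keys.
    """
    columns: Dict[str, List[Any]] = {}
    for line in lines.splitlines():
        if not line.strip():
            continue
        for key, value in json.loads(line).items():
            columns.setdefault(key, []).append(value)
    return columns


def categorify(values: List[Any]) -> Tuple[List[int], Dict[Any, int]]:
    """Map each distinct value to an integer id in order of first appearance.

    Ids start at 1; 0 is left free for unseen values (an assumption of this sketch).
    """
    vocab: Dict[Any, int] = {}
    for v in values:
        if v not in vocab:
            vocab[v] = len(vocab) + 1
    return [vocab[v] for v in values], vocab


def apply_lambda(values: List[Any], fn: Callable[[Any], Any]) -> List[Any]:
    """Apply a user-supplied function to every element of a column."""
    return [fn(v) for v in values]


columns = rows_to_columns(RAW_JSON_LINES)
columns["item_id"], item_vocab = categorify(columns["item"])
columns["is_expensive"] = apply_lambda(columns["price"], lambda p: p > 20)
```

Storing data by column rather than by row is what lets a single feature be computed without touching the other fields, which is the layout columnar formats such as Parquet are built around.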

Data Visualization (120 mins)

Step into the shoes of a meteorologist and learn how to plot precipitation data on a map.
Learn how to use descriptive statistics and plots like histograms in order to assess data quality

Learn effective memory usage, so users can quickly filter data through a graphical interface
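
As an illustration of how descriptive statistics and a histogram can expose data quality problems, the sketch below summarizes a small list of precipitation readings using only the Python standard library. The readings are made-up values, and treating `-999.0` as a missing-data sentinel is an assumption of this example, not something taken from the workshop dataset.

```python
import statistics
from collections import Counter
from typing import Dict, List

# Hypothetical daily precipitation in millimetres; -999.0 is an assumed
# sentinel for a missing reading.
readings_mm = [0.0, 0.0, 1.2, 3.4, 0.0, 12.7, -999.0, 2.1, 0.4, 48.0]


def summarize(values: List[float]) -> Dict[str, float]:
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


def histogram(values: List[float], bin_width: int) -> Dict[int, int]:
    """Count values per bin, keyed by the lower edge of each bin."""
    bins = Counter(int(v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))


# The raw minimum is physically impossible for rainfall, which flags the sentinel.
raw_summary = summarize(readings_mm)

# Drop negative readings before computing statistics that will be reported.
valid = [v for v in readings_mm if v >= 0]
clean_summary = summarize(valid)
clean_hist = histogram(valid, 10)
```

Comparing `raw_summary` with `clean_summary` shows how a single bad value can distort the mean while leaving the median nearly unchanged.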

Final Project: Data Detective (60 mins)

Users are complaining that the dashboard is too slow. Apply the techniques learned in class to find and eliminate inefficiencies in the backend code.
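
A common first step in this kind of investigation is to time the suspect code path and compare it against an alternative that returns the same answer. The sketch below is a generic illustration and not the project's actual backend. The range-count query, the `SortedIndex` class, and the `timed` helper are hypothetical.

```python
import bisect
import time
from typing import Any, Callable, List, Tuple


def count_in_range_slow(values: List[int], low: int, high: int) -> int:
    """Scan every value on each query: O(n) per filter request."""
    return sum(1 for v in values if low <= v <= high)


class SortedIndex:
    """Sort once up front so each range query is two binary searches."""

    def __init__(self, values: List[int]) -> None:
        self._sorted = sorted(values)

    def count_in_range(self, low: int, high: int) -> int:
        left = bisect.bisect_left(self._sorted, low)
        right = bisect.bisect_right(self._sorted, high)
        return right - left


def timed(fn: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


values = [(i * 37) % 1000 for i in range(100_000)]
index = SortedIndex(values)  # one-time cost, paid before any user query

slow_result, slow_seconds = timed(count_in_range_slow, values, 100, 199)
fast_result, fast_seconds = timed(index.count_in_range, 100, 199)

# Always confirm the optimized path returns the same answer before comparing times.
assert slow_result == fast_result
```

Measured times vary by machine, so the useful output is the comparison between `slow_seconds` and `fast_seconds` on the same hardware rather than either number on its own.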

Final Review (15 mins)

Review key learnings and answer questions
Complete the assessment and earn your certificate
Complete the workshop survey
Learn how to set up your own AI application development environment

Next Steps

Continue learning with these DLI trainings:

Fundamentals of Accelerated Computing with CUDA Python
Fundamentals of Accelerated Data Science
High-Performance Computing with Containers

Accelerating Data Engineering Pipelines (ADEP)
