Image by Author
There are many courses and resources available on machine learning and data science, but very few on data engineering. This raises some questions. Is it a difficult field? Does it pay poorly? Is it considered less exciting than other tech roles? In reality, many companies are actively seeking data engineering talent and offering substantial salaries, sometimes exceeding $200,000 USD. Data engineers play a crucial role as the architects of data platforms, designing and building the foundational systems that enable data scientists and machine learning experts to work effectively.
To address this industry gap, DataTalksClub has launched a free bootcamp, “Data Engineering Zoomcamp”. The course is designed to give beginners and professionals looking to switch careers the essential skills and practical experience needed in data engineering.
This is a 6-week bootcamp where you will learn through multiple courses, reading materials, workshops, and projects. At the end of each module, you will be given homework to practice what you have learned.
- Week 1: Introduction to GCP, Docker, Postgres, Terraform, and environment setup.
- Week 2: Workflow orchestration with Mage.
- Week 3: Data warehousing with BigQuery and machine learning with BigQuery.
- Week 4: Analytics engineering with dbt, Google Data Studio, and Metabase.
- Week 5: Batch processing with Spark.
- Week 6: Streaming with Kafka.
Image from DataTalksClub/data-engineering-zoomcamp
The syllabus contains 6 modules, 2 workshops, and a project that together cover everything needed to become a professional data engineer.
Module 1: Mastering Containerization and Infrastructure as Code
In this module, you will learn about Docker and Postgres, starting with the basics and advancing through detailed tutorials on creating data pipelines, running Postgres with Docker, and more.
The module also covers essential tools like pgAdmin and Docker Compose, along with SQL refresher topics, optional content on Docker networking, and a special walk-through for Windows Subsystem for Linux users. At the end, the course introduces you to GCP and Terraform, providing a holistic understanding of containerization and infrastructure as code, both essential for modern cloud-based environments.
Module 2: Workflow Orchestration Techniques
The module offers an in-depth exploration of Mage, an innovative open-source hybrid framework for data transformation and integration. It begins with the basics of workflow orchestration and progresses to hands-on exercises with Mage, including setting it up via Docker and building ETL pipelines from an API to Postgres and Google Cloud Storage (GCS), and then into BigQuery.
The module’s blend of videos, resources, and practical tasks ensures a comprehensive learning experience, equipping learners with the skills to manage sophisticated data workflows using Mage.
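To make the idea of workflow orchestration concrete, here is a minimal sketch in plain Python. It is not Mage's API: the three tasks (`extract`, `transform`, `load`) and their sample data are hypothetical, and the standard library's `graphlib` stands in for the orchestrator. It shows only the core job of running each task after the tasks it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks for illustration; a real pipeline would call an API,
# clean the records, and write to Postgres, GCS, or BigQuery.
def extract(ctx):
    ctx["raw"] = [{"trip_id": 1, "fare": "12.5"}, {"trip_id": 2, "fare": "7.0"}]

def transform(ctx):
    # Convert the fare from text to a number.
    ctx["clean"] = [{**row, "fare": float(row["fare"])} for row in ctx["raw"]]

def load(ctx):
    # Stand-in for a database write: just record how many rows would be loaded.
    ctx["loaded_rows"] = len(ctx["clean"])

TASKS = {"extract": extract, "transform": transform, "load": load}

# Each task maps to the set of tasks it depends on.
DEPENDENCIES = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}

def run_pipeline():
    """Run every task in dependency order and return the shared context."""
    ctx = {"order": []}
    for name in TopologicalSorter(DEPENDENCIES).static_order():
        TASKS[name](ctx)
        ctx["order"].append(name)
    return ctx

if __name__ == "__main__":
    result = run_pipeline()
    print(result["order"], result["loaded_rows"])
```

A production orchestrator adds scheduling, retries, and monitoring on top of this dependency ordering.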
Workshop 1: Data Ingestion Strategies
In the first workshop, you will master building efficient data ingestion pipelines. The workshop focuses on essential skills like extracting data from APIs and files, normalizing and loading data, and incremental loading techniques. After completing this workshop, you will be able to create efficient data pipelines like a senior data engineer.
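Incremental loading can be sketched in a few lines. The example below is an illustration under stated assumptions, not the workshop's own code: it uses an in-memory SQLite database in place of a real destination, and a hypothetical `trips` table with an `updated_at` column acting as the high-water mark. Each run inserts only the rows newer than what has already been loaded.

```python
import sqlite3

def incremental_load(conn, source_rows):
    """Insert only rows newer than the latest `updated_at` already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trips (id INTEGER PRIMARY KEY, updated_at TEXT)"
    )
    # High-water mark: the newest timestamp loaded so far (None on the first run).
    (watermark,) = conn.execute("SELECT MAX(updated_at) FROM trips").fetchone()
    new_rows = [
        row for row in source_rows
        if watermark is None or row["updated_at"] > watermark
    ]
    conn.executemany(
        "INSERT INTO trips (id, updated_at) VALUES (:id, :updated_at)", new_rows
    )
    conn.commit()
    return len(new_rows)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    batch_1 = [
        {"id": 1, "updated_at": "2024-01-01"},
        {"id": 2, "updated_at": "2024-01-02"},
    ]
    batch_2 = batch_1 + [{"id": 3, "updated_at": "2024-01-03"}]
    print(incremental_load(conn, batch_1))  # first run loads everything
    print(incremental_load(conn, batch_2))  # second run loads only the new row
```

The comparison relies on ISO-formatted date strings sorting in chronological order. A real pipeline would also need to handle late-arriving and updated records.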
Module 3: Data Warehousing
The module is an in-depth exploration of data storage and analysis, focusing on data warehousing with BigQuery. It covers key concepts such as partitioning and clustering, and dives into BigQuery’s best practices. The module then progresses to advanced topics, particularly the integration of machine learning (ML) with BigQuery, highlighting the use of SQL for ML and providing resources on hyperparameter tuning, feature preprocessing, and model deployment.
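The benefit of partitioning can be illustrated without BigQuery. The sketch below is a plain-Python analogy, not BigQuery's implementation: rows are grouped by a hypothetical `pickup_date` column, so a query filtered on that date reads one partition instead of scanning the whole table.

```python
from collections import defaultdict

def partition_by(rows, key):
    """Group rows into partitions keyed by the value of `key`."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return partitions

def total_fare(partitions, pickup_date):
    """Sum fares for one date, touching only that date's partition.

    Returns (total, rows_scanned) so the pruning effect is visible.
    """
    scanned = partitions.get(pickup_date, [])
    return sum(row["fare"] for row in scanned), len(scanned)

# Hypothetical sample rows for illustration.
ROWS = [
    {"pickup_date": "2024-01-01", "fare": 10.0},
    {"pickup_date": "2024-01-01", "fare": 5.0},
    {"pickup_date": "2024-01-02", "fare": 8.0},
]

if __name__ == "__main__":
    parts = partition_by(ROWS, "pickup_date")
    print(total_fare(parts, "2024-01-01"))
```

Clustering is loosely comparable to sorting the rows inside each partition by another column, so that filters on that column can skip blocks of data as well.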
Module 4: Analytics Engineering
The analytics engineering module focuses on building a project using dbt (Data Build Tool) with an existing data warehouse, either BigQuery or PostgreSQL.
The module covers setting up dbt in both cloud and local environments, introducing analytics engineering concepts, ETL vs ELT, and data modeling. It also covers advanced dbt features such as incremental models, tags, hooks, and snapshots.
At the end, the module introduces techniques for visualizing transformed data using tools like Google Data Studio and Metabase, and it provides resources for troubleshooting and efficient data loading.
Module 5: Proficiency in Batch Processing
This module covers batch processing using Apache Spark, starting with introductions to batch processing and Spark, along with installation instructions for Windows, Linux, and macOS.
It includes exploring Spark SQL and DataFrames, preparing data, performing SQL operations, and understanding Spark internals. It concludes with running Spark in the cloud and integrating Spark with BigQuery.
Module 6: The Art of Streaming Data with Kafka
The module begins with an introduction to stream processing concepts, followed by an in-depth exploration of Kafka, including its fundamentals, integration with Confluent Cloud, and practical applications involving producers and consumers.
The module also covers Kafka configuration and streams, addressing topics like stream joins, testing, windowing, and the use of Kafka ksqlDB and Connect. Additionally, it extends its focus to Python and JVM environments, featuring Faust for Python stream processing, PySpark Structured Streaming, and Scala examples for Kafka Streams.
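Windowing is easier to picture with a small example. The following is a plain-Python sketch of a tumbling (fixed, non-overlapping) window count over a finite list of hypothetical `(epoch_seconds, key)` events. It does not use Kafka Streams, Faust, or any streaming library, and it ignores concerns such as late or out-of-order events that real stream processors handle.

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) for fixed, non-overlapping windows."""
    counts = Counter()
    for timestamp, key in events:
        # Align the timestamp down to the start of its window.
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)

# Hypothetical (epoch_seconds, key) events for illustration.
EVENTS = [
    (0, "zone_a"),
    (30, "zone_a"),
    (59, "zone_b"),
    (60, "zone_a"),
    (125, "zone_b"),
]

if __name__ == "__main__":
    for window, count in sorted(tumbling_window_counts(EVENTS, 60).items()):
        print(window, count)
```

With 60-second windows, the events at 0 and 30 seconds fall into the same window, while the event at 60 seconds starts a new one.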
Workshop 2: Stream Processing with SQL
You will learn to process and manage streaming data with RisingWave, which provides a cost-efficient solution with a PostgreSQL-style experience to power your stream processing applications.
Project: Real-World Data Engineering Application
The objective of this project is to apply all the concepts learned in this course to construct an end-to-end data pipeline. You will create a dashboard consisting of two tiles by selecting a dataset, building a pipeline for processing the data and storing it in a data lake, building a pipeline for moving the processed data from the data lake to a data warehouse, transforming the data in the data warehouse and preparing it for the dashboard, and finally building a dashboard to present the data visually.
2024 Cohort Details
- Registration: Enroll Now
- Start date: January 15, 2024, at 17:00 CET
- Self-paced learning with guided support
- Cohort folder with homeworks and deadlines
- Interactive Slack Community for peer learning
Prerequisites
- Basic coding and command line experience
- Foundation in SQL
- Python: helpful but not mandatory
Expert Instructors Leading Your Journey
- Ankush Khanna
- Victoria Perez Mola
- Alexey Grigorev
- Matt Palmer
- Luis Oliveira
- Michael Shoemaker
Join our 2024 cohort and start learning with an amazing data engineering community. With expert-led training, hands-on experience, and a curriculum tailored to the needs of the industry, this bootcamp not only equips you with the necessary skills but also positions you at the forefront of a lucrative and in-demand career path. Enroll today and transform your aspirations into reality!
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master’s degree in Technology Management and a bachelor’s degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.