
Data Engineering Nanodegree - Part 4 - Data Pipelines With Airflow

Data Engineering

Data Pipelines With Airflow

Module 1

Data Pipelines

Q: What is a data pipeline?Answer: A series of steps in which data is processed. Example:
  • Automated marketing emails
  • Real-time pricing in rideshare apps
  • Targeted advertising based on browsing history
Remarks: Pipelines typically use either ETL or ELT.

Q: Give an example of a data pipeline to accomplish the following: a bikeshare company wants to efficiently improve availability of bikes.Answer:
  1. Load application event data from a source such as S3 or Kafka
  2. Load the data into an analytic warehouse such as RedShift.
  3. Perform data transformations that identify high-traffic bike docks so the business can determine where to build additional locations.

Q: Describe the difference between ETL and ELT.Answer:
  • ETL is normally a continuous, ongoing process with a well-defined workflow. ETL first extracts data from homogeneous or heterogeneous data sources. Then, data is cleansed, enriched, transformed, and stored either back in the lake or in a data warehouse.
  • ELT (Extract, Load, Transform) is a variant of ETL wherein the extracted data is first loaded into the target system. Transformations are performed after the data is loaded into the data warehouse.
Remarks: ELT typically works well when the target system is powerful enough to handle the transformations. Analytical databases like Amazon Redshift and Google BigQuery are common ELT targets.

Q: What is S3?Answer:
  • Amazon S3 has a simple web services interface that you can use to store and retrieve any amount of data, at any time, from anywhere on the web. 
  • It gives any developer access to the same highly scalable, reliable, fast, inexpensive data storage infrastructure that Amazon uses to run its own global network of web sites.
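A minimal sketch of storing and retrieving an object with the boto3 SDK; the bucket and key names are made up for illustration:
```
import boto3

s3 = boto3.client("s3")  # credentials come from the environment or an IAM role

# Upload a local file to a bucket (hypothetical bucket and key)
s3.upload_file("trips.csv", "udac-bikeshare", "raw/trips.csv")

# Download the same object back
s3.download_file("udac-bikeshare", "raw/trips.csv", "trips_copy.csv")
```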

Q: What is Kafka?Answer:
  • Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache Software Foundation, written in Scala and Java. 
  • The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
Remarks:
  • Its storage layer is essentially a massively scalable pub/sub message queue designed as a distributed transaction log, making it highly valuable for enterprise infrastructures to process streaming data.
  • Source: Wikipedia
  • If you want to learn more, start here.
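For illustration, a minimal producer/consumer sketch using the third-party kafka-python package; the broker address and topic name are assumptions:
```
from kafka import KafkaProducer, KafkaConsumer

# Publish a bikeshare event to a topic (hypothetical broker and topic)
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("bike-events", value=b'{"dock_id": 42, "event": "undock"}')
producer.flush()

# Read events back from the beginning of the same topic
consumer = KafkaConsumer("bike-events", bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)
```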

Q: What is RedShift?Answer: Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Remarks:
  • The first step to create a data warehouse is to launch a set of nodes, called an Amazon Redshift cluster. 
  • After you provision your cluster, you can upload your data set and then perform data analysis queries.
  • Regardless of the size of the data set, Amazon Redshift offers fast query performance using the same SQL-based tools and business intelligence applications that you use today.
  • If you want to learn more, start here.
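A rough sketch of the upload-then-query flow using psycopg2; the cluster endpoint, credentials, table, and IAM role ARN are all placeholders:
```
import psycopg2

# Placeholder connection details for an existing Redshift cluster
conn = psycopg2.connect(
    "host=<cluster-endpoint> dbname=dev user=awsuser password=<password> port=5439")
cur = conn.cursor()

# Bulk-load data from S3, then run an analysis query with plain SQL
cur.execute("""
    COPY trips FROM 's3://udac-bikeshare/raw/trips.csv'
    IAM_ROLE '<redshift-iam-role-arn>'
    CSV IGNOREHEADER 1;
""")
cur.execute("SELECT COUNT(*) FROM trips;")
print(cur.fetchone())
conn.commit()
```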

Q: What is data validation?Answer: It is the process of ensuring that data is present, correct & meaningful. Remarks: Ensuring the quality of your data through automated validation checks is a critical step in building data pipelines at any organization.

Q: List some data validation steps for the bikeshare example.Answer: After loading from S3 to Redshift:
  • Validate that the number of rows in Redshift matches the number of records in S3
Once location business analysis is complete:
  • Validate that all locations have a daily visit average greater than 0
  • Validate that the number of locations in our output table matches the number of locations in the input table.
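As a sketch, checks like these can be expressed as plain Python functions that a pipeline task calls and that raise on failure; the function and argument names are illustrative:
```
def check_row_counts(redshift_count, s3_record_count):
    # Fail the pipeline if the loaded row count does not match the source
    if redshift_count != s3_record_count:
        raise ValueError(
            f"Row count mismatch: Redshift has {redshift_count}, S3 has {s3_record_count}")

def check_daily_visits(location_rows):
    # location_rows: iterable of (location_id, daily_visit_average) tuples
    bad = [loc for loc, avg_visits in location_rows if avg_visits <= 0]
    if bad:
        raise ValueError(f"Locations with a daily visit average of 0 or less: {bad}")
```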

Q: Are there real world cases where a data pipeline is not a DAG?Answer:
  • It is possible to model a data pipeline that is not a DAG, meaning that it contains a cycle within the process. 
  • However, the vast majority of use cases for data pipelines can be described as a directed acyclic graph (DAG). 
  • This makes the code more understandable and maintainable.

Q: Can we have two different pipelines for the same data and can we merge them back together?Answer: Yes. It's not uncommon for a data pipeline to take the same dataset, perform two different processes to analyze it, then merge the results of those two processes back together.

Q: How could the bikeshare example be modeled as a DAG?Answer: As a directed acyclic graph whose nodes are the pipeline steps (extract trip and event data from S3 or Kafka, load it into Redshift, transform it to identify high-traffic bike docks, and validate the results) and whose edges are the dependencies between those steps.
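A minimal Airflow sketch of that shape, with a placeholder callable standing in for the real extract, load, transform, and validation logic (all names, dates, and the Airflow 1.x import path are illustrative):
```
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path

def placeholder(**kwargs):
    pass  # the real extract/load/transform/validate code would go here

dag = DAG("bikeshare_dag", start_date=datetime(2019, 1, 1), schedule_interval="@hourly")

extract_events = PythonOperator(task_id="extract_events", python_callable=placeholder, dag=dag)
load_to_redshift = PythonOperator(task_id="load_to_redshift", python_callable=placeholder, dag=dag)
find_hot_docks = PythonOperator(task_id="find_high_traffic_docks", python_callable=placeholder, dag=dag)
validate_output = PythonOperator(task_id="validate_output", python_callable=placeholder, dag=dag)

# Edges: each task depends on the one before it
extract_events >> load_to_redshift >> find_hot_docks >> validate_output
```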

Q: What are the five components of airflow?Answer:
  1. The scheduler orchestrates the execution of jobs on a trigger or schedule. It chooses how to prioritize the running and execution of tasks within the system.
  2. The work queue is used by the scheduler in most Airflow installations to deliver tasks that need to be run to the workers.
  3. Worker processes execute the operations defined in each DAG. In most Airflow installations, workers pull from the work queue when it is ready to process a task. When the worker completes the execution of the task, it will attempt to process more work from the work queue until there is no further work remaining.
  4. Database saves credentials, connections, history, and configuration. The database, often referred to as the metadata database, also stores the state of all tasks in the system. 
  5. The web interface provides a control dashboard for users and maintainers. Throughout this course you will see how the web interface allows users to perform tasks such as stopping and starting DAGs, retrying failed tasks, and configuring credentials.

Q: Describe the order of operations for an Airflow DAG.Answer:
  • The scheduler starts DAGs based on time or external triggers.
  • Once a DAG is started, the scheduler looks at the steps within the DAG and determines which steps can run by looking at their dependencies.
  • The scheduler places runnable steps in the queue.
  • Workers pick up those tasks and run them.
  • Once the worker has finished running the step, the final status of the task is recorded and additional tasks are placed by the scheduler until all tasks are complete.
  • Once all tasks have been completed, the DAG is complete. 

Q: What does the airflow scheduler do?Answer: It starts DAGs based on triggers or schedules and moves them towards completion.

Q: What do the Airflow workers do?Answer: They run and record the outcome of individual pipeline tasks.

Q: How to create a DAG in Airflow?Answer: Creating a DAG is easy. Give it:
  1. a name,
  2. a description,
  3. a start date, and
  4. an interval.
Example:
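A minimal sketch; the DAG name, description, and dates are illustrative:
```
from datetime import datetime
from airflow import DAG

dag = DAG(
    "sample_dag",                     # 1. name
    description="An example DAG",     # 2. description
    start_date=datetime(2019, 1, 1),  # 3. start date
    schedule_interval="@daily",       # 4. interval
)
```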

Q: What are operators in Airflow?Answer: They define the atomic steps of work that make up a DAG. Remarks: Instantiated operators are referred to as Tasks. Example:
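A minimal sketch, assuming illustrative task ids and Airflow 1.x import paths:
```
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

dag = DAG("operator_examples", start_date=datetime(2019, 1, 1), schedule_interval="@daily")

def say_hello():
    print("Hello, world!")

# Each instantiated operator below is a Task within the DAG
hello_task = PythonOperator(task_id="hello_world", python_callable=say_hello, dag=dag)
list_task = BashOperator(task_id="list_files", bash_command="ls -l", dag=dag)
```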

Q: What does Airflow do when the start date is in the past?Answer: Airflow will run your DAG as many times as there are schedule intervals between that start date and the current date.

Q: What are the nodes and edges in an Airflow DAG?Answer:
  • Nodes = Tasks
  • Edges = Ordering and dependencies between tasks
Remarks:
Task dependencies can be described programmatically in Airflow using >> and <<
  • a >> b means a comes before b
  • a << b means a comes after b

Q: How to programmatically describe task dependencies in Airflow?Answer:
Task dependencies can be described programmatically in Airflow using >> and <<
  • a >> b means a comes before b
  • a << b means a comes after b
Example:
```
hello_world_task = PythonOperator(task_id='hello_world', ...)
goodbye_world_task = PythonOperator(task_id='goodbye_world', ...)
```

Use `>>` to denote that `goodbye_world_task` depends on `hello_world_task`:
```
hello_world_task >> goodbye_world_task
```

Q: What are Airflow Hooks?Answer: Hooks provide a reusable interface to external systems and databases. Remarks: With hooks, you don’t have to worry about how and where to store connection strings and secrets in your code. Example:
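A sketch using PostgresHook (Airflow 1.x import path); the connection id "redshift" and the trips table are assumptions, with the connection itself configured in the Airflow UI under Admin > Connections:
```
from airflow.hooks.postgres_hook import PostgresHook

def count_trips(**kwargs):
    # The hook looks up host, credentials, and port from the "redshift" connection,
    # so none of those secrets appear in the DAG code
    redshift_hook = PostgresHook(postgres_conn_id="redshift")
    records = redshift_hook.get_records("SELECT COUNT(*) FROM trips")
    print(f"trips table has {records[0][0]} rows")
```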

Q: What's the purpose of runtime variables in Airflow?Answer: They allow users to “fill in the blank” with important runtime variables for tasks. Remarks: See the Apache Airflow documentation on context variables that can be included as kwargs. Example:
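A sketch of reading the execution date from the task context; names are illustrative, and in Airflow 1.x this requires provide_context=True on the PythonOperator:
```
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def log_execution_date(**kwargs):
    # Airflow "fills in the blank": ds is the execution date as YYYY-MM-DD
    print(f"Running for execution date {kwargs['ds']}")

dag = DAG("context_demo", start_date=datetime(2019, 1, 1), schedule_interval="@daily")

log_date_task = PythonOperator(
    task_id="log_execution_date",
    python_callable=log_execution_date,
    provide_context=True,  # Airflow 2.x passes the context automatically
    dag=dag,
)
```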

Q: What is a Task in Airflow?Answer: An instantiated step in a pipeline fully parameterized for execution.

Module 2

Data Quality

Q: What is data lineage?Answer: The data lineage of a dataset describes the discrete steps involved in the creation, movement, and calculation of that dataset.

Q: Why is data lineage important?Answer:
  1. Instilling Confidence: Being able to describe the data lineage of a particular dataset or analysis will build confidence in data consumers (engineers, analysts, data scientists, etc.) that our data pipeline is creating meaningful results using the correct datasets.
  2. Defining Metrics: Another major benefit of surfacing data lineage is that it allows everyone in the organization to agree on the definition of how a particular metric is calculated.
  3. Debugging: Data lineage helps data engineers track down the root of errors when they occur. If each step of the data movement and transformation process is well described, it's easy to find problems when they occur.
Remarks:
  • In general, data lineage has important implications for a business. 
  • Each department or business unit's success is tied to data and to the flow of data between departments. 
  • For example, sales departments rely on data to make sales forecasts, while at the same time the finance department needs to track sales and revenue. 
  • Each of these departments and roles depends on data, and on knowing where to find it. 
  • Data flow and data lineage tools enable data engineers and architects to track the flow of this large web of data.

Q: Why use schedules in airflow?Answer:
  1. Pipeline schedules can reduce the amount of data that needs to be processed in a given run. It helps scope the job to only run the data for the time period since the data pipeline last ran.
  2. Using schedules to select only data relevant to the time period of the given pipeline execution can help improve the quality and accuracy of the analyses performed by our pipeline.
  3. Running pipelines on a schedule will decrease the time it takes the pipeline to run.
  4. An analysis of larger scope can leverage already-completed work. For example, if the aggregates for all months prior to now have already been done by a scheduled job, then we only need to perform the aggregation for the current month and add it to the existing totals.

Q: Which factors should be considered in selecting a time period for scheduling?Answer:

  1. What is the size of data, on average, for a time period? If an entire year's worth of data is only a few KB or MB, then it may be fine to load the entire dataset. If an hour's worth of data is hundreds of MB or even GBs, then you will likely need to schedule your pipeline more frequently.

  2. How frequently is data arriving, and how often does the analysis need to be performed? If our bikeshare company needs trip data every hour, that will be a driving factor in determining the schedule. Alternatively, if we have to load hundreds of thousands of tiny records, even if they don't add up to much in terms of MB or GB, the file access alone will slow down our analysis and we'll likely want to run it more often.

  3. What's the frequency of related datasets? A good rule of thumb is that the frequency of a pipeline's schedule should be determined by the dataset in our pipeline which requires the most frequent analysis. This isn't universally the case, but it's a good starting assumption. For example, if our trips data is updating every hour, but our bikeshare station table only updates once a quarter, we'll probably want to run our trip analysis every hour, and not once a quarter.

Q: What is the scope of a data pipeline?Answer: It is the time delta between the current execution time and the end time of the last execution.

Q: What is data partitioning?Answer: It is the process of isolating data to be analyzed by one or more attributes, such as time, logical type, or data size. Remarks: It often leads to faster and more reliable pipelines.

Q: Why should you use data partitioning?Answer:
  • Pipelines designed to work with partitioned data fail more gracefully
  • Smaller datasets, smaller time periods, and related concepts are easier to debug than big datasets, large time periods, and unrelated concepts. 
  • Partitioning makes rerunning failed tasks much simpler. It also enables easier redos of work, reducing cost and time.
Remarks:
  • Another great thing about Airflow is that if your data is partitioned appropriately, your tasks will naturally have fewer dependencies on each other. 
  • Because of this, Airflow will be able to parallelize execution of your DAGs to produce your results even faster.

Q: What are four common types of data partitioning?Answer:
  1. Location
  2. Logical
  3. Size
  4. Time

Q: What is logical partitioning?Answer: The process of breaking conceptually related data into discrete groups for processing.

Q: What is time partitioning?Answer: It is the process of processing data based on a schedule or when it was created.
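For example, a time-partitioned load might template the S3 prefix with the execution date so each run only touches its own slice; the bucket, table, connection id, and IAM role are assumptions (PostgresOperator shown with its Airflow 1.x import path):
```
from datetime import datetime
from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator  # Airflow 1.x import path

dag = DAG("partitioned_load", start_date=datetime(2019, 1, 1), schedule_interval="@daily")

# {{ ds }} renders as the run's execution date (YYYY-MM-DD), so each run copies only its own partition
load_partition = PostgresOperator(
    task_id="load_daily_trips",
    postgres_conn_id="redshift",
    sql="""
        COPY trips FROM 's3://udac-bikeshare/trips/{{ ds }}/'
        IAM_ROLE '<redshift-iam-role-arn>'
        CSV IGNOREHEADER 1;
    """,
    dag=dag,
)
```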

Q: What is size partitioning?Answer: It is the process of separating data for processing based on desired or required storage limits.

Q: How would you set a requirement for ensuring that data arrives within a certain timeframe of a DAG starting?Answer: Using a service level agreement (SLA).
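In Airflow this is expressed with the sla argument on a task; a minimal sketch with an illustrative one-hour window:
```
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG("sla_demo", start_date=datetime(2019, 1, 1), schedule_interval="@hourly")

load_task = PythonOperator(
    task_id="load_trip_data",
    python_callable=lambda: print("loading..."),
    sla=timedelta(hours=1),  # flag the task if it has not finished within 1 hour of the scheduled run
    dag=dag,
)
```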

Module 3

Production Data Pipelines

Q: What are the most common types of user-created plugins?Answer: Operators and Hooks.

Q: How to create a custom operator?Answer:
  1. Identify Operators that perform similar functions and can be consolidated
  2. Define a new Operator in the plugins folder
  3. Replace the original Operators with your new custom one, re-parameterize, and instantiate them.
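A skeleton of such a plugin operator; the class, fields, and check are illustrative, and the import paths and apply_defaults decorator follow Airflow 1.x:
```
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class HasRowsOperator(BaseOperator):
    # Illustrative consolidation: fail whenever a target table is empty

    @apply_defaults
    def __init__(self, redshift_conn_id="", table="", *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.table = table

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        records = hook.get_records(f"SELECT COUNT(*) FROM {self.table}")
        if not records or records[0][0] < 1:
            raise ValueError(f"Data quality check failed: {self.table} is empty")
```
Placing a class like this in the plugins folder (and registering it via an AirflowPlugin subclass) makes it importable from DAGs like any built-in operator.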

Q: What is Airflow contrib and why is it useful?Answer:
  • It contains Operators and Hooks for common data tools like Apache Spark and Cassandra, as well as vendor-specific integrations for Amazon Web Services, Azure, and Google Cloud Platform.
  • Therefore you should always check it before building your own Airflow plugins to see if what you need already exists.
Remarks: If the functionality exists but it's not quite what you want, that's a great opportunity to add that functionality through an open source contribution.

Q: Which three rules should you follow when designing DAGs?Answer:
DAG tasks should be designed such that they:

  1. Are atomic and have a single purpose

  2. Maximize parallelism

  3. Make failure states obvious
Remarks:
Every task in your DAG should perform only one job.


“Write programs that do one thing and do it well.” - Ken Thompson’s Unix Philosophy

Q: What are the benefits of task boundaries?Answer:
  • Re-visitable: Task boundaries are useful for you if you revisit a pipeline you wrote after a 6 month absence. You'll have a much easier time understanding how it works and the lineage of the data if the boundaries between tasks are clear and well defined. This is true in the code itself, and within the Airflow UI.
  • Tasks that do just one thing are often more easily parallelized. This parallelization can offer a significant speedup in the execution of our DAGs.

Q: What are benefits of SubDAGs?Answer:
  • Decrease the amount of code we need to write and maintain to create a new DAG
  • Easier to understand the high level goals of a DAG
  • Bug fixes, speedups, and other enhancements can be made more quickly and distributed to all DAGs that use that SubDAG
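A sketch of the usual factory-function pattern (Airflow 1.x SubDagOperator; names are illustrative, and note that SubDAGs were later deprecated in favor of TaskGroups):
```
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator

def load_subdag(parent_dag_id, child_task_id, start_date, schedule_interval):
    # Reusable subDAG: its dag_id must be "<parent_dag_id>.<child_task_id>"
    dag = DAG(f"{parent_dag_id}.{child_task_id}", start_date=start_date,
              schedule_interval=schedule_interval)
    DummyOperator(task_id="create_table", dag=dag)
    return dag

parent_dag = DAG("bikeshare_parent", start_date=datetime(2019, 1, 1), schedule_interval="@daily")

load_trips = SubDagOperator(
    task_id="load_trips",
    subdag=load_subdag("bikeshare_parent", "load_trips", datetime(2019, 1, 1), "@daily"),
    dag=parent_dag,
)
```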

Q: What are drawbacks of using SubDAGs?Answer:
  • Limit the visibility within the Airflow UI
  • Abstraction makes understanding what the DAG is doing more difficult
  • Encourages premature optimization

Q: Which pipeline monitoring options does airflow provide?Answer:
  1. SLAs
  2. Emails and alerts
  3. Metrics
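Emails and alerts are typically wired up through default_args; a sketch with placeholder address, retry, and SLA values:
```
from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    "owner": "data-eng",
    "email": ["oncall@example.com"],   # placeholder address
    "email_on_failure": True,          # email when a task fails
    "email_on_retry": False,
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "sla": timedelta(hours=1),         # SLA applied to every task in the DAG
}

dag = DAG("monitored_dag", default_args=default_args,
          start_date=datetime(2019, 1, 1), schedule_interval="@daily")
```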

Q: What's the purpose of SLAs in airflow?Answer:
  • SLAs define the time by which a DAG must complete.
  • For time-sensitive applications they are critical for developing trust amongst your pipeline customers and ensuring that data is delivered while it is still meaningful. 
  • Slipping SLAs can also be early indicators of performance problems, or of a need to scale up the size of your Airflow cluster.

Acronyms

SLA: Service Level Agreement Wiki link

DAG: Directed Acyclic Graph Wiki link