What are Directed Acyclic Graphs and How to Create Them Using Airflow

Understand the concept of Directed Acyclic Graphs in Apache Airflow and learn how to create your first DAG

By 
Harshil Patel

If you work in data engineering, you're almost certainly acquainted with at least one of two terms: Airflow Data Pipeline or DAG. We all love graphs; they give a visual representation of something that may be difficult to understand otherwise. All tasks are organized logically, with defined operations occurring at predetermined intervals and explicit links to other tasks. Even better, graphs can convey a lot of information quickly and succinctly without being crowded. In this article, we will learn:

  • What Is Apache Airflow, and How Airflow works?
  • Terms and Features of Apache Airflow 
  • Everything About Directed Acyclic Graphs (DAG) and Operators. 
  • How to Create an Apache Airflow DAG

What Is Apache Airflow?

Apache Airflow is an open-source MLOps and data tool for modeling and running data pipelines. Airflow lets you write workflows as directed acyclic graphs (DAGs) of tasks. Airflow can distribute task execution across workers, for example on Celery or Kubernetes, and gives users fine-grained control over individual tasks, including the ability to execute code locally.

The Airflow scheduler tells each task what to do without friction and without negotiating with other frameworks for CPU time, storage space, network bandwidth, or any other shared resources.

Apache Airflow logo

Key Components 

Scalable: Airflow has a modular design and communicates with and orchestrates an arbitrary number of workers via a message queue.

Dynamic: Airflow pipelines are programmed in Python and may be generated dynamically. Users can write code that dynamically creates pipelines as a result of this feature.

Extensible: With Airflow, you can easily construct your own operators and executors and modify the library to meet your environment's degree of abstraction.

Elegant: Airflow pipelines are simple and straightforward. The Jinja templating engine is built into the core of Airflow, allowing you to parameterize your scripts.

Apache Airflow Features

  • Open source, simple to use, and well-supported by the community.
  • You can create your workflows using Python; this gives you complete flexibility.
  • A sophisticated and contemporary online application that allows you to monitor, schedule, and manage your processes.
  • Provides a web interface to allow better visibility and administration.

Terms of Airflow

Directed Acyclic Graph (DAG)

To understand DAGs, we must first understand what directed graphs are. Directed graphs, as the name suggests, have edges with a direction, each pointing from one node to another.

Directed Graph
Directed Acyclic Graph


Directed graphs can be cyclic, as seen in the figure above: you can see how the edges loop back to form a cycle, whereas DAGs are acyclic. The central concept of Airflow is the DAG (Directed Acyclic Graph), which collects tasks and organizes them with dependencies and linkages to specify how they should execute.

A DAGRun is created whenever a DAG is triggered. A DAGRun can be thought of as an instance of the DAG with an execution timestamp. Let's look at an example to better understand how a DAG works.

Recommended Reading: Learn more about DAGrun

Airflow DAG Example

Let's take a Simple DAG Example: 

Here in the figure below, each task is represented by one of the nodes A to E.

  • Node A could be downloading data
  • Node B could be sending that data for processing
  • Node C could be monitoring the process
  • Node D could be getting insights
  • Node E could be alerting the DAG’s Owner.
Nodes representing each task

So, to obtain a thorough understanding of the process, you must complete all of the tasks in order. For example, to successfully monitor and analyze the data, you must first download it and send it for processing. Put another way, you must follow and finish the steps in sequence, and none of them can loop back to an earlier step.
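The ordering constraint described above can be sketched without Airflow at all, using Python's standard-library graphlib (Python 3.9+). The dependency mapping below is a hypothetical rendering of the A-to-E figure, where each key lists the tasks that must finish before it:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical dependencies from the figure: each key's set
# holds the tasks that must complete before the key can run.
deps = {
    "B": {"A"},  # processing needs the downloaded data
    "C": {"B"},  # monitoring needs the processing step
    "D": {"C"},  # insights need the monitored process
    "E": {"D"},  # alerting the owner comes last
}

# static_order() yields the tasks so that every dependency
# appears before the task that needs it.
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['A', 'B', 'C', 'D', 'E']
```

Because the graph is a simple chain, only one task is ever ready at a time, so the execution order is fully determined; Airflow's scheduler enforces the same kind of ordering over a real DAG's tasks.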

DAG Properties

  • DAGs are defined in Python files placed inside the Airflow DAG folder.
  • The transitive reduction of a DAG is unique, unlike that of a general directed graph.
  • Each node carries an ID string used as a label for storing its computed value.
  • dag_id serves as a unique identifier for the DAG.
  • start_date tells Airflow when your DAG should start running.
  • default_args is a dictionary of defaults passed as constructor keyword arguments when initializing operators.
Recommended Reading: Learn More About DAG

The Use of a Directed Acyclic Graph:

  • The DAG identifies subexpressions that occur repeatedly (common subexpressions).
  • The DAG determines which names are used inside the block and which names are computed outside the block.
  • It records the inputs and outputs of each arithmetic operation performed in the code, allowing a compiler to efficiently eliminate common subexpressions.
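To make the common-subexpression point concrete, here is a toy sketch (not a real compiler) of building an expression DAG: identical subexpressions hash to the same node, so a repeated term like (a + b) is created once and shared. All names here are illustrative:

```python
# Maps (operator, left-child, right-child) -> node id.
nodes = {}

def node(op, left=None, right=None):
    key = (op, left, right)
    if key not in nodes:       # reuse an identical subexpression
        nodes[key] = len(nodes)
    return nodes[key]

# Build the DAG for (a + b) * (a + b):
a, b = node("a"), node("b")
e1 = node("+", a, b)
e2 = node("+", a, b)           # same key -> same node, not a new one
product = node("*", e1, e2)

print(e1 == e2)    # True: the '+' subexpression is shared
print(len(nodes))  # 4 nodes total: a, b, '+', '*'
```

This sharing is exactly what lets a compiler compute (a + b) once instead of twice.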
Recommended Reading - Developing a Protocol for Observational Comparative Effectiveness Research: A User's Guide.

Operator

While DAGs describe how to run a workflow, operators determine what actually gets done. An operator describes a single task in a workflow. Operators usually run independently and do not require resources from other operators.

Operator Properties

  • It defines the nature of the task and how it should be executed.
  • When an operator is instantiated, the task becomes a node in the DAG.
  • It can automatically retry the task in case of failure.

Operators play a crucial role in the Airflow workflow. Operators come in a variety of types; the following are some of the most commonly used:

  • BashOperator - Executes a bash command
  • PythonOperator - Calls an arbitrary Python function
  • EmailOperator - Sends an email
  • MySqlOperator, SqliteOperator, PostgresOperator - Runs a SQL command
  • SimpleHttpOperator - Sends an HTTP request

And many more; you can view the complete list in the Airflow operators documentation.


Types of Operators

  • Action Operators perform a specific action (EmailOperator, BashOperator, etc.).
  • Transfer Operators move data from one system to another.
  • Sensor Operators wait for data to arrive at a defined path/location.
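The relationship between operators, tasks, and the DAG can be modeled in a few lines of plain Python. The sketch below is a toy model, not Airflow's real classes: instantiating an "operator" adds a task node to a "DAG", and the >> operator records a downstream dependency, mirroring Airflow's syntax:

```python
class Dag:
    """Toy container for task nodes and dependency edges."""
    def __init__(self):
        self.tasks, self.edges = [], []

class Operator:
    """Toy operator: instantiation creates a node in the DAG."""
    def __init__(self, task_id, dag):
        self.task_id, self.dag = task_id, dag
        dag.tasks.append(task_id)

    def __rshift__(self, other):
        # 'a >> b' means b runs after a; return b to allow chaining.
        self.dag.edges.append((self.task_id, other.task_id))
        return other

dag = Dag()
extract = Operator("extract", dag)
load = Operator("load", dag)
extract >> load

print(dag.tasks)  # ['extract', 'load']
print(dag.edges)  # [('extract', 'load')]
```

Real Airflow operators work the same way conceptually: each instantiation registers one task node, and >> wires the edges of the DAG.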

Now, let's take a look at how to create a DAG in Airflow.

Recommended Reading: 101 Guide on Apache Airflow Operators

Creating an Apache Airflow DAG 

Installing Airflow

pip install apache-airflow

If you want to add extra packages to enable certain features, say the Microsoft Azure subpackage, use the following command:

pip install 'apache-airflow[azure]'

View Installation Guide to learn more. 

Creating a DAG

# Step 1: import the DAG object and the operators we need

from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

# Step 2: default arguments applied to every task in the DAG

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2021, 11, 4),
    'retries': 0
}

# Step 3: instantiate the DAG

dag = DAG(dag_id='DAG-1', default_args=default_args, catchup=False, schedule_interval='@once')

# Step 4: instantiate the tasks (each operator becomes a node)

start = DummyOperator(task_id='start', dag=dag)
end = DummyOperator(task_id='end', dag=dag)

# Step 5: set the dependency: start runs before end

start >> end


Now it's time to save the code. Make sure to put the file in the DAG folder.

This code (DAG) is stored in the dags subdirectory of our AIRFLOW_HOME directory, and dagsfilename.py is the file's name.


………….
├── airflow-webserver.out
├── airflow-webserver.pid
├── airflow.cfg
├── airflow.db
├── airflow.db-journal
├── dags
│   │── __pycache__
│   │── dagsfilename.py

Running the DAG

To execute the DAG, use the following command to start the Airflow scheduler:

airflow scheduler

Now, in the Airflow web server, you will see your DAG listed by name.

Airflow web server displaying the created DAG file name

Now, you can switch the toggle on and execute your DAG.

The toggle ON option executing the DAG

We have successfully created our first DAG; you can now experiment with other configurations and learn more about DAGs.

Benefits of Using Apache Airflow

The following are some of the benefits of Airflow over other MLOps tools.

Interface 

Airflow makes it simple to troubleshoot jobs in production by providing quick access to the logs of each task run through its web UI. Its easy-to-use monitoring and management interface allows users to get a rapid overview of the progress of various tasks and to trigger and clear jobs or DAG runs.

Dynamic Pipeline Generation

Because Airflow pipelines are configuration-as-code (Python), they enable dynamic pipeline generation. This makes it possible to write code that creates pipeline instances on the fly. From concept to model, you can productionize data pipelines with Apache Airflow.

Recommended Reading: How to Automate Data Pipelines with Apache Airflow?

Queries can be automated

There are a variety of operators set up in Airflow to run code. Airflow includes operators for most databases, and because it's written in Python, it features a PythonOperator that lets you easily deploy Python code.

Retry policy built-in

It features a built-in auto-retry policy that may be customized. For example, you can use the function:

retries: The number of times a task will be retried before it is marked as failed.

retry_delay: The delay between retries (a timedelta).

retry_exponential_backoff: Enables exponentially growing wait times between retries.

max_retry_delay: The maximum delay between retries (a timedelta).
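To illustrate how these settings combine, here is a simplified stdlib-only model of exponential backoff with a cap (Airflow's actual implementation also adds jitter; the values below are illustrative):

```python
from datetime import timedelta

# Illustrative settings, mirroring the parameters described above.
retries = 4
retry_delay = timedelta(minutes=5)
max_retry_delay = timedelta(minutes=30)

def wait_before(try_number):
    """Wait doubles on each retry, capped at max_retry_delay."""
    delay = retry_delay * (2 ** (try_number - 1))
    return min(delay, max_retry_delay)

for n in range(1, retries + 1):
    print(n, wait_before(n))
# retry 1 -> 5 min, retry 2 -> 10 min,
# retry 3 -> 20 min, retry 4 -> capped at 30 min
```

After the fourth failed retry the task would be marked as failed, since retries is exhausted.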

Workflow monitoring

Airflow allows workflows to be built programmatically, so tasks, for instance, may be created on the fly inside a DAG. Sub-DAGs and XComs additionally allow for the creation of complex dynamic workflows.

For example, dynamic DAGs may be built depending on variables or connections defined in the Airflow UI.
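The dynamic-generation idea boils down to a loop that stamps out one task per entry in some configuration. In a real DAG file, each iteration would instantiate an Airflow operator; the sketch below (with a hypothetical config list) just collects the generated task IDs:

```python
# Hypothetical configuration, e.g. loaded from an Airflow Variable.
sources = ["orders", "customers", "payments"]

# One extract task per source, generated on the fly.
task_ids = [f"extract_{name}" for name in sources]
print(task_ids)
# ['extract_orders', 'extract_customers', 'extract_payments']
```

Adding a new source to the configuration would add a new task to the DAG without touching the pipeline code itself.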

Wrapping Up

Apache Airflow is a powerful tool for managing and orchestrating workflows. Whether you need to schedule data exports, carry out nightly malware scans, or run a promotion campaign, Apache Airflow can handle it. It manages the complex, resource-intensive work of running applications, and these applications can be scheduled to run according to different runtimes or demand curves without manual intervention.

Airflow provides a series of features that you might find helpful in your data wrangling workflow. We talked about Airflow, its benefits, and how DAGs function. I hope you enjoyed the article!
