If you work in data engineering, you're almost certainly acquainted with two terms: the Airflow data pipeline and the DAG. We all love graphs; they give a visual representation of something that may be hard to grasp otherwise. Tasks are organized logically, with defined operations occurring at predetermined intervals and explicit links to other tasks. Even better, graphs can convey a lot of information quickly and succinctly without becoming crowded. In this article, we will learn:
- What Apache Airflow Is and How It Works
- Terms and Features of Apache Airflow
- Everything About Directed Acyclic Graphs (DAGs) and Operators
- How to Create an Apache Airflow DAG
What Is Apache Airflow?
Apache Airflow is an open-source MLOps and data orchestration tool for modeling and running data pipelines. Airflow lets you write workflows as directed acyclic graphs (DAGs) of tasks. Those tasks can be executed on a variety of backends, from a single machine to distributed executors such as Celery or Kubernetes, and Airflow gives users fine-grained control over individual tasks, including the ability to execute code locally.
The Airflow scheduler hands each task to a worker as soon as its dependencies are met, without you having to negotiate with other frameworks for CPU time, storage space, network bandwidth, or any other shared resources.
Key Principles
Scalable: Airflow has a modular architecture and uses a message queue to communicate with and orchestrate an arbitrary number of workers.
Dynamic: Airflow pipelines are defined in Python and can be generated dynamically, so users can write code that creates pipelines on the fly.
Extensible: With Airflow, you can easily define your own operators and executors and extend the library to fit the level of abstraction your environment requires.
Elegant: Airflow pipelines are lean and explicit. The Jinja templating engine is built into the core of Airflow, allowing you to parameterize your scripts (a small templating sketch follows this list).
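As a rough sketch of that last point (assuming Airflow 2.x import paths), a BashOperator command can be parameterized with Jinja; the dag_id and dates below are placeholders:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# {{ ds }} is a built-in Jinja macro that Airflow renders to the
# logical date (YYYY-MM-DD) of each DAG run.
with DAG(dag_id="templating_sketch", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    print_date = BashOperator(
        task_id="print_run_date",
        bash_command="echo 'Processing data for {{ ds }}'",
    )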
Apache Airflow Features
- Open source, simple to use, and well-supported by the community.
- You can create your workflows using Python; this gives you complete flexibility.
- A sophisticated, modern web application that allows you to monitor, schedule, and manage your workflows.
- The web interface gives you better visibility into your pipelines and makes administration easier.
Terms of Airflow
Directed Acyclic Graph (DAG)
To understand DAGs, we must first understand directed graphs. In a directed graph, as the name suggests, every edge has a direction: it points from one node to another.
A directed graph may be cyclic: its edges can loop back on themselves to form a cycle. A DAG, by contrast, is acyclic and can never revisit a node. The central concept in Airflow is the DAG (Directed Acyclic Graph), which collects tasks and organizes them with dependencies and relationships that specify how they should execute.
A DAGRun is created whenever a DAG is triggered. A DAGRun can be thought of as an instance of the DAG stamped with an execution time. Let's look at an example to better understand how a DAG works.
Recommended Reading: Learn more about DAGrun
Airflow DAG Example
Let's take a Simple DAG Example:
Here, each task is represented by one of the nodes A to E.
- Node A could be downloading data
- Node B could be sending that data for processing
- Node C could be monitoring the process
- Node D could be getting insights
- Node E could be alerting the DAG’s Owner.
So, to complete the whole procedure, every upstream task must finish first. For example, to successfully monitor and analyze the data, you must first download it and send it for processing. Put another way, you must follow and finish the steps in order, and they cannot be repeated or looped. One way to express this example as Airflow code is sketched below.
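Here is how the A-to-E example might look as a minimal sketch, using EmptyOperator placeholders (available in Airflow 2.3+) instead of real work; all names are illustrative:

from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator  # placeholder tasks

with DAG(dag_id="simple_dag_sketch", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    download = EmptyOperator(task_id="A_download_data")
    process = EmptyOperator(task_id="B_send_for_processing")
    monitor = EmptyOperator(task_id="C_monitor_process")
    insights = EmptyOperator(task_id="D_get_insights")
    alert = EmptyOperator(task_id="E_alert_owner")

    # The >> operator declares the directed edges: each task may only run
    # after the previous one has finished, and no cycles are allowed.
    download >> process >> monitor >> insights >> alert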
DAG Properties
- DAGs are defined in Python files inside the Airflow DAG folder.
- Transitive closure and transitive reduction are well defined on DAGs; in particular, a DAG has a unique transitive reduction.
- Each node is identified by a string ID, which is used to label the task and look up its results.
- dag_id serves as a unique ID for the DAG.
- start_date defines the date from which the DAG is scheduled to start running.
- default_args is a dictionary of defaults passed as constructor keyword arguments to every operator in the DAG (see the sketch after this list).
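Putting these properties together, a DAG declaration might look like the following sketch (the owner, dates, and schedule are illustrative values only):

from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    "owner": "data_team",                 # illustrative value
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="example_dag",                 # unique ID for the DAG
    start_date=datetime(2022, 1, 1),      # first date the DAG is eligible to run
    schedule_interval="@daily",           # how often a new DAGRun is created
    default_args=default_args,            # keyword defaults applied to every operator
)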
Recommended Reading: Learn More About DAG
The Use of a Directed Acyclic Graph:
DAGs are not unique to Airflow; compilers, for example, use them to represent basic blocks of code:
- The DAG identifies subexpressions that are computed more than once (common subexpressions).
- It determines which names are used inside the block and which values are computed outside of it.
- It records the inputs and outputs of each arithmetic operation performed in the code, allowing the compiler to eliminate common subexpressions efficiently.
Operator
While a DAG describes how to run a workflow, an operator determines what actually gets done by a single task. Operators generally run independently and do not require resources from other operators.
Operator Properties
- It defines the nature of the task and how it should be executed.
- When an operator is instantiated, the task becomes a node in the DAG.
- It can automatically retry a task in case of failure, according to the retry policy you configure.
Operators play a crucial role in Airflow, and they come in a variety of types. The following are some of the most commonly used operators:
- BashOperator - Executes a bash command
- PythonOperator - Calls an arbitrary Python function
- EmailOperator - sends an email
- MySqlOperator, SqliteOperator, PostgresOperator - Runs a SQL command
- SimpleHttpOperator - sends an HTTP request
And many more; you can view the complete list of Airflow operators in the documentation.
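As a rough illustration of how a few of these operators are instantiated inside a DAG (Airflow 2.x import paths; the callable, command, and email address are made up for this sketch):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.operators.email import EmailOperator

def _extract():
    print("extracting data")  # stand-in for real extraction logic

with DAG(dag_id="operator_sketch", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    run_script = BashOperator(task_id="run_script", bash_command="echo 'hello'")
    extract = PythonOperator(task_id="extract", python_callable=_extract)
    notify = EmailOperator(
        task_id="notify",
        to="owner@example.com",
        subject="Pipeline finished",
        html_content="All tasks completed.",
    )
    run_script >> extract >> notify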
Types of Operators
- Action operators perform a specific action (EmailOperator, BashOperator, etc.).
- Transfer operators move data from one system to another.
- Sensor operators wait for data to arrive at a defined path or location (a small sensor sketch follows this list).
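A small sensor sketch, assuming Airflow 2.x; the file path is chosen purely for illustration, and the task would sit inside a DAG definition:

from airflow.sensors.filesystem import FileSensor

# Pokes every 60 seconds until the file appears at the given path,
# then lets downstream tasks run.
wait_for_file = FileSensor(
    task_id="wait_for_input_file",
    filepath="/data/incoming/report.csv",
    poke_interval=60,
)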
Now, let's take a look at how to create a DAG in Airflow.
Recommended Reading: 101 Guide on Apache Airflow Operators
Creating an Apache Airflow DAG
Installing Airflow
pip install apache-airflow
If you want to install additional subpackages that provide certain features, say the Microsoft Azure subpackage, use the following command:
pip install 'apache-airflow[azure]'
View Installation Guide to learn more.
Creating a DAG
Now it's time to write the DAG code and save it. Make sure to put it in the DAG folder; a minimal example is sketched below.
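Here is a rough sketch of what such a file might contain; the dag_id, schedule, and tasks are placeholders rather than anything prescribed by Airflow:

# dagsfilename.py - saved inside the dags folder
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="my_first_dag",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,  # do not backfill runs for past dates
) as dag:
    download = BashOperator(task_id="download", bash_command="echo 'downloading data'")
    process = BashOperator(task_id="process", bash_command="echo 'processing data'")

    download >> process  # process runs only after download succeeds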
The DAG code is stored in the dags subdirectory of our AIRFLOW_HOME directory, and dagsfilename.py is the file's name.
………….
├── airflow-webserver.out
├── airflow-webserver.pid
├── airflow.cfg
├── airflow.db
├── airflow.db-journal
├── dags
│   ├── __pycache__
│   └── dagsfilename.py
Running the DAG
To execute the DAG, use the following command to start the Airflow scheduler:
airflow scheduler
With the scheduler running (and the web server, which you can start with airflow webserver), you will see your DAG listed by name in the Airflow UI.
Now you can flip the DAG's on/off toggle to unpause it and trigger a run.
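If you prefer the command line, the same can be done with the Airflow 2.x CLI (replace my_first_dag, the placeholder dag_id from the sketch above, with your own):

airflow dags unpause my_first_dag
airflow dags trigger my_first_dag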
We have successfully created our first DAG; from here, you can play with a few other configurations and learn more about DAGs.
Benefits of Using Apache Airflow
The following are some of the benefits of Airflow over other MLOps tools.
Interface
Airflow makes it simple to troubleshoot jobs in production by providing quick access to the logs of each task through its web UI. Its easy-to-use monitoring and management interface lets users get a rapid overview of the status of the various tasks and trigger or clear task and DAG runs.
Dynamic Pipeline Generation
Because Airflow pipelines are configuration-as-code (Python), they enable dynamic pipeline generation. This makes it possible to write code that creates pipeline instances on the fly. From concept to model, you can productionize data pipelines with Apache Airflow.
Recommended Reading: How to Automate Data Pipelines with Apache Airflow?
Queries can be automated
There are a variety of operators set up in Airflow to run code. Airflow includes operators for most databases, and because it is written in Python, it features a PythonOperator that lets you easily deploy Python code; a database example is sketched below.
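For instance, assuming the Postgres provider package is installed and a connection with the ID my_postgres has been configured (both are assumptions made for this sketch), a recurring query could be declared like this:

from airflow.providers.postgres.operators.postgres import PostgresOperator

# Placeholder connection ID, table, and SQL; {{ ds }} is rendered to the run date.
refresh_summary = PostgresOperator(
    task_id="refresh_daily_summary",
    postgres_conn_id="my_postgres",
    sql="DELETE FROM daily_summary WHERE run_date = '{{ ds }}';",
)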
Retry policy built-in
It features a built-in, customizable auto-retry policy. For example, you can set the following parameters on a task (sketched in the example after this list):
- retries: the number of times a task will be retried before it is marked as failed.
- retry_delay: the delay between retries (a timedelta).
- retry_exponential_backoff: enables exponential backoff, progressively increasing the wait between retries.
- max_retry_delay: the maximum allowed delay between retries (a timedelta).
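Inside a DAG, a task with a customized retry policy might look like this sketch (all values are illustrative):

from datetime import timedelta
from airflow.operators.bash import BashOperator

flaky_task = BashOperator(
    task_id="flaky_task",
    bash_command="exit 1",                   # always fails, so the retries kick in
    retries=3,                               # try up to 3 more times
    retry_delay=timedelta(minutes=2),        # wait 2 minutes between attempts
    retry_exponential_backoff=True,          # increase the wait after each failure
    max_retry_delay=timedelta(minutes=30),   # but never wait longer than 30 minutes
)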
Workflow management
Airflow lets you build workflows programmatically: task instances can be produced on the fly inside a DAG, and SubDAGs and XCom allow the creation of complex dynamic workflows.
For example, dynamic DAGs can be built based on variables or connections defined in the Airflow UI, as sketched below.
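As a rough sketch of what that looks like in practice, the loop below creates one task per table name; the hard-coded list stands in for a variable or connection that would normally be defined in the UI:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

TABLES = ["users", "orders", "payments"]  # stand-in for an Airflow Variable

with DAG(dag_id="dynamic_tasks_sketch", start_date=datetime(2022, 1, 1), schedule_interval="@daily", catchup=False) as dag:
    for table in TABLES:
        # One export task is generated per table each time the file is parsed.
        BashOperator(
            task_id=f"export_{table}",
            bash_command=f"echo 'exporting {table}'",
        )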
Wrapping Up
Apache Airflow is a powerful tool for managing and orchestrating workflows. Whether you need to schedule a data export, carry out nightly malware scans, or run a promotional campaign, Apache Airflow can handle it. It takes on the complex, resource-intensive work of running jobs, and those jobs can be scheduled against different runtimes or demand curves without manual intervention.
Airflow provides a range of features that you might find helpful in your data wrangling workflow. We covered what Airflow is, its benefits, and how DAGs work. I hope you enjoyed the article!