Apache Airflow is a tool for orchestrating workflows: scheduling, running, and monitoring programs across clusters of computers. Because workflows are defined as plain Python code, complex data processing pipelines can be written in minutes. We'll learn about Airflow operators in this post, which you can use to create your own pipelines.
Operators carry out the instructions contained in your workflow definition file (a .py script that defines your DAG). There are several Airflow operators that can help you achieve your goals, but it can be challenging to understand their behavior without a good conceptual understanding of Airflow itself.
What are Apache Airflow Operators?
Apache Airflow is an open-source tool, widely used in data engineering and MLOps, for modeling and running data pipelines. An Airflow operator is a template for a unit of work: each operator you instantiate in your DAG is executed as a task during a DAG run. Because every task consumes scheduler and worker resources, use tasks sparingly and avoid tasks that complete without producing any useful result, since they eat up CPU time and increase delay.
Recommended Reading: How to Automate Data Pipelines with Airflow?
If your DAG is executing steadily, tasks can be an easy way to break a problem into steps. However, you need to know how operators interact and where to use them for best results. In simple terms, when you instantiate operator objects, you generate tasks.
Recommended Reading: What are DAGs?
If you want some data processed but don't need the results right away (its output feeds a later analysis or workflow step), then tasks are a good fit.
Properties of Airflow Operators:
- An operator defines the nature of a task and how it should be executed.
- When an operator is instantiated, the task becomes a node in the DAG.
- It can automatically retry in case of failure (via the retries parameter).
Types of Airflow Operators:
Action Operators
- Programs that perform a certain action.
- For example, EmailOperator and BashOperator.
Transfer Operators
- Responsible for moving data from one system to another.
- If you're working with a large dataset, avoid using this type of operator, since the data passes through the worker.
Sensor Operators
- Wait for data to arrive at a defined location.
- They are long-running tasks.
- They are useful for keeping track of external processes like file uploads.
Operators play a crucial role in Airflow. We'll go through a few of the most popular operators later, but first, let's look at the relationship between a task and an operator.
The differences between a task and an operator might be confusing at first. The figure below might help you understand the relation between DAG, Task, and Operator.
Tasks are ideally self-contained and do not rely on information from other tasks. When you instantiate an operator class, the resulting object becomes a task. Generally speaking, operators are templates that, once instantiated, produce the tasks in your DAG.
Defining the DAG
Defining Relations between Tasks
Let's have a look at some of the most popular operators:
Apache Airflow Bash Operator - Executes a bash command
BashOperator in Apache Airflow provides a simple method to run bash commands in your workflow. This is the operator you'll want to use to specify the job if your DAG performs a bash command or script.
Apache Airflow Python Operator - Calls an arbitrary python function
The Airflow PythonOperator provides a basic yet effective operator that lets you run a Python callable function from your DAG.
Apache Airflow Email Operator - Sends an email
EmailOperator is the most straightforward way to send emails from Airflow. With it, you can send task-related emails or build up an alerting system. The biggest drawback is that this operator isn't very customizable.
Apache Airflow PostgresOperator
The PostgresOperator interface defines tasks that interact with a PostgreSQL database. It can be used to create tables, remove records, insert records, and more.
Apache Airflow SSH Operator
SSHOperator is used to execute commands on a given remote host using the ssh_hook.
Apache Airflow Docker Operator
The DockerOperator executes commands inside a Docker container. Docker is a tool for creating and managing "containers": small, isolated environments where you can run your code. With the help of the Airflow DockerOperator, you can also store files in a temporary directory created on the host and mounted into the container.
Apache Airflow HTTP Operator
An HTTP operator performs an action by calling an endpoint on an HTTP system. This is beneficial if you're using an API that returns a big JSON payload and you're only interested in a part of it.
Apache Airflow Snowflake Operator
SnowflakeOperator performs SQL commands on a Snowflake database. It can run statements such as CREATE, INSERT, MERGE, UPDATE, DELETE, and COPY INTO as needed.
Apache Airflow Spark Operators
Apache Spark is a general-purpose cluster computing solution that is quick and scalable. It provides Spark SQL for SQL and structured data processing, MLlib for machine learning, and a lot more. All of the configurations for SparkSqlOperator come from the operator parameters.
Apache Airflow SensorBase Operators
Sensor operators continue to run at a set interval, succeeding when a set of criteria is satisfied and failing if they time out. Do you need to wait for a file? Check whether a SQL record exists? Postpone the execution of your DAG? That is exactly what Airflow sensors are for.
Apache Airflow Bigquery Operators
BigQueryCheckOperator can be used to execute checks against BigQuery.
There are many more operators; check the Apache Airflow documentation for the complete list and to learn more.
Apache Airflow Operators Best Practices
To get the most out of these operators, you must know what they do and when it's appropriate to apply them in your particular use case. In general, Airflow operators fall into two categories: scheduling tasks and data-manipulation tasks. A scheduling operator triggers events based on some time pattern, such as firing after a given interval has elapsed. A data-manipulation operator performs a specific processing step on incoming data sets, such as breaking out tables for better queryability.
- Keep your Airflow DAG code separate from your database model code. If someone else modifies the models your DAGs depend on, Airflow will throw errors at run time, which is confusing and lengthens operation time. Also, be sure to comment your code to explain what each part does; this makes it easier for people unfamiliar with Airflow to figure out how things work.
- Use operators sparingly. Operators can be great and save time, but they can also be very time-consuming and bloated.
- The Airflow scheduler regularly triggers a DAG depending on the start date and schedule interval parameters supplied in the DAG file. Instead of starting a DAG run at the beginning of its schedule interval, the Airflow scheduler starts it at the end, once the interval has elapsed. Consider the following DAG, which runs every day at 9 a.m.:
dag = DAG('dagname', default_args=default_args, schedule_interval='0 9 * * *')
- Prefer operators that do not change existing data, such as stream_tasks_by_updated_time and update_datasets_by_.
- A DAG ID must be specified when creating a DAG object, and it must be unique across all of your DAGs. If you have two DAGs with the same DAG ID, only one of them will appear, and you may see unexpected behavior. It is best to define a suitable DAG ID along with a DAG description. If you reference a DAG by an alias that doesn't match its ID, you will get an error when it runs.
- Give your sensor a timeout parameter at all times. Consider your application and how long you expect the sensor to be waiting before adjusting the sensor's timeout.
- Use poke mode if your poke interval is relatively short; using reschedule mode in that case may cause your scheduler to become overloaded. For long-running sensors, prefer reschedule mode so the sensor doesn't consume a worker slot the whole time it waits. This avoids deadlocks in Airflow, where sensors use up all available worker slots.
Learn how to improve model health with Censius AI
Several MLOps tools are available, but Apache Airflow offers unique advantages, and more businesses are using it to manage their data pipelines. An Airflow operator defines a single unit of work in a pipeline: what the task does, how it runs, and how it fits into the DAG. We looked at what operators are and discussed several types of operators in this article. I hope you enjoyed the article, and stay safe!
PS: Are you struggling to monitor and troubleshoot your ML Model data workflows? Look no further than Censius AI Observability Tool! Our platform provides real-time visibility into your data pipelines, allowing you to quickly identify and address issues before they impact your business.
Try Censius AI Observability Tool today and streamline your model data management.