Airflow is a workflow management platform for building data engineering pipelines. It uses Python to define your data engineering workflows as code, making them more maintainable, versionable, testable, and collaborative.
In this article, I will explain a few important concepts and terms associated with Airflow.
DAGs
DAG stands for directed acyclic graph. In Airflow, a DAG is the data pipeline defined by your code: a series of tasks linked to one another, without any loops, as in the sketch below.
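For instance, a minimal sketch of a two-task DAG might look like this (the DAG id, schedule, and task commands are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal two-task pipeline; the id, schedule, and commands are illustrative.
with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extracting'")
    load = BashOperator(task_id="load", bash_command="echo 'loading'")

    extract >> load  # extract runs first, then load; no loops are allowed
```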
Tasks
Tasks are the individual units of work within a DAG. Upstream and downstream dependencies are set between tasks to express the order in which they should run, for example:
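Given the extract and load tasks from the sketch above, all of the following declare the same dependency:

```python
extract >> load                # bit-shift syntax: load is downstream of extract
load << extract                # the same dependency, read the other way
extract.set_downstream(load)   # equivalent method syntax
```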
Operators
Operators in Airflow contain the logic for a task, and they provide integration with external languages and tools such as Bash, Python, PostgreSQL, and Docker. The three types of operators in Airflow, illustrated in the sketch after this list, are:
- Action Operators:
These operators are responsible for executing a task. For example, a PythonOperator executes a Python function.
- Transfer Operators:
These operators transfer data from a source to a destination. For example, an S3ToRedshiftOperator copies data from S3 into a Redshift table.
- Sensors:
Sensors are used to wait for something to happen before executing the next task. For example, an HttpSensor can check that an API endpoint is available before the next task makes a GET request to it.
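As a rough sketch of all three kinds, assuming the tasks live inside a DAG context like the one above (the connection ids, bucket, table, and endpoint are placeholder values):

```python
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator
from airflow.providers.http.sensors.http import HttpSensor

# Action operator: executes a Python callable.
def greet():
    print("hello from Airflow")

run_greet = PythonOperator(task_id="run_greet", python_callable=greet)

# Transfer operator: copies data from S3 into a Redshift table
# (bucket, key, and connection id below are placeholders).
copy_to_redshift = S3ToRedshiftOperator(
    task_id="copy_to_redshift",
    s3_bucket="my-bucket",
    s3_key="data/users.csv",
    schema="public",
    table="users",
    redshift_conn_id="redshift_default",
)

# Sensor: waits for the API endpoint to respond before downstream tasks run.
wait_for_api = HttpSensor(
    task_id="wait_for_api",
    http_conn_id="my_api",   # placeholder connection id
    endpoint="health",
    poke_interval=30,
)
```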
XComs
XComs (short for "cross-communications") allow tasks in Airflow to pass data between one another; tasks are otherwise completely isolated and might even be running on different machines. Using XComs, we can pass variables between the tasks within a DAG. Note that only small amounts of data should be passed using XComs; they should not be used to transfer data frames or tables.
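A minimal sketch of pushing and pulling a value with the classic API (the task ids and key are illustrative):

```python
from airflow.operators.python import PythonOperator

def push_value(ti):
    # Explicitly push a value under a custom key; returning a value from the
    # callable would also push it, under the default "return_value" key.
    ti.xcom_push(key="row_count", value=42)

def pull_value(ti):
    row_count = ti.xcom_pull(task_ids="push_task", key="row_count")
    print(f"row_count from upstream task: {row_count}")

push_task = PythonOperator(task_id="push_task", python_callable=push_value)
pull_task = PythonOperator(task_id="pull_task", python_callable=pull_value)
push_task >> pull_task
```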
TaskFlow
TaskFlow is an API introduced in Airflow 2.0 that allows you to write your DAGs as Python functions without using the PythonOperator. TaskFlow takes care of moving inputs and outputs between your tasks using XComs, as well as automatically calculating dependencies, all through the @task decorator.
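A minimal sketch of the same push-and-pull pattern written with TaskFlow (the DAG and task names are illustrative):

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(start_date=datetime(2023, 1, 1), schedule_interval="@daily", catchup=False)
def taskflow_example():
    @task
    def extract():
        # The return value is passed to downstream tasks via XCom automatically.
        return {"row_count": 42}

    @task
    def report(data):
        print(f"extracted {data['row_count']} rows")

    # Calling report on extract's output also sets the dependency automatically.
    report(extract())

taskflow_example()
```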
This covers the basic concepts of Airflow.
To get weekly updates about my learning journey in the data science and data engineering fields, subscribe to my newsletter at https://calvinhobbes.substack.com/