Mounting DAGs
DAGs can be mounted by using a ConfigMap, git-sync, or - on Airflow 3.x - DAG bundles.
This is best illustrated with an example of each, shown in the sections below.
Via ConfigMap
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cm-dag (1)
data:
  test_airflow_dag.py: | (2)
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.providers.standard.operators.bash import BashOperator
    from airflow.providers.standard.operators.empty import EmptyOperator

    with DAG(
        dag_id='test_airflow_dag',
        schedule='0 0 * * *',
        start_date=datetime(2021, 1, 1),
        catchup=False,
        dagrun_timeout=timedelta(minutes=60),
        tags=['example', 'example2'],
        params={"example_key": "example_value"},
    ) as dag:
        run_this_last = EmptyOperator(
            task_id='run_this_last',
        )

        # [START howto_operator_bash]
        run_this = BashOperator(
            task_id='run_after_loop',
            bash_command='echo 1',
        )
        # [END howto_operator_bash]
        run_this >> run_this_last

        for i in range(3):
            task = BashOperator(
                task_id='runme_' + str(i),
                bash_command='echo "{{ task_instance_key_str }}" && sleep 1',
            )
            task >> run_this

        # [START howto_operator_bash_template]
        also_run_this = BashOperator(
            task_id='also_run_this',
            bash_command='echo "run_id={{ run_id }} | dag_run={{ dag_run }}"',
        )
        # [END howto_operator_bash_template]
        also_run_this >> run_this_last

        # [START howto_operator_bash_skip]
        this_will_skip = BashOperator(
            task_id='this_will_skip',
            bash_command='echo "hello world"; exit 99;',
            dag=dag,
        )
        # [END howto_operator_bash_skip]
        this_will_skip >> run_this_last

    if __name__ == "__main__":
        dag.test()
---
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow
spec:
  image:
    productVersion: 3.1.6
  clusterConfig:
    loadExamples: false
    exposeConfig: false
    credentialsSecret: simple-airflow-credentials
    volumes:
      - name: cm-dag (3)
        configMap:
          name: cm-dag (4)
    volumeMounts:
      - name: cm-dag (5)
        mountPath: /dags/test_airflow_dag.py (6)
        subPath: test_airflow_dag.py (7)
  webservers:
    roleConfig:
      listenerClass: external-unstable
    roleGroups:
      default:
        envOverrides:
          AIRFLOW__CORE__DAGS_FOLDER: "/dags" (8)
        replicas: 1
  celeryExecutors:
    roleGroups:
      default:
        envOverrides:
          AIRFLOW__CORE__DAGS_FOLDER: "/dags" (8)
        replicas: 2
  schedulers:
    roleGroups:
      default:
        envOverrides:
          AIRFLOW__CORE__DAGS_FOLDER: "/dags" (8)
        replicas: 1
| 1 | The name of the ConfigMap |
| 2 | The name of the DAG (this is a renamed copy of the example_bash_operator.py from the Airflow examples) |
| 3 | The volume backed by the ConfigMap |
| 4 | The name of the ConfigMap referenced by the Airflow cluster |
| 5 | The name of the mounted volume |
| 6 | The path of the mounted resource. Note that this should map to a single DAG. |
| 7 | The resource has to be defined using subPath: this is to prevent the versioning of ConfigMap elements which may cause a conflict with how Airflow propagates DAGs between its components. |
| 8 | If the mount path described above is anything other than the standard location (the default is $AIRFLOW_HOME/dags), then the location should be defined using the relevant environment variable. |
| If a DAG mounted via ConfigMap consists of modularized files, Python uses this as a "root" directory when looking for referenced files. If this is the case, then either the standard DAGs location should be used, or the DAGs folder setting should point at the mount path. |
The advantage of this approach is that DAGs are provided "in-line".
However, handling multiple DAGs this way becomes cumbersome, as each must be mapped individually.
For multiple DAGs, it is easier to expose them via git-sync, as shown below.
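To illustrate why this becomes cumbersome: each additional DAG file in the ConfigMap needs its own subPath mount entry. A sketch with two hypothetical DAG files (dag_one.py and dag_two.py are illustrative names, not part of the example above):

```yaml
# Fragment of clusterConfig: every DAG file must be mapped individually.
volumes:
  - name: cm-dag
    configMap:
      name: cm-dag
volumeMounts:
  - name: cm-dag
    mountPath: /dags/dag_one.py
    subPath: dag_one.py
  - name: cm-dag
    mountPath: /dags/dag_two.py
    subPath: dag_two.py
```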
Via git-sync
git-sync is a command that pulls a git repository into a local directory and is supplied as a sidecar container for use within Kubernetes. The Stackable Airflow images already ship with git-sync included, and the operator takes care of calling the tool and mounting volumes, so that only the repository and synchronization details are required:
---
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow
spec:
  image:
    productVersion: 3.1.6
  clusterConfig:
    loadExamples: false
    exposeConfig: false
    credentialsSecret: test-airflow-credentials (1)
    dagsGitSync: (2)
      - repo: https://github.com/stackabletech/airflow-operator (3)
        branch: "main" (4)
        gitFolder: "tests/templates/kuttl/mount-dags-gitsync/dags" (5)
        depth: 10 (6)
        wait: 20s (7)
        credentials:
          basicAuthSecretName: git-credentials (8)
        gitSyncConf: (9)
          --rev: HEAD (10)
          # --rev: git-sync-tag # N.B. tag must be covered by "depth" (the number of commits to clone)
          # --rev: 39ee3598bd9946a1d958a448c9f7d3774d7a8043 # N.B. commit must be covered by "depth"
        tls:
          verification:
            server:
              caCert:
                secretClass: git-ca-cert (11)
  webservers:
    ...
---
apiVersion: v1
kind: Secret
metadata:
  name: git-credentials (8)
type: Opaque
data:
  user: c3Rh...
  password: Z2l0a...
| 1 | A Secret used for accessing database and admin user details (included here to illustrate where different credential secrets are defined) |
| 2 | The git-sync configuration block, which contains a list of git-sync elements |
| 3 | The repository to clone (required) |
| 4 | The branch name (defaults to main) |
| 5 | The location of the DAG folder, relative to the synced repository root.
It can optionally start with /, though a trailing slash is not recommended.
An empty string ("") or a slash ("/") corresponds to the root folder in Git.
Defaults to "/". |
| 6 | The depth of syncing i.e. the number of commits to clone (defaults to 1) |
| 7 | The synchronisation interval, e.g. 20s or 1h (defaults to "20s") |
| 8 | The name of the Secret used to access the repository if it is not public.
This should include two fields: user and password, where password can be either an account password (not recommended) or a GitHub token |
| 9 | A map of optional git-sync configuration settings; see the git-sync documentation for the full list of options |
| 10 | An example showing how to specify a target revision (the default is HEAD).
The revision can also be a tag or a commit, though this assumes that the target hash is contained within the number of commits specified by depth.
If a tag or commit hash is specified, then git-sync recognizes this and does not perform further cloning.
Git-sync settings can be provided inline, although some of these (--dest, --root) are specified internally in the operator and are ignored if provided by the user.
Git-config settings can also be specified, although a warning is logged if safe.directory is specified as this is defined internally, and should not be defined by the user. |
| 11 | An optional reference to the SecretClass holding the CA certificates used to verify the git server's TLS certificate; the certificate is passed to git-sync via the git config option http.sslCAInfo.
The associated secret must have a key named ca.crt whose value is the PEM-encoded certificate bundle.
If this field is set to webPki: {} or is omitted altogether, then no changes are made to the git-sync invocation and no extra certificate is presented to the backend.
Omitting this field is non-breaking behaviour: it does not set http.sslverify to false, as disabling security checks should be a last resort rather than a default.
Verification can still be disabled explicitly: either by setting tls: verification: none: {} or by passing --git-config: http.sslverify=false as part of the gitSyncConf field. |
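Callout 11 references a SecretClass named git-ca-cert. A minimal sketch of what that SecretClass and its backing Secret could look like, assuming the Stackable secret-operator's k8sSearch backend and an illustrative Secret name (git-ca-cert-secret):

```yaml
---
apiVersion: secrets.stackable.tech/v1alpha1
kind: SecretClass
metadata:
  name: git-ca-cert
spec:
  backend:
    k8sSearch:
      searchNamespace:
        pod: {}  # look for the Secret in the pod's own namespace
---
apiVersion: v1
kind: Secret
metadata:
  name: git-ca-cert-secret  # hypothetical name
  labels:
    secrets.stackable.tech/class: git-ca-cert  # links the Secret to the SecretClass
stringData:
  ca.crt: |  # must be the key name ca.crt, holding a PEM-encoded bundle
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----
```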
---
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow
spec:
  clusterConfig:
    dagsGitSync:
      - repo: ssh://git@github.com/stackable-airflow/dags.git (1)
        credentials:
          sshPrivateKeySecretName: git-sync-ssh (2)
  ...
---
apiVersion: v1
kind: Secret
metadata:
  name: git-sync-ssh (2)
type: Opaque
data:
  key: LS0tL...
  knownHosts: Z2l0a...
| 1 | A repository accessed via SSH |
| 2 | The name of the Secret used to access the repository if it is not public.
This should include two fields: key and knownHosts, both of which can contain multiple entries. |
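The data values in these Secrets are standard Kubernetes base64-encoded strings. A small sketch of producing them (field names taken from the examples above; the values are placeholders):

```python
import base64


def to_secret_data(fields):
    """Encode plain strings the way Kubernetes expects them under .data."""
    return {k: base64.b64encode(v.encode()).decode() for k, v in fields.items()}


# Placeholder values; a real Secret would hold actual credentials.
print(to_secret_data({"user": "stackable", "password": "git-token"}))
```

Alternatively, `stringData` can be used in the manifest to let the API server do the encoding.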
| git-sync can be used with DAGs that make use of Python modules, as Python is configured to use the git-sync target folder as the "root" location when looking for referenced files. See the Applying Custom Resources example for more details. |
Via DAG bundles (Airflow 3.x)
DAG bundles are an Airflow 3.x feature that natively supports loading DAGs from multiple sources - including multiple Git repositories - without requiring a git-sync sidecar. This is particularly useful when DAGs are maintained in separate repositories by different teams.
The Stackable Airflow operator does not have first-class CRD support for DAG bundles, but they can be configured using envOverrides to set the AIRFLOW__DAG_PROCESSOR__DAG_BUNDLE_CONFIG_LIST environment variable.
No changes to the Stackable Airflow image are required: the apache-airflow-providers-git package and the git binary are both included in the standard image.
When to use DAG bundles instead of git-sync
Use git-sync (via dagsGitSync) when:
- DAGs come from a single repository.
- You need per-repository TLS/CA certificate configuration.
Use DAG bundles when:
- DAGs come from multiple repositories and must all be visible to Airflow.
- You want DAG versioning (each DAG run is pinned to the Git commit at the time it was created).
Prerequisites
- Airflow 3.x (the dag_bundle_config_list setting does not exist in Airflow 2.x).
- Each GitDagBundle requires an Airflow Git connection, even for public repositories. For public repos, the connection only needs a host (the repository URL) and no credentials. Connections can be created via the Airflow UI, CLI, a secrets backend, or, as shown in the example below, via AIRFLOW_CONN_* environment variables. The operator does not manage Airflow connections.
Example
The following example configures two DAG bundles, each pulling from a public Git repository.
The Airflow connections are defined as AIRFLOW_CONN_* environment variables alongside the bundle configuration.
| This example points both bundles at the same repository and subdirectory for illustrative purposes. In practice, each bundle should reference a different repository (or at least a different subdirectory) with distinct DAG files. Airflow requires that DAG IDs are unique across the entire deployment: if two bundles define the same DAG ID, the last one parsed silently overwrites the other - with no error or warning - and the DAG may flip-flop between bundles on each parse cycle. |
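One way to catch such collisions before deploying is a simple pre-flight scan over the DAG folders of each bundle. This sketch does a textual search for dag_id assignments (an assumption of this sketch; it is not a full Airflow DAG parse, so dynamically generated DAG IDs are not detected):

```python
import re
from collections import Counter
from pathlib import Path

# Matches literal assignments such as dag_id='example' or dag_id="example".
DAG_ID_RE = re.compile(r"""dag_id\s*=\s*['"]([\w.-]+)['"]""")


def find_duplicate_dag_ids(bundle_dirs):
    """Return DAG ids that appear in more than one Python file across bundles."""
    counts = Counter()
    for directory in bundle_dirs:
        for path in Path(directory).rglob("*.py"):
            # Count each id at most once per file, so a file referencing its
            # own dag_id twice does not produce a false positive.
            counts.update(set(DAG_ID_RE.findall(path.read_text())))
    return sorted(dag_id for dag_id, n in counts.items() if n > 1)
```

Running this against local checkouts of the bundle repositories before updating the cluster avoids the silent-overwrite behaviour described above.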
| The envOverrides are set at the role level (not the role group level) in all cases, so that they apply to all role groups within that role. |
---
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow-dag-bundles
spec:
  image:
    productVersion: 3.1.6
  clusterConfig:
    loadExamples: false
    exposeConfig: false
    credentialsSecret: airflow-credentials (1)
    # dagsGitSync is intentionally not configured: DAG bundles replace git-sync (2)
  webservers:
    envOverrides: &bundleEnvOverrides (3)
      AIRFLOW_CONN_REPO1: >- (4)
        {"conn_type": "git", "host": "https://github.com/apache/airflow.git"}
      AIRFLOW_CONN_REPO2: >-
        {"conn_type": "git", "host": "https://github.com/apache/airflow.git"}
      AIRFLOW__DAG_PROCESSOR__DAG_BUNDLE_CONFIG_LIST: >- (5)
        [
          {
            "name": "repo1",
            "classpath": "airflow.providers.git.bundles.git.GitDagBundle",
            "kwargs": {
              "git_conn_id": "repo1",
              "tracking_ref": "3.1.6",
              "subdir": "airflow-core/src/airflow/example_dags"
            }
          },
          {
            "name": "repo2",
            "classpath": "airflow.providers.git.bundles.git.GitDagBundle",
            "kwargs": {
              "git_conn_id": "repo2",
              "tracking_ref": "3.1.6",
              "subdir": "airflow-core/src/airflow/example_dags"
            }
          }
        ]
    roleGroups:
      default:
        replicas: 1
  schedulers:
    envOverrides: *bundleEnvOverrides (6)
    roleGroups:
      default:
        replicas: 1
  dagProcessors:
    envOverrides: *bundleEnvOverrides
    roleGroups:
      default:
        replicas: 1
  kubernetesExecutors:
    envOverrides: *bundleEnvOverrides
  triggerers:
    envOverrides: *bundleEnvOverrides
    roleGroups:
      default:
        replicas: 1
| 1 | The credentials Secret for database and admin user access (same as any other Airflow cluster). |
| 2 | dagsGitSync is intentionally not configured.
DAG bundles replace the git-sync sidecar entirely. |
| 3 | A YAML anchor is used to define the environment variables once at the role level and reuse them across all roles. |
| 4 | Each bundle requires an Airflow Git connection.
Connections are defined as AIRFLOW_CONN_<CONN_ID> environment variables with a JSON value containing conn_type and host (the repository URL).
For private repositories, add login (username) and password (access token) fields for HTTPS auth, or key_file / private_key in the extra dict for SSH auth.
The connection ID in the env var name must be uppercase (e.g. AIRFLOW_CONN_REPO1), while the git_conn_id in the bundle config uses the lowercase form (repo1). |
| 5 | The AIRFLOW__DAG_PROCESSOR__DAG_BUNDLE_CONFIG_LIST environment variable is a JSON list of bundle definitions.
Each entry specifies a name, the classpath of the bundle backend, and kwargs passed to the bundle constructor.
For GitDagBundle, the key kwargs are git_conn_id (referencing an Airflow connection), tracking_ref (branch or tag), and subdir (subdirectory within the repository containing DAGs). |
| 6 | The YAML anchor is referenced on all other roles so that every Airflow component sees the same bundle configuration. |
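For a private repository, the connection JSON described in callout 4 can be assembled programmatically before being placed in an AIRFLOW_CONN_* variable. A sketch, where the repository URL, username, and token are placeholders (not values from the example above):

```python
import json


def git_conn_json(host, login=None, password=None):
    """Build the JSON value for an AIRFLOW_CONN_* environment variable.

    Field names (conn_type, host, login, password) follow Airflow's JSON
    connection format; login/password are only needed for private repos.
    """
    conn = {"conn_type": "git", "host": host}
    if login:
        conn["login"] = login
    if password:
        conn["password"] = password
    return json.dumps(conn)


# Hypothetical private repository with HTTPS token auth.
print(git_conn_json("https://github.com/example-org/private-dags.git",
                    login="example-user", password="ghp_exampletoken"))
```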
Airflow Git connection reference
When using GitDagBundle with private repositories, credentials are configured via an Airflow Git connection.
The table below shows which capabilities of the operator’s dagsGitSync fields have equivalents in a Git connection or GitDagBundle.
The connection field names (login, password, extra) refer to the JSON field names used in AIRFLOW_CONN_* environment variables.
The GitDagBundle kwargs are documented in the git provider bundles reference.
| dagsGitSync field | Git connection / GitDagBundle equivalent | Parity |
|---|---|---|
| repo | Connection host | Full |
| branch | GitDagBundle tracking_ref kwarg | Full |
| gitFolder | GitDagBundle subdir kwarg | Full |
| wait | Bundle refresh interval (Airflow dag_processor configuration) | Full |
| credentials.basicAuthSecretName | Connection username and access token fields (JSON keys login and password) | Full - but the user must create the Airflow connection rather than referencing a Kubernetes Secret directly. |
| credentials.sshPrivateKeySecretName | Connection extra (SSH private key) | Full |
| knownHosts (credentials Secret key) | Connection extra | Full |
| depth | No equivalent. | None |
| gitSyncConf | No equivalent. There is no pass-through mechanism for arbitrary git options. | None |
| tls.verification: none | Not supported by the Git provider. Workaround: set the relevant Git environment variable at pod level. | None |
| tls.verification: webPki | Implicit - Git uses the operating system's CA trust store by default. | Implicit |
| tls.verification caCert secretClass | Not supported by the Git provider. Workaround: mount the CA certificate and set the GIT_SSL_CAINFO environment variable at pod level. | None (global workaround only) |
The Git connection also supports several extra keys not available in dagsGitSync:
| Connection extra key | Description |
|---|---|
| | Passphrase for encrypted SSH private keys. |
| | Path to a custom SSH configuration file. |
| | SSH |
| | Non-default SSH port (set via |
GitDagBundle itself also accepts two additional kwargs (see the bundles reference):
| Kwarg | Default | Description |
|---|---|---|
| | | Initialise and update Git submodules recursively. |
| | | Remove the |
Limitations
- No per-repository TLS/CA certificates. The Airflow Git provider does not support custom CA certificates per connection. The only workaround is setting GIT_SSL_CAINFO as a pod-level environment variable, which applies to all repositories.
- No clone depth control. GitDagBundle always performs a full clone. For large repositories this may increase pod startup time, particularly with the Kubernetes executor, where each short-lived worker pod clones independently.
- No gitSyncConf equivalent. There is no mechanism to pass arbitrary git or git-sync options through to the bundle.
- Triggerer limitation. The Airflow triggerer does not initialise DAG bundles. Custom trigger classes cannot be loaded from a bundle and must be installed as Python packages in the image.
- Static configuration. Bundle definitions are read from configuration at process startup. Adding or removing a bundle requires updating the envOverrides and restarting the affected pods.