Mounting DAGs
DAGs can be mounted by using a ConfigMap, git-sync, or - on Airflow 3.x - DAG bundles.
This is best illustrated with an example of each, shown in the sections below.
Via ConfigMap
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cm-dag (1)
data:
  test_airflow_dag.py: | (2)
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.providers.standard.operators.bash import BashOperator
    from airflow.providers.standard.operators.empty import EmptyOperator

    with DAG(
        dag_id='test_airflow_dag',
        schedule='0 0 * * *',
        start_date=datetime(2021, 1, 1),
        catchup=False,
        dagrun_timeout=timedelta(minutes=60),
        tags=['example', 'example2'],
        params={"example_key": "example_value"},
    ) as dag:
        run_this_last = EmptyOperator(
            task_id='run_this_last',
        )

        # [START howto_operator_bash]
        run_this = BashOperator(
            task_id='run_after_loop',
            bash_command='echo 1',
        )
        # [END howto_operator_bash]
        run_this >> run_this_last

        for i in range(3):
            task = BashOperator(
                task_id='runme_' + str(i),
                bash_command='echo "{{ task_instance_key_str }}" && sleep 1',
            )
            task >> run_this

        # [START howto_operator_bash_template]
        also_run_this = BashOperator(
            task_id='also_run_this',
            bash_command='echo "run_id={{ run_id }} | dag_run={{ dag_run }}"',
        )
        # [END howto_operator_bash_template]
        also_run_this >> run_this_last

        # [START howto_operator_bash_skip]
        this_will_skip = BashOperator(
            task_id='this_will_skip',
            bash_command='echo "hello world"; exit 99;',
            dag=dag,
        )
        # [END howto_operator_bash_skip]
        this_will_skip >> run_this_last

    if __name__ == "__main__":
        dag.test()
---
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow
spec:
  image:
    productVersion: 3.1.6
  clusterConfig:
    loadExamples: false
    exposeConfig: false
    credentialsSecret: simple-airflow-credentials
    volumes:
      - name: cm-dag (3)
        configMap:
          name: cm-dag (4)
    volumeMounts:
      - name: cm-dag (5)
        mountPath: /dags/test_airflow_dag.py (6)
        subPath: test_airflow_dag.py (7)
  webservers:
    roleConfig:
      listenerClass: external-unstable
    roleGroups:
      default:
        envOverrides:
          AIRFLOW__CORE__DAGS_FOLDER: "/dags" (8)
        replicas: 1
  celeryExecutors:
    roleGroups:
      default:
        envOverrides:
          AIRFLOW__CORE__DAGS_FOLDER: "/dags" (8)
        replicas: 2
  schedulers:
    roleGroups:
      default:
        envOverrides:
          AIRFLOW__CORE__DAGS_FOLDER: "/dags" (8)
        replicas: 1
| 1 | The name of the ConfigMap |
| 2 | The name of the DAG (this is a renamed copy of the example_bash_operator.py from the Airflow examples) |
| 3 | The volume backed by the ConfigMap |
| 4 | The name of the ConfigMap referenced by the Airflow cluster |
| 5 | The name of the mounted volume |
| 6 | The path of the mounted resource. Note that this should map to a single DAG. |
| 7 | The resource has to be defined using subPath: this is to prevent the versioning of ConfigMap elements which may cause a conflict with how Airflow propagates DAGs between its components. |
| 8 | If the mount path described above is anything other than the standard location (the default is $AIRFLOW_HOME/dags), then the location should be defined using the relevant environment variable. |
| If a DAG mounted via ConfigMap consists of modularized files, Python uses this as a "root" directory when looking for referenced files. If this is the case, then either the standard DAGs location should be used, or the DAGs folder setting should point at the mount path. |
The advantage of this approach is that DAGs are provided "in-line".
However, handling multiple DAGs this way becomes cumbersome, as each must be mapped individually.
For multiple DAGs, it is easier to expose them via git-sync, as shown below.
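To illustrate why this becomes cumbersome: each additional DAG file in the ConfigMap needs its own subPath mount entry. A sketch with two hypothetical DAG files (dag_one.py and dag_two.py are illustrative names, not part of the example above):

```yaml
# Fragment of clusterConfig: every DAG file must be mapped individually.
volumes:
  - name: cm-dag
    configMap:
      name: cm-dag
volumeMounts:
  - name: cm-dag
    mountPath: /dags/dag_one.py
    subPath: dag_one.py
  - name: cm-dag
    mountPath: /dags/dag_two.py
    subPath: dag_two.py
```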
Via git-sync
git-sync is a command that pulls a git repository into a local directory and is supplied as a sidecar container for use within Kubernetes. The Stackable Airflow images already ship with git-sync included, and the operator takes care of calling the tool and mounting volumes, so that only the repository and synchronization details are required:
---
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow
spec:
  image:
    productVersion: 3.1.6
  clusterConfig:
    loadExamples: false
    exposeConfig: false
    credentialsSecret: test-airflow-credentials (1)
    dagsGitSync: (2)
      - repo: https://github.com/stackabletech/airflow-operator (3)
        branch: "main" (4)
        gitFolder: "tests/templates/kuttl/mount-dags-gitsync/dags" (5)
        depth: 10 (6)
        wait: 20s (7)
        credentials:
          basicAuthSecretName: git-credentials (8)
        gitSyncConf: (9)
          --rev: HEAD (10)
          # --rev: git-sync-tag # N.B. tag must be covered by "depth" (the number of commits to clone)
          # --rev: 39ee3598bd9946a1d958a448c9f7d3774d7a8043 # N.B. commit must be covered by "depth"
        tls:
          verification:
            server:
              caCert:
                secretClass: git-ca-cert (11)
  webservers:
    ...
---
apiVersion: v1
kind: Secret
metadata:
  name: git-credentials (8)
type: Opaque
data:
  user: c3Rh...
  password: Z2l0a...
| 1 | A Secret used for accessing database and admin user details (included here to illustrate where different credential secrets are defined) |
| 2 | The git-sync configuration block, which contains a list of git-sync elements |
| 3 | The repository to clone (required) |
| 4 | The branch name (defaults to main) |
| 5 | The location of the DAG folder, relative to the synced repository root.
It can optionally start with /, though a trailing slash is not recommended.
An empty string ("") or a slash ("/") corresponds to the root folder in Git.
Defaults to "/". |
| 6 | The depth of syncing i.e. the number of commits to clone (defaults to 1) |
| 7 | The synchronisation interval, e.g. 20s or 1h (defaults to "20s") |
| 8 | The name of the Secret used to access the repository if it is not public.
This should include two fields: user and password, where password can be either an account password (not recommended) or a GitHub token |
| 9 | A map of optional git-sync configuration settings; see the git-sync documentation for the full list of options |
| 10 | An example showing how to specify a target revision (the default is HEAD).
The revision can also be a tag or a commit, though this assumes that the target hash is contained within the number of commits specified by depth.
If a tag or commit hash is specified, then git-sync recognizes this and does not perform further cloning.
Git-sync settings can be provided inline, although some of these (--dest, --root) are specified internally in the operator and are ignored if provided by the user.
Git-config settings can also be specified, although a warning is logged if safe.directory is specified as this is defined internally, and should not be defined by the user. |
| 11 | An optional reference to the SecretClass holding the CA certificates used to verify the git server's TLS certificate; the certificate is passed to git-sync via the git config option http.sslCAInfo.
The associated secret must have a key named ca.crt whose value is the PEM-encoded certificate bundle.
If this field is set to webPki: {} or is omitted altogether, then no changes are made to the git-sync invocation and no extra certificate is presented to the backend.
Omitting this field is non-breaking behaviour: it does not set http.sslverify to false, as disabling security checks should be a last resort rather than a default.
Verification can still be disabled explicitly: either by setting tls: verification: none: {} or by passing --git-config: http.sslverify=false as part of the gitSyncConf field. |
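Callout 11 references a SecretClass named git-ca-cert. A minimal sketch of what that SecretClass and its backing Secret could look like, assuming the Stackable secret-operator's k8sSearch backend and an illustrative Secret name (git-ca-cert-secret):

```yaml
---
apiVersion: secrets.stackable.tech/v1alpha1
kind: SecretClass
metadata:
  name: git-ca-cert
spec:
  backend:
    k8sSearch:
      searchNamespace:
        pod: {}  # look for the Secret in the pod's own namespace
---
apiVersion: v1
kind: Secret
metadata:
  name: git-ca-cert-secret  # hypothetical name
  labels:
    secrets.stackable.tech/class: git-ca-cert  # links the Secret to the SecretClass
stringData:
  ca.crt: |  # must be the key name ca.crt, holding a PEM-encoded bundle
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----
```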
---
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow
spec:
  clusterConfig:
    dagsGitSync:
      - repo: ssh://git@github.com/stackable-airflow/dags.git (1)
        credentials:
          sshPrivateKeySecretName: git-sync-ssh (2)
  ...
---
apiVersion: v1
kind: Secret
metadata:
  name: git-sync-ssh (2)
type: Opaque
data:
  key: LS0tL...
  knownHosts: Z2l0a...
| 1 | A repository accessed via SSH |
| 2 | The name of the Secret used to access the repository if it is not public.
This should include two fields: key and knownHosts, both of which can contain multiple entries. |
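The data values in these Secrets are standard Kubernetes base64-encoded strings. A small sketch of producing them (field names taken from the examples above; the values are placeholders):

```python
import base64


def to_secret_data(fields):
    """Encode plain strings the way Kubernetes expects them under .data."""
    return {k: base64.b64encode(v.encode()).decode() for k, v in fields.items()}


# Placeholder values; a real Secret would hold actual credentials.
print(to_secret_data({"user": "stackable", "password": "git-token"}))
```

Alternatively, `stringData` can be used in the manifest to let the API server do the encoding.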
| git-sync can be used with DAGs that make use of Python modules, as Python is configured to use the git-sync target folder as the "root" location when looking for referenced files. See the Applying Custom Resources example for more details. |
Via DAG bundles (Airflow 3.x)
DAG bundles are an Airflow 3.x feature that natively supports loading DAGs from multiple sources - including multiple Git repositories - without requiring a git-sync sidecar. This is particularly useful when DAGs are maintained in separate repositories by different teams.
The Stackable Airflow operator does not have first-class CRD support for DAG bundles, but they can be configured using envOverrides to set the AIRFLOW__DAG_PROCESSOR__DAG_BUNDLE_CONFIG_LIST environment variable.
No changes to the Stackable Airflow image are required: the apache-airflow-providers-git package and the git binary are both included in the standard image.
When to use DAG bundles instead of git-sync
Use git-sync (via dagsGitSync) when:
- DAGs come from a single repository.
- You need per-repository TLS/CA certificate configuration.
Use DAG bundles when:
- DAGs come from multiple repositories and must all be visible to Airflow.
- You want DAG versioning (each DAG run is pinned to the Git commit at the time it was created).
Prerequisites
- Airflow 3.x (the dag_bundle_config_list setting does not exist in Airflow 2.x).
- Each GitDagBundle requires an Airflow Git connection, even for public repositories. For public repos, the connection only needs a host (the repository URL) and no credentials. Connections can be created via the Airflow UI, CLI, a secrets backend, or, as shown in the example below, via AIRFLOW_CONN_* environment variables. The operator does not manage Airflow connections.
Example
The following example configures two DAG bundles, each pulling from a public Git repository.
The Airflow connections are defined as AIRFLOW_CONN_* environment variables alongside the bundle configuration.
| This example points both bundles at the same repository and subdirectory for illustrative purposes. In practice, each bundle should reference a different repository (or at least a different subdirectory) with distinct DAG files. Airflow requires that DAG IDs are unique across the entire deployment: if two bundles define the same DAG ID, the last one parsed silently overwrites the other - with no error or warning - and the DAG may flip-flop between bundles on each parse cycle. |
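One way to catch such collisions before deploying is a simple pre-flight scan over the DAG folders of each bundle. This sketch does a textual search for dag_id assignments (an assumption of this sketch; it is not a full Airflow DAG parse, so dynamically generated DAG IDs are not detected):

```python
import re
from collections import Counter
from pathlib import Path

# Matches literal assignments such as dag_id='example' or dag_id="example".
DAG_ID_RE = re.compile(r"""dag_id\s*=\s*['"]([\w.-]+)['"]""")


def find_duplicate_dag_ids(bundle_dirs):
    """Return DAG ids that appear in more than one Python file across bundles."""
    counts = Counter()
    for directory in bundle_dirs:
        for path in Path(directory).rglob("*.py"):
            # Count each id at most once per file, so a file referencing its
            # own dag_id twice does not produce a false positive.
            counts.update(set(DAG_ID_RE.findall(path.read_text())))
    return sorted(dag_id for dag_id, n in counts.items() if n > 1)
```

Running this against local checkouts of the bundle repositories before updating the cluster avoids the silent-overwrite behaviour described above.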
| The envOverrides are set at the role level (not the role group level) in all cases, so that they apply to all role groups within that role. |
---
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow-dag-bundles
spec:
  image:
    productVersion: 3.1.6
  clusterConfig:
    loadExamples: false
    exposeConfig: false
    credentialsSecret: airflow-credentials (1)
    # dagsGitSync is intentionally not configured: DAG bundles replace git-sync (2)
  webservers:
    envOverrides: &bundleEnvOverrides (3)
      AIRFLOW_CONN_REPO1: >- (4)
        {"conn_type": "git", "host": "https://github.com/apache/airflow.git"}
      AIRFLOW_CONN_REPO2: >-
        {"conn_type": "git", "host": "https://github.com/apache/airflow.git"}
      AIRFLOW__DAG_PROCESSOR__DAG_BUNDLE_CONFIG_LIST: >- (5)
        [
          {
            "name": "repo1",
            "classpath": "airflow.providers.git.bundles.git.GitDagBundle",
            "kwargs": {
              "git_conn_id": "repo1",
              "tracking_ref": "3.1.6",
              "subdir": "airflow-core/src/airflow/example_dags"
            }
          },
          {
            "name": "repo2",
            "classpath": "airflow.providers.git.bundles.git.GitDagBundle",
            "kwargs": {
              "git_conn_id": "repo2",
              "tracking_ref": "3.1.6",
              "subdir": "airflow-core/src/airflow/example_dags"
            }
          }
        ]
    roleGroups:
      default:
        replicas: 1
  schedulers:
    envOverrides: *bundleEnvOverrides (6)
    roleGroups:
      default:
        replicas: 1
  dagProcessors:
    envOverrides: *bundleEnvOverrides
    roleGroups:
      default:
        replicas: 1
  kubernetesExecutors:
    envOverrides: *bundleEnvOverrides
  triggerers:
    envOverrides: *bundleEnvOverrides
    roleGroups:
      default:
        replicas: 1
| 1 | The credentials Secret for database and admin user access (same as any other Airflow cluster). |
| 2 | dagsGitSync is intentionally not configured.
DAG bundles replace the git-sync sidecar entirely. |
| 3 | A YAML anchor is used to define the environment variables once at the role level and reuse them across all roles. |
| 4 | Each bundle requires an Airflow Git connection.
Connections are defined as AIRFLOW_CONN_<CONN_ID> environment variables with a JSON value containing conn_type and host (the repository URL).
For private repositories, add login (username) and password (access token) fields for HTTPS auth, or key_file / private_key in the extra dict for SSH auth.
The connection ID in the env var name must be uppercase (e.g. AIRFLOW_CONN_REPO1), while the git_conn_id in the bundle config uses the lowercase form (repo1). |
| 5 | The AIRFLOW__DAG_PROCESSOR__DAG_BUNDLE_CONFIG_LIST environment variable is a JSON list of bundle definitions.
Each entry specifies a name, the classpath of the bundle backend, and kwargs passed to the bundle constructor.
For GitDagBundle, the key kwargs are git_conn_id (referencing an Airflow connection), tracking_ref (branch or tag), and subdir (subdirectory within the repository containing DAGs). |
| 6 | The YAML anchor is referenced on all other roles so that every Airflow component sees the same bundle configuration. |
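For a private repository, the connection JSON described in callout 4 can be assembled programmatically before being placed in an AIRFLOW_CONN_* variable. A sketch, where the repository URL, username, and token are placeholders (not values from the example above):

```python
import json


def git_conn_json(host, login=None, password=None):
    """Build the JSON value for an AIRFLOW_CONN_* environment variable.

    Field names (conn_type, host, login, password) follow Airflow's JSON
    connection format; login/password are only needed for private repos.
    """
    conn = {"conn_type": "git", "host": host}
    if login:
        conn["login"] = login
    if password:
        conn["password"] = password
    return json.dumps(conn)


# Hypothetical private repository with HTTPS token auth.
print(git_conn_json("https://github.com/example-org/private-dags.git",
                    login="example-user", password="ghp_exampletoken"))
```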
Airflow Git connection reference
When using GitDagBundle with private repositories, credentials are configured via an Airflow Git connection.
The table below shows which capabilities of the operator’s dagsGitSync fields have equivalents in a Git connection or GitDagBundle.
The connection field names (login, password, extra) refer to the JSON field names used in AIRFLOW_CONN_* environment variables.
The GitDagBundle kwargs are documented in the git provider bundles reference.
| dagsGitSync field | Git connection / GitDagBundle equivalent | Parity |
|---|---|---|
| repo | Connection host | Full |
| branch | GitDagBundle tracking_ref kwarg | Full |
| gitFolder | GitDagBundle subdir kwarg | Full |
| wait | Bundle refresh interval (Airflow dag_processor configuration) | Full |
| credentials.basicAuthSecretName | Connection username and access token fields (JSON keys login and password) | Full - but the user must create the Airflow connection rather than referencing a Kubernetes Secret directly. |
| credentials.sshPrivateKeySecretName | Connection extra (SSH private key) | Full |
| knownHosts (credentials Secret key) | Connection extra | Full |
| depth | No equivalent. | None |
| gitSyncConf | No equivalent. There is no pass-through mechanism for arbitrary git options. | None |
| tls.verification: none | Not supported by the Git provider. Workaround: set the relevant Git environment variable at pod level. | None |
| tls.verification: webPki | Implicit - Git uses the operating system's CA trust store by default. | Implicit |
| tls.verification caCert secretClass | Not supported by the Git provider. Workaround: mount the CA certificate and set the GIT_SSL_CAINFO environment variable at pod level. | None (global workaround only) |
The Git connection also supports several extra keys not available in dagsGitSync:
| Connection extra key | Description |
|---|---|
| | Passphrase for encrypted SSH private keys. |
| | Path to a custom SSH configuration file. |
| | SSH |
| | Non-default SSH port (set via |
GitDagBundle itself also accepts two additional kwargs (see the bundles reference):
| Kwarg | Default | Description |
|---|---|---|
| | | Initialise and update Git submodules recursively. |
| | | Remove the |
Limitations
- No per-repository TLS/CA certificates. The Airflow Git provider does not support custom CA certificates per connection. The only workaround is setting GIT_SSL_CAINFO as a pod-level environment variable, which applies to all repositories.
- No clone depth control. GitDagBundle always performs a full clone. For large repositories this may increase pod startup time, particularly with the Kubernetes executor, where each short-lived worker pod clones independently.
- No gitSyncConf equivalent. There is no mechanism to pass arbitrary git or git-sync options through to the bundle.
- Triggerer limitation. The Airflow triggerer does not initialise DAG bundles. Custom trigger classes cannot be loaded from a bundle and must be installed as Python packages in the image.
- Static configuration. Bundle definitions are read from configuration at process startup. Adding or removing a bundle requires updating the envOverrides and restarting the affected pods.