I recently created a small Django project to manage a rolling menu for our family meals. It’s not a large project. I’m hosting it on a Digital Ocean droplet (which costs me $4/month and runs more than one project), and I have a free-tier Postgres database with ElephantSQL. Because this is only a personal project, I didn’t want to pay for the database or any backups there (though my experience of ElephantSQL so far has been positive, and for a larger project I may well do this! It’s very easy to spin up a new DB instance with them).

First, I thought I would just use a crontab job to run python manage.py dumpdata ... on a regular basis. This, I’m sure, would have worked just fine. However, it would decouple the backup solution from the repository.
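
For reference, that crontab approach would have looked something like the sketch below. The schedule, paths, and dumpdata flags here are placeholder assumptions; the flags I actually settled on are in the script in Step 1.

# Hypothetical crontab entry: dump the database every night at 03:00.
# Paths and flags are placeholders (note that crontab requires % to be escaped as \%).
0 3 * * * cd /apps/<project> && python manage.py dumpdata --indent 2 > /apps/data_backups/<project>/db_$(date +\%Y\%m\%d).json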

So, then I had another idea: GitLab CI/CD allows you to schedule pipelines. This is free functionality (as long as you don’t want to schedule a lot of stuff!). And yes, it worked!

Step 1 - Create a backup script

First, I wanted to run a little more than just the data dump. Ideally I would clean up old backups and keep some kind of grandfather/father/son system. So I wrote a quick bash script to manage this. It’s overly simplistic, and I may refactor it to use logrotate at some point, but for the moment it does the job.

#!/bin/bash

set -e

PROJECT_DIR=/apps/<project>
VENV_NAME=<project venv name>

TODAY=$(date +"%Y%m%d")
TODAY_TS=${TODAY}_$(date +"%H%M%S")
TODAY_DAY_NO=$(date "+%u")
CURRENT_DAY=${TODAY:6:2}
CURRENT_MONTH=${TODAY:4:2}
CURRENT_YEAR=${TODAY:0:4}
LAST_DAY_OF_MONTH=$(cal "${CURRENT_MONTH} ${CURRENT_YEAR}" | awk 'NF {DAYS = $NF}; END {print DAYS}')

DAILY_BACKUP_FILE="db_daily_${TODAY_TS}.json"
BACKUP_DIR=/apps/data_backups/<project>/

BACKUP_RETENTION_DAILY=7
BACKUP_RETENTION_WEEKLY=$((5 * 7))
BACKUP_RETENTION_MONTHLY=$((6 * 30))

cd $PROJECT_DIR

printf "\n*** Activate virtual env: %s\n" "${VENV_NAME}"
pyenv activate $VENV_NAME

printf "\n*** Running data export\n"
python manage.py dumpdata --exclude auth.permission --exclude contenttypes --exclude admin.LogEntry --exclude sessions --indent 2 > "${BACKUP_DIR}${DAILY_BACKUP_FILE}"

(( TODAY_DAY_NO == 7 )) && {
    printf "\n*** SUNDAY - Copying to weekly backup\n"
    cp "${BACKUP_DIR}${DAILY_BACKUP_FILE}" "${BACKUP_DIR}db_weekly_${TODAY_TS}.json"
}

(( CURRENT_DAY == LAST_DAY_OF_MONTH )) && {
    printf "\n*** END OF MONTH - Copying to monthly backup\n"
    cp "${BACKUP_DIR}${DAILY_BACKUP_FILE}" "${BACKUP_DIR}db_monthly_${TODAY_TS}.json"
}

printf "\n*** Cleaning old backup files\n"
find ${BACKUP_DIR}db_daily_* -mtime +${BACKUP_RETENTION_DAILY} -exec rm {} \;
find ${BACKUP_DIR}db_weekly_* -mtime +${BACKUP_RETENTION_WEEKLY} -exec rm {} \; 2>/dev/null || true
find ${BACKUP_DIR}db_monthly_* -mtime +${BACKUP_RETENTION_MONTHLY} -exec rm {} \; 2>/dev/null || true

set +e

exit 0

A few notes:

  • I am using pyenv to manage my Python versions and virtual environments (yes, my app is called “menu”). You may have a different solution and need to tweak the script accordingly.
  • set -e will cause the script to stop immediately if any command fails.
  • The DO droplet did not come with cal installed, so I needed to sudo apt install ncal to get the last day of the month this way.
  • The retention values are in number of days. This is because I use find with mtime to locate the old files, rather than going off the timestamp in the filename.
  • This script will run in a non-interactive ssh shell, so don’t rely on environment variables being available. You may need to specify them in your script or change your .bashrc a little (other shells work differently). You can test this as shown below.
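
To check the script in a non-interactive shell, you can invoke it over ssh the same way the CI job will later on. The user, host, and path here are placeholders for your own values:

# Run the backup script over a non-interactive ssh session,
# exactly as the scheduled pipeline will do later.
ssh <user>@<server-ip> "cd /apps/<project> && . ./scripts/backup.sh"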

A note on env vars and .bashrc

Because this script will be run when logging into your server via ssh in non-interactive mode, you will not necessarily have access to all the environment variables and other settings contained in your .bashrc. Exactly how this works differs a little across different flavours of Linux. On my Digital Ocean Ubuntu droplet, .bashrc runs, but most of it is ignored because of the following code:

# If not running interactively, don't do anything
case $- in
    *i*) ;;
      *) return;;
esac

Some Linux flavours may not run .bashrc at all. In my case, all I really need is for pyenv to work so I can activate my project environment, and for the environment vars necessary for that project to be available.

To allow this, I moved the pyenv initialisation up to the top of the .bashrc file, above the interactivity check shown above, and loaded in my project settings there too.

# -- PYENV --
export PYENV_ROOT="$HOME/.pyenv"
command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
# Load pyenv-virtualenv automatically
eval "$(pyenv virtualenv-init -)"

# -- LOAD PROJECT ENV VARS --
set -o allexport
source ~/appconfig/.env_project1
source ~/appconfig/.env_project2
set +o allexport

As you can see, I am running more than one project on this server, so I have separate .env files for the environment variables of each project.
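
For illustration, each of these .env files is just a plain list of KEY=value pairs, which allexport then turns into exported environment variables. The variable names below are made up; use whatever your own settings expect:

# ~/appconfig/.env_project1 — example contents (variable names are hypothetical)
SECRET_KEY=some-long-random-string
DATABASE_URL=postgres://user:password@host:5432/dbname
DEBUG=False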

💡 Note that these changes are needed for the user that runs your project. If you have different users for each project, make sure you are configuring the correct user.

Step 2 - Configure SSH

For GitLab to connect to your server at all it needs to authenticate itself. This is handled via an SSH key. GitLab has good documentation on this, so I won’t go into too much detail here, but the basic steps are:

  1. Create an SSH key on your local machine using ssh-keygen.
  2. Add the $SSH_HOST (the IP address of your server), $SSH_PRIVATE_KEY (the key you generated in the previous step, stored base64-encoded; see the sketch after this list), and $SSH_USER (the user that owns the application on your server, and that can do a deploy) environment variables to your repository under Settings > CI/CD > Variables.
  3. Add the public key to the ~/.ssh/authorized_keys file of that user on your server. I called my key file id_ed25519_gitlab_ci.pub for clarity.
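
For the first two steps, something along these lines works. Note that the CI job later decodes the key with base64 -d, so the value you paste into $SSH_PRIVATE_KEY should be the base64-encoded private key (the key file name here is just the one I used; adjust to taste):

# Generate a dedicated key pair for GitLab CI (no passphrase).
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_gitlab_ci -N ""

# Base64-encode the private key and paste the output into the
# $SSH_PRIVATE_KEY variable under Settings > CI/CD > Variables.
# (-w 0 disables line wrapping; this assumes GNU base64.)
base64 -w 0 ~/.ssh/id_ed25519_gitlab_ci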

At this point, GitLab CI/CD has what it needs to connect to your server.

Step 3 - Add a CI backup job

Once GitLab CI/CD can authenticate itself and connect to your server, you can ask it to do so in the configuration. I have added a separate backup stage for clarity.

stages:
  - backup
  - ...

.script_config: &script_config
  image: ubuntu:latest
  before_script:
    - apt-get -yq update
    - apt-get -yqq install ssh
    - install -m 600 -D /dev/null ~/.ssh/id_ed25519
    - echo "$SSH_PRIVATE_KEY" | base64 -d > ~/.ssh/id_ed25519
    - ssh-keyscan -H $SSH_HOST > ~/.ssh/known_hosts
  after_script:
    - rm -rf ~/.ssh

backup:
  <<: *script_config
  stage: backup
  only:
    - schedules
  script:
    - ssh $SSH_USER@$SSH_HOST "cd $WORK_DIR && . ./scripts/backup.sh"

Here, I have factored out the basic connection functionality into .script_config. This is because I use a similar script to deploy my project via CI/CD, and that also needs to connect to the server to do so.

The before_script section of .script_config here is doing the following:

  • Installing ssh.
  • Copying the private key from the GitLab CI/CD variable into the running Ubuntu container.
  • Adding the host to known_hosts.

Likewise, the after_script section cleans this up to ensure no sensitive information is left behind in artefacts.

The backup job itself will only run as part of a scheduled pipeline, and simply runs the script you created in step 1.

💡 Note that the variable $WORK_DIR needs adding to your CI/CD variables, and should correspond to the root directory of your project on the server.
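Most likely this is the same path as PROJECT_DIR in the backup script (/apps/<project> here), with the script itself saved as scripts/backup.sh inside that directory.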

Step 4 - Add the scheduled job

This final step is the simplest. Under CI/CD > Schedules add a new pipeline. You will need to provide the following information:

  • Description: A short description of the task that lets you identify the job later.
  • Interval pattern: There are simple selection options here, or you can define it using the standard pattern for crontab jobs (see the example below).
  • Crontab timezone: The timezone to apply to the interval pattern.
  • Target branch or tag: The branch or tag that should be used for the job. Most likely main.
  • Variables: Any variables that are specifically required for this job.
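
For example, the interval pattern 0 3 * * * would run the backup every day at 03:00 in the selected crontab timezone.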

💡 You could set up the SSH vars here if you like and have them only apply to this pipeline. Because my server is only a staging server, I run tasks on different projects using the same user, so I have my vars at the project level. If you have different users it may make sense to have them here.

Conclusion

And that’s it, you now have a scheduled task that will dump the database to a .json file. Because I run multiple projects on the same server (mostly staging environments for different things), I can then pay Digital Ocean for a single backup of the one server, which gives me backups for all the data, rather than paying for individual database backups for each project.
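
If you ever need to restore one of these dumps, the counterpart to dumpdata is Django’s loaddata command. A rough sketch, assuming a freshly migrated target database and the same pyenv setup as above (file name and virtualenv name are placeholders):

# Restore a dump into a freshly migrated database.
pyenv activate <project venv name>
python manage.py migrate
python manage.py loaddata /apps/data_backups/<project>/db_daily_<timestamp>.json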

This solution is almost certainly not fit for use in a high-security production environment, but for personal projects, staging environments with fake data in them, and so on, it is a good way of backing up multiple projects with a single backup and controlling it all via GitLab configuration.