Scheduling backups with GitLab CI/CD
I recently created a small Django project to manage a rolling menu for our family meals. It’s not a large project. I’m hosting it on a Digital Ocean droplet (for which I pay $4/month, and it runs more than one project), and I have a free-tier Postgres database with ElephantSQL. Because this is only a personal project, I didn’t want to pay for the database or any backups there (though my experience so far of ElephantSQL is positive, and for a larger project I may well do this! It’s very easy to spin up a new DB instance with them).
First, I thought I would just use a crontab job to run python manage.py dumpdata ... on a regular basis. This, I’m sure, would have worked just fine, but it decouples the solution from the repository.
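For reference, this is roughly the kind of crontab entry I had in mind (the schedule, paths and venv name here are placeholders, not my actual setup):

# append a nightly 02:00 dumpdata job to the current user's crontab
( crontab -l 2>/dev/null; echo '0 2 * * * cd /apps/<project> && ~/.pyenv/versions/<venv>/bin/python manage.py dumpdata > /apps/data_backups/<project>/db_$(date +\%Y\%m\%d).json' ) | crontab -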
So, then I had another idea: GitLab CI/CD allows you to schedule pipelines. This is free functionality (as long as you don’t want to schedule a lot of stuff!). And yes, it worked!
Step 1 - Create a backup script
First, I wanted to run a little more than just the data dump. Ideally I would clean up old backups and keep some kind of grandfather/father/son system. So I wrote a quick bash script to manage this. It’s overly simplistic, and I may refactor it to use logrotate at some point, but for the moment, this does the job.
#!/bin/bash
set -e

# --- Configuration ---
PROJECT_DIR=/apps/<project>
VENV_NAME=<project venv name>

# Date components used for naming and rotation
TODAY=$(date +"%Y%m%d")
TODAY_TS=${TODAY}_$(date +"%H%M%S")
TODAY_DAY_NO=$(date "+%u")            # ISO day of week: 1 = Monday ... 7 = Sunday
CURRENT_DAY=${TODAY:6:2}
CURRENT_MONTH=${TODAY:4:2}
CURRENT_YEAR=${TODAY:0:4}
LAST_DAY_OF_MONTH=$(cal "${CURRENT_MONTH}" "${CURRENT_YEAR}" | awk 'NF {DAYS = $NF}; END {print DAYS}')

DAILY_BACKUP_FILE="db_daily_${TODAY_TS}.json"
BACKUP_DIR=/apps/data_backups/<project>/

# Retention periods, in days (used with find -mtime below)
BACKUP_RETENTION_DAILY=7
BACKUP_RETENTION_WEEKLY=$((5 * 7))
BACKUP_RETENTION_MONTHLY=$((6 * 30))

cd "$PROJECT_DIR"

printf "\n*** Activate virtual env: %s\n" "${VENV_NAME}"
pyenv activate "$VENV_NAME"

printf "\n*** Running data export\n"
python manage.py dumpdata --exclude auth.permission --exclude contenttypes --exclude admin.LogEntry --exclude sessions --indent 2 > "${BACKUP_DIR}${DAILY_BACKUP_FILE}"

# On Sundays, keep an extra copy as the weekly backup
(( TODAY_DAY_NO == 7 )) && {
    printf "\n*** SUNDAY - Copying to weekly backup\n"
    cp "${BACKUP_DIR}${DAILY_BACKUP_FILE}" "${BACKUP_DIR}db_weekly_${TODAY_TS}.json"
}

# On the last day of the month, keep an extra copy as the monthly backup
# (force base 10 so days like "08"/"09" are not read as invalid octal numbers)
(( 10#$CURRENT_DAY == LAST_DAY_OF_MONTH )) && {
    printf "\n*** END OF MONTH - Copying to monthly backup\n"
    cp "${BACKUP_DIR}${DAILY_BACKUP_FILE}" "${BACKUP_DIR}db_monthly_${TODAY_TS}.json"
}

# Remove backups older than the retention period for each tier
printf "\n*** Cleaning old backup files\n"
find ${BACKUP_DIR}db_daily_* -mtime +${BACKUP_RETENTION_DAILY} -exec rm {} \;
find ${BACKUP_DIR}db_weekly_* -mtime +${BACKUP_RETENTION_WEEKLY} -exec rm {} \; 2>/dev/null || true
find ${BACKUP_DIR}db_monthly_* -mtime +${BACKUP_RETENTION_MONTHLY} -exec rm {} \; 2>/dev/null || true

set +e
exit 0
A few notes:
- I am using pyenv to manage my Python versions and virtual environments (yes, my app is called “menu”). You may have a different solution and need to tweak the script accordingly.
- set -e will stop the script immediately if any command fails.
- The DO droplet did not come with cal installed, so I needed to sudo apt install ncal to get the last day of the month this way.
- The retention values are in numbers of days. This is because I use find with -mtime to locate the old files, rather than going off the timestamp in the filename.
- This script will run in a non-interactive SSH shell, so don’t rely on env vars being available. You may need to specify them in your script or change your .bashrc a little (other shells work differently); see the quick test below.
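A quick way to check how the script behaves in that non-interactive environment is to run it over SSH from your own machine, much as the CI job will later. Something like this (user, host and paths are placeholders):

# run the backup script in a non-interactive SSH session
ssh <user>@<server-ip> "cd /apps/<project> && . ./scripts/backup.sh"
# confirm the dump landed where expected
ssh <user>@<server-ip> "ls -lh /apps/data_backups/<project>/"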
A note on env vars and .bashrc
Because this script will get run when logged into your server via SSH in non-interactive mode, you will not necessarily have access to all the environment vars and other settings contained in your .bashrc. Exactly how this works differs a little across different flavours of Linux. On my Digital Ocean Ubuntu droplet, .bashrc runs, but most of it is ignored because of the following code:
# If not running interactively, don't do anything
case $- in
*i*) ;;
*) return;;
esac
Some Linux flavours may not run .bashrc at all. In my case, all I really need is for pyenv to work so I can activate my project environment, and for the environment vars necessary for that project to be available. To allow this, I moved up the pyenv initialisation and loaded in my project settings at the top of the .bashrc file:
# -- PYENV --
export PYENV_ROOT="$HOME/.pyenv"
command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
# Load pyenv-virtualenv automatically
eval "$(pyenv virtualenv-init -)"
# -- LOAD PROJECT ENV VARS --
set -o allexport
source ~/appconfig/.env_project1
source ~/appconfig/.env_project2
set +o allexport
As you can see, I am running more than one project on this server, so I have separate .env files for environment vars for each project.
💡 Note that these changes are needed for the user that runs your project. If you have different users for each project, make sure you are configuring the correct user.
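To check that the rearranged .bashrc actually takes effect for non-interactive sessions, you can probe the server from your own machine. A rough check, with placeholder user, host and variable name:

# pyenv should be found and initialised in a non-interactive shell
ssh <user>@<server-ip> 'command -v pyenv && pyenv versions'
# a variable loaded from one of the .env_project files should be visible
ssh <user>@<server-ip> 'printenv SOME_PROJECT_VAR'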
Step 2 - Configure SSH
For GitLab to connect to your server at all it needs to authenticate itself. This is handled via an SSH key. GitLab has good documentation on this, so I won’t go into too much detail here, but the basic steps are:
- Create an SSH key pair on your local machine using ssh-keygen (see the sketch after this list).
- Add the $SSH_HOST (IP address of your server), $SSH_PRIVATE_KEY (the key you generated in the previous step, base64-encoded so the CI job below can decode it with base64 -d), and $SSH_USER (the user that owns the application on your server, and that can do a deploy) environment variables to your repository under Settings > CI/CD > Variables.
- Add the public key to that user’s authorized_keys on your server. I called my key file id_ed25519_gitlab_ci.pub for clarity.
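In practice, that boils down to something like the following, run from your local machine (server details are placeholders):

# 1. generate a dedicated key pair for GitLab CI
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_gitlab_ci -C "gitlab-ci"
# 2. print the private key base64-encoded (GNU base64; flags differ on macOS)
#    and paste the output into $SSH_PRIVATE_KEY under Settings > CI/CD > Variables
base64 -w0 ~/.ssh/id_ed25519_gitlab_ci
# 3. add the public key to the deploy user's authorized_keys on the server
ssh-copy-id -i ~/.ssh/id_ed25519_gitlab_ci.pub <user>@<server-ip>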
At this point, GitLab CI/CD has what it needs to connect to your server.
Step 3 - Add a CI backup job
Once GitLab CI/CD can authenticate itself and connect to your server, you can ask it to do so in the configuration. I have added a separate backup stage for clarity.
stages:
  - backup
  - ...

.script_config: &script_config
  image: ubuntu:latest
  before_script:
    - apt-get -yq update
    - apt-get -yqq install ssh
    - install -m 600 -D /dev/null ~/.ssh/id_ed25519
    - echo "$SSH_PRIVATE_KEY" | base64 -d > ~/.ssh/id_ed25519
    - ssh-keyscan -H $SSH_HOST > ~/.ssh/known_hosts
  after_script:
    - rm -rf ~/.ssh

backup:
  <<: *script_config
  stage: backup
  only:
    - schedules
  script:
    - ssh $SSH_USER@$SSH_HOST "cd $WORK_DIR && . ./scripts/backup.sh"
Here, I have factored out the basic connection functionality into .script_config. This is because I use a similar script to deploy my project via CI/CD, and that also needs to connect to the server to do so.
The before_script section of .script_config here is doing the following:
- Installing ssh.
- Copying the private key from the GitLab CI/CD variable into the running Ubuntu container (decoding it from base64 and giving it the 600 permissions SSH expects).
- Adding the host to known_hosts.
Likewise, the after_script section cleans this up afterwards to ensure no sensitive information is left behind in artefacts.
The backup job itself will only run as a scheduled job, and simply runs the script you created in step 1.
💡 Note that the variable $WORK_DIR needs adding to your CI/CD variables, and should correspond to the root directory of your project on the server.
Step 4 - Add the scheduled job
This final step is the simplest. Under CI/CD > Schedules, add a new pipeline schedule. You will need to provide the following information:
- Description: A short description of the task that lets you identify the job later.
- Interval pattern: There are simple selection options here, or you can define it using the standard pattern for crontab jobs (see the example below).
- Crontab timezone: The timezone to apply to the interval pattern.
- Target branch or tag: The branch or tag that should be used for the job. Most likely main.
- Variables: Any variables that are specifically required for this job.
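As an example (not necessarily my exact settings), an interval pattern of 0 2 * * * with the crontab timezone set to your local timezone would run the backup every night at 02:00; the five fields are minute, hour, day of month, month and day of week.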
💡 You could set up the SSH vars here if you like and have them apply only to this pipeline. Because my server is only a staging server and I run tasks on different projects using the same user, I have my vars at the project level. If you have different users, it may make sense to have them here.
Conclusion
And that’s it: you now have a scheduled task that will dump the database to a .json file. Because I run multiple projects on the same server (mostly staging environments for different things), I can then pay Digital Ocean for a single backup of the one server, which gives me backups for all the data, rather than paying for individual database backups for each project.
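For completeness, restoring from one of these dumps is just the reverse operation, run on the server in the project directory with the virtualenv active, and typically against a freshly migrated database. Something like this (the filename is a placeholder):

python manage.py loaddata /apps/data_backups/<project>/db_daily_<timestamp>.json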
This solution is almost certainly not fit for use in a high-security production environment, but for personal projects, staging environments with fake data in them, etc., it is a good way of backing up multiple projects with a single backup and controlling it all via GitLab configuration.