16.10.2020

Turing Pi – GitLab CI + Ansible

Where we left off

In my last blog article I left you with a rather vague statement about Ansible and GitLab CI, and that we were going to use a Turing Pi for that purpose. That article covered the capabilities of this awesome board and the first steps with it. Today I’ll go into a bit more detail about the benefits we gained from running Ansible scripts in a CI environment and, most importantly, where the Turing Pi helps us achieve this.

What are we going to build with it?

In our project we have a lot of Raspis and configure them with Ansible. Over time our Ansible scripts are getting more complex as we add features and fix bugs. Besides that, the team has grown and will continue to grow. We already do code reviews with GitLab, but we had no validation that a change actually works on real hardware, or even that its syntax is correct. So we put 4 Pis into a 19″ rack and integrated them as a deploy target for our merge requests. When a user opens a merge request, we reboot the machine and boot the OS over the network. From there we can flash the SD card with a fresh Raspbian. After another reboot we can run our Ansible scripts to test whether the merge request is sane.

Just by running an Ansible playbook that sets up hosts from 0 to 100, we get the following benefits:

Check that the syntax is correct
Ensure that all needed files are committed
Verify that the different configuration flavors can still be applied
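
The first of these checks can also be run on its own, before any hardware is touched. A minimal sketch, assuming the playbook and inventory names used later in this article (site.yml, hostsCI.yml); the lint stage and the job name are made up for illustration:

```yaml
# .gitlab-ci.yml fragment (sketch; the "lint" stage and the job name are hypothetical)
stages:
  - lint
  - test-on-pis

ansible-syntax:
  stage: lint
  script:
    # --syntax-check parses the playbook without executing any task
    - ansible-playbook site.yml -i hostsCI.yml --syntax-check
```

This only catches syntax problems. Missing files and broken configuration flavors still need the full run on a real Pi.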

Why do we need netboot at all?

Our goal is to run our Ansible scripts against a freshly installed Raspbian. We cannot be sure that the last tested merge request was flawless. The idea is to reboot the node into an NFS-based root file system. From there we can flash Raspbian and reboot into a clean OS. After that we can apply and test our Ansible scripts. Following this process guarantees that every merge request gets a clean system without any leftovers from a previous run.

Setting up GitLab CI/runner

To be honest, we’re currently stuck with a shell runner. We weren’t able to find a properly working Docker image with Ansible that doesn’t lack some of the tooling/libs we currently require. Building a custom Docker image would be possible, but for the sake of simplicity we use a plain shell runner.

First of all we need to set up one node as a DHCP/TFTP boot server. As mentioned in my previous post there are a lot of good tutorials, e.g. https://www.raspberrypi.org/documentation/hardware/raspberrypi/bootmodes/net_tutorial.md. We recommend running one of the Turing Pi nodes as the DHCP/TFTP boot server, as it is more reliable than a server placed outside of the board.

One thing to note: The DHCP server is located behind a NAT router so it does not interfere with DHCP servers in our company network.

When the server is up and running, we need to set up the Gitlab runner.

1.) Because we rely on a shell runner, Ansible has to be installed by hand. This can be done with your package manager. Also install all dependencies that are needed for a successful Ansible run.

2.) Install the GitLab CI runner according to the official GitLab Runner installation documentation (https://docs.gitlab.com/runner/install/). When asked which kind of runner (executor) to use, choose shell.

3.) Add a custom inventory

You can use your regular inventory and add your CI host there. In that case you have to restrict your Ansible run with the --limit option. If you omit it, you’ll most probably deploy your prod system with every merge request 🙂
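
As a rough sketch of what such a combined inventory could look like in YAML format (all group names, host names and addresses below are hypothetical and only serve as an illustration):

```yaml
# hosts.yml (sketch; group names, host names and addresses are hypothetical)
all:
  children:
    prod:
      hosts:
        kiosk-01.example.com:
    ci:
      hosts:
        turing-node-2:
          ansible_host: 192.168.2.12
        turing-node-3:
          ansible_host: 192.168.2.13
```

With an inventory like this, a run such as ansible-playbook site.yml -i hosts.yml --limit ci would only touch the hosts in the ci group and leave prod alone.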

4.) Add or change .gitlab-ci.yml accordingly:

stages:
  - test-on-pis

test-on-turing-pi:
  stage: test-on-pis
  script:
    # Host keys change after every reflash, so disable host key checking for the test network
    - export ANSIBLE_HOST_KEY_CHECKING=False
    - mkdir -p "$HOME/.ssh"
    - echo -e "Host 192.168.2.*\nStrictHostKeyChecking no\nUserKnownHostsFile=/dev/null" > "$HOME/.ssh/config"
    # Reflash, then run the initial playbook with the default Raspbian password, then the full site playbook
    - ansible-playbook reflash.yml -i hostsCI.yml -e "ansible_ssh_pass=Bai9iZaetheeSee5"
    - ansible-playbook initial.yml -i hostsCI.yml -e "ansible_ssh_pass=raspberry"
    - ansible-playbook site.yml -i hostsCI.yml -e "ansible_ssh_pass=Bai9iZaetheeSee5" -e "http_proxy="
  except:
    - master

We do some “dirty” SSH tricks at the beginning. They ensure that Ansible can connect to the systems even when the host keys have changed, which happens in our setup every time we reflash a system. We tried to do this during an Ansible run but weren’t successful.
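
One alternative that may be worth a try is keeping these SSH options in the inventory instead of in the runner’s ~/.ssh/config, using Ansible’s ansible_ssh_common_args connection variable. This is only a sketch and has not been verified on this setup; the group name is hypothetical:

```yaml
# hostsCI.yml fragment (sketch; the "ci" group name is hypothetical)
all:
  children:
    ci:
      vars:
        # Appended to the ssh, scp and sftp command lines for hosts in this group only
        ansible_ssh_common_args: "-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null"
```

The advantage would be that the relaxed host key handling is scoped to the CI hosts and lives in the repository rather than on the runner.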

Reflashing your Pi is really straightforward once the PXE server is set up. Let’s have a look at an excerpt:

---
- name: Check if already pxe boot
  # Prints the filesystem type of /, which is "nfs" when the node was booted from the network
  shell: findmnt -n -o FSTYPE / | tail -n1
  register: root_mnt
  changed_when: false

- name: Destroy boot partition
  command: sudo sh -c 'ls -lha > /dev/mmcblk0p1; exit 0'
  when: root_mnt.stdout_lines[0] != "nfs"

- name: Reboot the machine into pxe
  reboot:
  when: root_mnt.stdout_lines[0] != "nfs"
....
- name: Flashing plain os to sd (gzip)
  command: sudo sh -c 'zcat /image_archive/{{ os | default("raspbian.img.gz") }} > /dev/mmcblk0'
  ignore_errors: yes

- name: Reboot the server
  become: yes
  become_user: root
  shell: "sleep 5 && reboot"
  async: 1
  poll: 0

- pause:
    seconds: 40

- name: Wait for the reboot and reconnect
  wait_for:
    port: 22
    host: '{{ ansible_ssh_host }}'
    search_regex: OpenSSH
    delay: 60
    timeout: 120
  connection: local

First of all we check whether the current system is already running on an NFS root filesystem. If so, we can skip the next two steps. If not, we destroy the current boot partition by writing some rubbish to the raw device, and then reboot the system. Next we flash the SD card/eMMC with the help of zcat. We compress our images to save some disk space and to be able to transfer them over a WAN connection (if needed). The performance impact is negligible. One reboot later, that’s it: we have a freshly installed Raspbian and can apply our Ansible scripts to a clean system.

Two things to note:

We ignore errors when flashing. When the image is missing or something similar goes wrong, we just continue and reboot the system. In most cases the system will still be accessible.
I couldn’t get the reboot module working properly. I think the newly flashed system does not have the same remote user and password as when we started the role. With this ugly workaround it works 🙂

Be aware that the above script is an excerpt from the main.yml of the corresponding role and is not complete!

What we also do

After a successful run we apply some smoke tests. One flavor of our RPis is a kiosk system that starts X11 and opens a Chromium browser in fullscreen mode. With some “technical creativity” (aka mocking) we ensure that the browser is running. If it is not, the build fails. Another flavor is used for controlling an industrial actuator, and its basic functionality is also tested at the end. We do not test every aspect. These are smoke tests, such as checking whether certain TCP ports are open.
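
To give an idea of what such checks can look like as Ansible tasks, here is a minimal sketch. It is not taken from our roles: the process name pattern and the port number are assumptions chosen for illustration.

```yaml
# smoke-test tasks (sketch; process name pattern and port number are assumptions)
- name: Check that the kiosk browser is running
  # pgrep exits non-zero when no process matches, which fails the task and the build
  command: pgrep chromium
  changed_when: false

- name: Check that the service port is open
  wait_for:
    port: 8080
    timeout: 30
```

Both tasks fail the playbook, and therefore the pipeline, as soon as the expectation is not met.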

We also want to know how our Ansible scripts affect a running system that is not flattened or rebooted on every run. For that we deploy to two nodes continuously whenever a merge to master happens. This lets us monitor the memory usage on these systems and identify possible memory leaks or stability issues that only show up when a system runs for a longer time. It has served us well as an additional safety net.
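
In GitLab CI terms this is a second job that runs only on master. A minimal sketch, reusing the stage, playbook and inventory names from the job shown earlier; the job name and the longrun inventory group are hypothetical:

```yaml
# .gitlab-ci.yml fragment (sketch; job name and the "longrun" group are hypothetical)
deploy-long-running-nodes:
  stage: test-on-pis
  script:
    # No reflash here: the playbook is applied on top of the already running system
    - ansible-playbook site.yml -i hostsCI.yml --limit longrun
  only:
    - master
```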

Why the Turing Pi is ideal for us

Compared to our old setup we gained speed and stability in our builds. In the past we had problems when booting multiple Pis in parallel from the network. Pis got stuck in the boot process and didn’t get their kernel via TFTP or couldn’t mount the NFS root reliably. To this day I’m not sure where I went wrong. I fixed it by throttling the reboots so that the Pis booted one by one, which slowed down the “build” significantly. With the Turing Pi we have not encountered any problems booting Pis from the network in parallel. So after testing it a bit I removed the throttle and reduced the build time significantly 🙂
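
For readers who run into the same problem on other hardware: one way to boot hosts one by one is Ansible’s throttle keyword (available since Ansible 2.9). This is a sketch of the idea, not the exact code from the old setup, and the task name is made up:

```yaml
# Sketch: serialize a single task across all hosts of the play
- name: Reboot into netboot one Pi at a time
  reboot:
  throttle: 1   # at most one host runs this task at any moment
```

The price is the one described above: the total time for this step grows linearly with the number of nodes.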

Another thing: we see far fewer Ansible jobs getting stuck because of a flaky network connection. My impression is that networking over the PCB is a lot more reliable. We have been using the Turing Pi for two weeks as a replacement for our old “test cluster”, and I can’t remember having to reset the cluster manually because of a frozen job.

Currently we work from home most of the time. Covid-19 case numbers are rising in Germany and it looks like we’ll stay at home for some time. With some VPN and network magic we can use some nodes as test devices remotely. In the past everybody had a “personal” test Pi, had to ensure a proper network connection, flash the SD card manually, etc. Yes, you can use Raspbian x86, but some things like accessing Bluetooth or GPIOs cannot be tested properly in a virtual environment.

What have we learned as a team?

In our last review and retro we also talked about our CI setup. We agreed that it helps us a lot in our day-to-day business and adds to the quality of our merge requests. For me personally it is a very good incentive to use more merge requests, so that the CI pipeline is triggered and I am forced to work in a cleaner way. It takes time to set it up and to stabilize some (ugly) parts, but in the end the time investment is worth it. Additional tests and reviews improve quality and reduce the number of errors, even in the context of configuring an OS.

What we miss with Turing Pi

Not much to be honest. Two things would be really awesome:

Switching the HDMI output from the first node to a different node via i2c
Only the first slot can be used for flashing eMMC. It would be great if the other slots could be used for flashing as well.

I think neither of these is easy to achieve, and for us they are not really pain points 🙂

Where we (maybe) want to go

Currently we still deploy our prod systems from our workstations. Continuous delivery will be a topic for us. But something like “merge to master” followed by an instant deployment of the prod system is too risky for us. Our Pis are integral to our customer. And that’s not the definition of CD anyway (for me :)). Our Pis cannot be reached physically within minutes and are distributed across several hundred kilometres. We do have strategies to mitigate this problem, but we want to be thoughtful about it.

Some kind of A/B deployment or a time-triggered deployment could reduce the risk of stopping all subsidiaries at once, but we need time to think about it and have to be patient 🙂

Another point is scaling up. Currently we allow only one job to run at a time, because we don’t have enough nodes. But I’ve ordered another Turing Pi board! Besides that, we already have some RPi4s in production. Turing Pi plans to start developing a CM4-compatible board after the next batch of boards is completed. Most likely we’ll order one.

Thanks a lot, Turing Pi team! Your product helps us with every merge request.