Snowplow on GCP

Snowplow is a highly customisable behavioural data platform, and blessed be that company since their code is open-source (amen).

For some reason, the only guide for implementing Snowplow on Google Cloud Platform was written in 2019, so I think it’s time for an update. Most of the content below was shamefully lifted directly from the aforementioned article.

Prerequisites

You should have your GCP account with billing enabled.

Host your Snowplow JS tracker file somewhere

  • In Search Console, register the domain that will host the JS file; once verified, you can delete the record that was used for verification
  • Follow this guide

Tag up your site

  • Do your thing in GTM

Enable the required services

Go ahead and switch on:

  • Compute Engine API
  • Cloud Pub/Sub API
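If you prefer the command line, the same two services can be switched on with gcloud (assuming the CLI from the next step is installed and authenticated):

```shell
# Enable the APIs the pipeline needs; the project must have billing enabled
gcloud services enable compute.googleapis.com pubsub.googleapis.com
```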

Install Google Cloud CLI

It makes interacting with GCP easier.
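Once installed, a one-time setup links the CLI to your account and project. The project ID and region below are placeholders; swap in your own:

```shell
gcloud init                                      # authenticate and pick a default project
gcloud config set project my-snowplow-project    # placeholder project ID
gcloud config set compute/region europe-west1    # pick whichever region suits you
```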

Service account

If you’ve used Compute Engine before, you should already have a powerful service account set up for you. Otherwise, you can quickly set one up yourself.
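If you’d rather create a dedicated account than reuse the Compute Engine default, here’s a minimal sketch (the account name, project ID, and role are placeholders; narrow the role for production):

```shell
# Create a dedicated service account for the Snowplow VMs
gcloud iam service-accounts create snowplow-pipeline \
  --display-name "Snowplow pipeline"

# Grant it Pub/Sub access (a broad role for brevity)
gcloud projects add-iam-policy-binding my-snowplow-project \
  --member "serviceAccount:snowplow-pipeline@my-snowplow-project.iam.gserviceaccount.com" \
  --role "roles/pubsub.editor"
```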

Set up Pub/Sub topics

Create topics called “good”, “bq-failed-inserts”, “bq-types”, “enriched-good”, and “bad”, though only the first four need subscriptions.
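The same topics — and subscriptions for the first four — can be created in one go (subscription names are placeholders):

```shell
# Topics the pipeline components publish to
for topic in good bq-failed-inserts bq-types enriched-good bad; do
  gcloud pubsub topics create "$topic"
done

# Only the first four need subscriptions
for topic in good bq-failed-inserts bq-types enriched-good; do
  gcloud pubsub subscriptions create "${topic}-sub" --topic "$topic"
done
```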

Create the config files

At a minimum, you’ll need four config files.

Create an HTTPS endpoint that connects to the stream collector

Create your stream collector template

  • Go to Compute Engine section
  • Create instance template
  • Choose “Set access for each API”
  • Enable Cloud Pub/Sub
  • Under Firewall, select “Allow HTTP traffic”
  • Expand “Advanced options”
  • Expand “Management”
  • Under “Automation”, fill in the script below:

    #! /bin/bash
    sudo apt-get update
    sudo apt-get -y install default-jre
    sudo apt-get -y install unzip
    sudo apt-get -y install wget
    wget "https://github.com/snowplow/stream-collector/releases/download/2.8.2/snowplow-stream-collector-google-pubsub-2.8.2.jar"
    gsutil cp gs://your-bucket/your-stream-collector-config .
    java -jar snowplow-stream-collector-google-pubsub-2.8.2.jar --config your-stream-collector-config

  • Expand “Networking”
  • In “Network tags”, add collector
  • Click “Create”
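The console clicks above have a rough CLI equivalent. This is a sketch: the template name and machine type are placeholders, and startup-script.sh is assumed to hold the script from the “Automation” step:

```shell
# Create the collector instance template from the CLI
gcloud compute instance-templates create collector-template \
  --machine-type e2-small \
  --tags collector \
  --scopes pubsub \
  --metadata-from-file startup-script=startup-script.sh
```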

Add firewall rule

  • In VPC network section, go to “Firewall”
  • In “Target tags”, type collector
  • In “Source IPv4 ranges”, type 0.0.0.0/0
  • Tick TCP, then type 8080
  • Click “Create”
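Or, the same firewall rule from the CLI (the rule name is a placeholder):

```shell
# Allow traffic on port 8080 to anything tagged "collector"
gcloud compute firewall-rules create allow-collector \
  --allow tcp:8080 \
  --source-ranges 0.0.0.0/0 \
  --target-tags collector
```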

Create a health check in Compute Engine

  • Protocol: HTTP
  • Port: 8080
  • Request path: /health
  • Check interval: 10 seconds
  • Unhealthy threshold: 3 consecutive failures
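The same health check, expressed as a gcloud command (the name is a placeholder):

```shell
gcloud compute health-checks create http collector-health-check \
  --port 8080 \
  --request-path /health \
  --check-interval 10s \
  --unhealthy-threshold 3
```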

Create stream collector instance group

  • In “Instance template”, select the template you created
  • In “Health check”, select the health check you created
  • Click “Create”
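A CLI sketch of the same instance group, assuming the template and health check names from the earlier sketches; the group size and initial delay are placeholders you should tune:

```shell
gcloud compute instance-groups managed create collector-group \
  --template collector-template \
  --size 1 \
  --health-check collector-health-check \
  --initial-delay 300
```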

Create a load balancer

  • In “Network services”, create a load balancer
  • Select “HTTP(S) Load Balancing”
  • On the next screen, keep the defaults
  • For Protocol, change to HTTPS
  • In IP address, click “CREATE IP ADDRESS”, then reserve one for yourself
  • In Certificate, create a new one, choose Google-managed, then fill in the domain of your tracker
  • For the backend, select the stream collector instance group you created, then set the port number to 8080
  • Scroll down, select the Health check you’ve created
  • Once done, in your tracker domain DNS configuration, add an “A” record that points to the static IP you reserved
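For reference, the whole load balancer can also be stitched together from the CLI. This is a hedged sketch — every resource name is a placeholder, tracker.example.com stands in for your tracker domain, and the zone should match your instance group:

```shell
# Reserve a static IP and a Google-managed certificate
gcloud compute addresses create collector-ip --global
gcloud compute ssl-certificates create collector-cert \
  --domains tracker.example.com --global

# Tell the instance group which port the named "http" service uses
gcloud compute instance-groups managed set-named-ports collector-group \
  --named-ports http:8080

# Backend service pointing at the collector instance group
gcloud compute backend-services create collector-backend \
  --protocol HTTP --port-name http \
  --health-checks collector-health-check --global
gcloud compute backend-services add-backend collector-backend \
  --instance-group collector-group \
  --instance-group-zone europe-west1-b --global

# URL map, HTTPS proxy, and forwarding rule tie it all together
gcloud compute url-maps create collector-lb \
  --default-service collector-backend
gcloud compute target-https-proxies create collector-proxy \
  --url-map collector-lb --ssl-certificates collector-cert
gcloud compute forwarding-rules create collector-rule \
  --address collector-ip --global \
  --target-https-proxy collector-proxy --ports 443
```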

Set up them VMs