Snowplow on GCP

Snowplow is a highly customisable behavioural data platform, and blessed be that company since their code is open-source (amen).

For some reason, the only guide for implementing Snowplow on Google Cloud Platform was written in 2019, so I think it’s time for an update. Most of the content below was shamefully lifted directly from the aforementioned article.

Prerequisites

You should have your GCP account with billing enabled.

Host your Snowplow JS tracker file somewhere

  • In Search Console, register the domain that will host the JS file; once verified, you can delete the record that was used for verification
  • Follow this guide

Tag up your site

  • Do your thing in GTM

Enable the required services

Go ahead and switch on:

  • Compute Engine API
  • Cloud Pub/Sub API
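If you prefer the command line, the same two services can be switched on with gcloud (assuming the CLI from the next step is installed and authenticated):

```shell
# Enable the APIs the pipeline needs; the project must have billing enabled
gcloud services enable compute.googleapis.com pubsub.googleapis.com
```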

Install Google Cloud CLI

It makes interacting with GCP easier.
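Once installed, a one-time setup links the CLI to your account and project. The project ID and region below are placeholders; swap in your own:

```shell
gcloud init                                      # authenticate and pick a default project
gcloud config set project my-snowplow-project    # placeholder project ID
gcloud config set compute/region europe-west1    # pick whichever region suits you
```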

Service account

If you’ve used Compute Engine before, you should already have a powerful service account set up for you. Otherwise, you can quickly set one up yourself.
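If you’d rather create a dedicated account than reuse the Compute Engine default, here’s a minimal sketch (the account name, project ID, and role are placeholders; narrow the role for production):

```shell
# Create a dedicated service account for the Snowplow VMs
gcloud iam service-accounts create snowplow-pipeline \
  --display-name "Snowplow pipeline"

# Grant it Pub/Sub access (a broad role for brevity)
gcloud projects add-iam-policy-binding my-snowplow-project \
  --member "serviceAccount:snowplow-pipeline@my-snowplow-project.iam.gserviceaccount.com" \
  --role "roles/pubsub.editor"
```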

Set up Pub/Sub topics

Create topics called “good”, “bq-failed-inserts”, “bq-types”, “enriched-good”, and “bad”, though only the first four need subscriptions.
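The same topics — and subscriptions for the first four — can be created in one go (subscription names are placeholders):

```shell
# Topics the pipeline components publish to
for topic in good bq-failed-inserts bq-types enriched-good bad; do
  gcloud pubsub topics create "$topic"
done

# Only the first four need subscriptions
for topic in good bq-failed-inserts bq-types enriched-good; do
  gcloud pubsub subscriptions create "${topic}-sub" --topic "$topic"
done
```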

Create the config files

At a minimum, you’ll need four config files.

Create an HTTPS endpoint that connects to the stream collector

Create your stream collector template

  • Go to Compute Engine section
  • Create instance template
  • Choose “Set access for each API”
  • Enable Cloud Pub/Sub
  • Under Firewall, select “Allow HTTP traffic”
  • Expand “Advanced options”
  • Expand “Management”
  • Under “Automation”, fill in the script below:

    #! /bin/bash
    sudo apt-get update
    sudo apt-get -y install default-jre
    sudo apt-get -y install unzip
    sudo apt-get -y install wget
    wget "https://github.com/snowplow/stream-collector/releases/download/2.8.2/snowplow-stream-collector-google-pubsub-2.8.2.jar"
    gsutil cp gs://your-bucket/your-stream-collector-config .
    java -jar snowplow-stream-collector-google-pubsub-2.8.2.jar --config your-stream-collector-config

  • Expand “Networking”
  • In “Network tags”, add collector
  • Click “Create”
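The console clicks above have a rough CLI equivalent. This is a sketch: the template name and machine type are placeholders, and startup-script.sh is assumed to hold the script from the “Automation” step:

```shell
# Create the collector instance template from the CLI
gcloud compute instance-templates create collector-template \
  --machine-type e2-small \
  --tags collector \
  --scopes pubsub \
  --metadata-from-file startup-script=startup-script.sh
```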

Add firewall rule

  • In VPC network section, go to “Firewall”
  • In “Target tags”, type collector
  • In “Source IPv4 ranges”, type 0.0.0.0/0
  • Tick TCP, then type 8080
  • Click “Create”
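Or, the same firewall rule from the CLI (the rule name is a placeholder):

```shell
# Allow traffic on port 8080 to anything tagged "collector"
gcloud compute firewall-rules create allow-collector \
  --allow tcp:8080 \
  --source-ranges 0.0.0.0/0 \
  --target-tags collector
```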

Create a health check in Compute Engine

  • Protocol: HTTP
  • Port: 8080
  • Request path: /health
  • Check interval: 10 seconds
  • Unhealthy threshold: 3 consecutive failures
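The same health check, expressed as a gcloud command (the name is a placeholder):

```shell
gcloud compute health-checks create http collector-health-check \
  --port 8080 \
  --request-path /health \
  --check-interval 10s \
  --unhealthy-threshold 3
```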

Create stream collector instance group

  • In “Instance template”, select the template you created
  • In “Health check”, select the health check you created
  • Click “Create”
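A CLI sketch of the same instance group, assuming the template and health check names from the earlier sketches; the group size and initial delay are placeholders you should tune:

```shell
gcloud compute instance-groups managed create collector-group \
  --template collector-template \
  --size 1 \
  --health-check collector-health-check \
  --initial-delay 300
```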

Create a load balancer

  • In “Network services”, create a load balancer
  • Select “HTTP(S) Load Balancing”
  • On the next screen, keep the defaults
  • For Protocol, change to HTTPS
  • In IP address, click “CREATE IP ADDRESS”, then reserve one for yourself
  • In Certificate, create a new one, choose Google-managed, then fill in the domain of your tracker
  • For the backend, select the stream collector instance group you created, then set the port number to 8080
  • Scroll down, select the Health check you’ve created
  • Once done, in your tracker domain DNS configuration, add an “A” record that points to the static IP you reserved
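For reference, the whole load balancer can also be stitched together from the CLI. This is a hedged sketch — every resource name is a placeholder, tracker.example.com stands in for your tracker domain, and the zone should match your instance group:

```shell
# Reserve a static IP and a Google-managed certificate
gcloud compute addresses create collector-ip --global
gcloud compute ssl-certificates create collector-cert \
  --domains tracker.example.com --global

# Tell the instance group which port the named "http" service uses
gcloud compute instance-groups managed set-named-ports collector-group \
  --named-ports http:8080

# Backend service pointing at the collector instance group
gcloud compute backend-services create collector-backend \
  --protocol HTTP --port-name http \
  --health-checks collector-health-check --global
gcloud compute backend-services add-backend collector-backend \
  --instance-group collector-group \
  --instance-group-zone europe-west1-b --global

# URL map, HTTPS proxy, and forwarding rule tie it all together
gcloud compute url-maps create collector-lb \
  --default-service collector-backend
gcloud compute target-https-proxies create collector-proxy \
  --url-map collector-lb --ssl-certificates collector-cert
gcloud compute forwarding-rules create collector-rule \
  --address collector-ip --global \
  --target-https-proxy collector-proxy --ports 443
```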

Set up them VMs