DBT on Google Cloud Platform: DBT modeling

I decided to model this raw sales dataset into a star schema with a fact table and multiple dimension tables.

Here are the DBT project files: https://github.com/Gregory-Bagramyan/dbt-gcp-project

The raw data comes from Kaggle as a CSV. This CSV is stored in the seeds folder and is loaded into BigQuery through the “dbt seed” command.
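For illustration, a seed can be configured in dbt_project.yml before running “dbt seed” — the project, seed and column names below are assumptions, not necessarily what the repo uses:

    # dbt_project.yml (sketch; project, seed and column names are assumptions)
    seeds:
      dbt_gcp_project:
        sales:
          +column_types:
            Postal_Code: string  # force string so BigQuery keeps leading zeros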

The modeling starts with the first layer, the staging layer, which is the stg_sales view.

You can find the SQL of the stg_sales view here: https://github.com/Gregory-Bagramyan/dbt-gcp-project/blob/main/models/staging/stg_sales.sql

In this view I (sketched below):

  • cast all the columns into the desired data types explicitly

  • transformed the date field from the dd/mm/yyyy format into a proper date type (yyyy-mm-dd)

  • made sure that empty postal codes are null (postal code is the only column that contained empty rows)

  • renamed all the columns so they are easily queryable.
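Putting those steps together, here is a minimal sketch of what stg_sales can look like — the raw column names are assumptions based on the usual Kaggle sales CSVs, not necessarily the ones in the repo:

    -- models/staging/stg_sales.sql (sketch)
    with source as (
        select * from {{ ref('sales') }}  -- the seeded CSV; seed name is an assumption
    )
    select
        cast(Order_ID    as string)         as order_id,
        cast(Customer_ID as string)         as customer_id,
        cast(Product_ID  as string)         as product_id,
        parse_date('%d/%m/%Y', Order_Date)  as order_date,  -- dd/mm/yyyy text -> DATE
        nullif(trim(Postal_Code), '')       as postal_code, -- empty strings become null
        cast(Sales as numeric)              as sales
    from source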

THE FACT TABLE

At the center of the model I have the sales fact table, fct_sales, as it’s the table that answers the question of what happened, numerically, through the sales amount.

You can find the SQL for the fct_sales here: https://github.com/Gregory-Bagramyan/dbt-gcp-project/blob/main/models/fct/fct_sales.sql

I materialized fct_sales as a table using the dbt_project.yml file: https://github.com/Gregory-Bagramyan/dbt-gcp-project/blob/main/dbt_project.yml
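In dbt_project.yml the materializations are set per folder; a sketch (the project key is an assumption):

    # dbt_project.yml (sketch)
    models:
      dbt_gcp_project:
        staging:
          +materialized: view
        fct:
          +materialized: table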

At the beginning of the fct_sales.sql file I leveraged the “partition_by” and “cluster_by” configurations, which are BigQuery-specific and optimize future queries. “partition_by” is done by date, so filtering by date is optimized. And “cluster_by” is done by “customer_id” and “product_id”, so the joins on dim_customers and dim_products are optimized.
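That config block, at the top of fct_sales.sql, looks roughly like this (the partition field name is an assumption):

    {{ config(
        materialized = 'table',
        partition_by = {'field': 'order_date', 'data_type': 'date'},
        cluster_by   = ['customer_id', 'product_id']
    ) }}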

From there I kept all the business IDs in this table:

  • order_id, as a degenerate dimension: it’s useful to have, but since it carries no further attributes (like a status, for example), it doesn’t deserve its own dimension

  • customer_id

  • product_id

I kept the raw dates, for quick filtering and partitioning, alongside the date keys for joining on the dim_dates table, which exists for easy date intelligence like knowing which quarter a date falls in or whether it is a weekend or not.

Then I selected the location columns to join on dim_locations and extracted the location key (location_sk) that I created using the dbt_utils package and its surrogate key function.

And finally, the last piece of the fct_sales table is the sales measure.
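Putting the pieces together, here is a sketch of the body of fct_sales — the column names and the exact dbt_utils macro (generate_surrogate_key in recent versions, surrogate_key in older ones) are assumptions:

    -- models/fct/fct_sales.sql (sketch, below the config block)
    select
        s.order_id,        -- degenerate dimension
        s.customer_id,     -- joins dim_customers
        s.product_id,      -- joins dim_products
        s.order_date,      -- raw date, kept for filtering and partitioning
        s.order_date as order_date_key,  -- key for joining dim_dates
        {{ dbt_utils.generate_surrogate_key(
            ['s.country', 's.state', 's.city', 's.postal_code']
        ) }}               as location_sk,  -- joins dim_locations
        s.sales            -- the measure
    from {{ ref('stg_sales') }} as s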

THE DIMENSION TABLES

There are 4 dimension tables in this model:

dim_products and dim_customers simply reference the staging view and pick the right columns; the raw data already contained a key, so I decided to use it instead of creating a new one. Both tables select the latest info in the raw data using the row_number window function (sketched below). For more mature projects we could implement slowly changing dimensions type 2 to keep all the historical information.
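A sketch of the row_number deduplication for dim_customers — the attribute columns and the ordering column are assumptions:

    -- dim_customers.sql (sketch)
    with ranked as (
        select
            customer_id,
            customer_name,
            segment,
            row_number() over (
                partition by customer_id
                order by order_date desc  -- most recent row wins
            ) as rn
        from {{ ref('stg_sales') }}
    )
    select customer_id, customer_name, segment
    from ranked
    where rn = 1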

For dim_locations I had to pick the location columns and create a surrogate key out of them using the surrogate key function from dbt_utils.
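A sketch for dim_locations, again with assumed column names — the important part is that the same column list is hashed here and in fct_sales, so the keys line up:

    -- dim_locations.sql (sketch)
    select distinct
        {{ dbt_utils.generate_surrogate_key(
            ['country', 'state', 'city', 'postal_code']
        ) }} as location_sk,
        country,
        state,
        city,
        postal_code
    from {{ ref('stg_sales') }}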

Finally, dim_dates is a date dimension that exists for date intelligence, like knowing which quarter a date falls in or whether it is a weekend or not.
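dbt_utils helps here too: its date_spine macro generates one row per day, and BigQuery’s extract does the date intelligence. A sketch with an assumed date range:

    -- dim_dates.sql (sketch)
    with spine as (
        {{ dbt_utils.date_spine(
            datepart   = "day",
            start_date = "cast('2014-01-01' as date)",
            end_date   = "cast('2019-01-01' as date)"
        ) }}
    )
    select
        date_day                                   as date_key,
        extract(year    from date_day)             as year,
        extract(quarter from date_day)             as quarter,
        extract(dayofweek from date_day) in (1, 7) as is_weekend  -- BigQuery: 1 = Sunday, 7 = Saturday
    from spine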

TESTS 

I implemented the following generic tests on all the columns that were relevant:

  • not null

  • unique

  • and relationships (reference tests)

Note: for the relationships tests I removed the “arguments” key, as it didn’t exist in the dbt version I chose for the project, even though the latest version of DBT requires this key and raises a warning when it’s missing.

And I used a schema.yml file for that: https://github.com/Gregory-Bagramyan/dbt-gcp-project/blob/main/models/schema.yml 

Alongside the tests, I added documentation for the different tables using the description parameter.
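A sketch of what such a schema.yml can look like — the model and column names are taken from above, the descriptions are placeholder wording, and the relationships syntax shown is the older one without the “arguments” key:

    # models/schema.yml (sketch)
    version: 2

    models:
      - name: dim_customers
        description: "One row per customer, latest attributes."
        columns:
          - name: customer_id
            description: "Natural key from the raw data."
            tests:
              - unique
              - not_null
      - name: fct_sales
        description: "One row per order line, with the sales measure."
        columns:
          - name: customer_id
            tests:
              - not_null
              - relationships:
                  to: ref('dim_customers')
                  field: customer_id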

Next

DBT on Google Cloud Platform: Infrastructure