Gotchas when linking static data in source control

This isn't specific to Redgate source control but relates to a problem when putting data in a source control system.

We took 3 strategies to placing data into source control for our data warehouse.

Small static dimensions
Seed records for facts and dimensions
Most recent 'n' records for facts and dimensions where n <10000

The problem that snuck up on us was the rate at which the repository expanded. It wasn't until the repo hit 6GB that developers started to notice problems. It was the version history that was causing the problem.

Temporarily our developers resorted to doing git pull --depth 1 to ensure only the most recent version was retrieved and not the full version histories.

There isn't a simple way of pruning down to a limited number of versions in git. The syntax to do so is painful and obscure.

At least with Redgate source control you can disable update checks on static data and even unlink the data altogether.

Dave60103 October 14, 2019 20:31

Comments

5 comments

Yeah - try not to put that much data in source control.

With that much data I'd be thinking about alternative ways to track the data in the larger tables.

AlexYates October 18, 2019 10:44

0
We considered AWS S3 for the larger data sets. Azure blob storage and Google cloud storage. The consideration for us is that there is deliberate separation of environments (DEV, TEST, UAT, PROD) explicitly to stop PROD data permeating down through the environments.
Realistically data in DEV environments is only to allow basic development so versioning of the data is not a requirement for us.

Dave60103 October 20, 2019 19:11

0
Perhaps I'm misunderstanding something?

If versioning of data is not a requirement, why is your repo so big? It sounded like you were putting a lot of data into source control.

AlexYates October 20, 2019 20:34

0
Going one step past Alex's statement, I will ask were you putting a lot of data into source control that didn't NEED to be there? The ONLY thing I would think would change over time (certainly by any reasonable amount) would be #3, the most recent rows. What was your business case for storing that data as "static" in a source control system?? I sure as heck couldn't think of one off the top of my head.

TheSQLGuru November 05, 2019 19:36

0
The flawed implementation of the idea was to have a one-stop shop for building a server and database to the point where it could be used for development and unit testing. The intent was for data used for reference and application configuration would reside in there.
The problem came from three sources.
- Not understanding the data model fully. It's a undocumented system riddled with tech debt. This led to more data than was anticipated being included for one particular table.
- Not fully understanding the limitations of source control
- Organisational inertia. We used the tools we had available. When all you have is a hammer.......
Dave60103 November 24, 2019 11:20

0

Add comment

Please sign in to leave a comment.

How can we help you today?

Gotchas when linking static data in source control

Comments

Add comment