How can we help you today? How can we help you today?

Gotchas when linking static data in source control

This isn't specific to Redgate source control but relates to a problem when putting data in a source control system. 

We took 3 strategies to placing data into source control for our data warehouse.
  • Small static dimensions
  • Seed records for facts and dimensions
  • Most recent 'n' records for facts and dimensions where n <10000
The problem that snuck up on us was the rate at which the repository expanded.  It wasn't until the repo hit 6GB that developers started to notice problems.  It was the version history that was causing the problem.

Temporarily our developers resorted to doing git pull --depth 1 to ensure only the most recent version was retrieved and not the full version histories.

There isn't a simple way of pruning down to a limited number of versions in git.  The syntax to do so is painful and obscure.

At least with Redgate source control you can disable update checks on static data and even unlink the data altogether.
Dave60103
0

Comments

5 comments

  • AlexYates
    Yeah - try not to put that much data in source control.

    With that much data I'd be thinking about alternative ways to track the data in the larger tables.
    AlexYates
    0
  • Dave60103
    We considered AWS S3 for the larger data sets.  Azure blob storage and Google cloud storage.  The consideration for us is that there is deliberate separation of environments (DEV, TEST, UAT, PROD) explicitly to stop PROD data permeating down through the environments.
    Realistically data in DEV environments is only to allow basic development so versioning of the data is not a requirement for us.
    Dave60103
    0
  • AlexYates
    Perhaps I'm misunderstanding something?

    If versioning of data is not a requirement, why is your repo so big? It sounded like you were putting a lot of data into source control.
    AlexYates
    0
  • TheSQLGuru
    Going one step past Alex's statement, I will ask were you putting a lot of data into source control that didn't NEED to be there? The ONLY thing I would think would change over time (certainly by any reasonable amount) would be #3, the most recent rows. What was your business case for storing that data as "static" in a source control system?? I sure as heck couldn't think of one off the top of my head.
    TheSQLGuru
    0
  • Dave60103
    The flawed implementation of the idea was to have a one-stop shop for building a server and database to the point where it could be used for development and unit testing.  The intent was for data used for reference and application configuration would reside in there.
    The problem came from three sources.
    • Not understanding the data model fully.  It's a undocumented system riddled with tech debt.  This led to more data than was anticipated being included for one particular table.
    • Not fully understanding the limitations of source control
    • Organisational inertia.  We used the tools we had available.  When all you have is a hammer.......

    Dave60103
    0

Add comment

Please sign in to leave a comment.