Creating a Fix File

Cole Herman edited this page Mar 12, 2026 · 4 revisions

A fix file, like your headers config file, is specific to a collection. You write fixes using pre-defined methods (each needs only the method name and, for some, a little regex) to do things like replace all values under a header with a single correct value or replace 'http://...' with 'https://...' in a URI. The program automatically makes a backup of the file, in case your fixes cause a problem.

To set it up:

  1. Create a file named "[headers config]-fixes.yaml" under od2_md_scripts/config.
    • The name MUST match your headers config file exactly, with '-fixes' added before the .yaml extension, or the program won't detect it.
  2. Add a YAML entry for each fix you want.
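
The naming rule in step 1 can be illustrated with a minimal Python sketch. This is not the tool's own code: the function name `fixes_path_for` is hypothetical, and the sketch only assumes that the fix file sits next to the headers config file and differs by the '-fixes' suffix.

```python
from pathlib import Path


def fixes_path_for(headers_config: Path) -> Path:
    """Return the fix-file path expected for a headers config file (hypothetical helper)."""
    # Insert '-fixes' between the file stem and its extension, same directory.
    return headers_config.with_name(
        f"{headers_config.stem}-fixes{headers_config.suffix}"
    )


config = Path("od2_md_scripts/config/uo-athletics.yaml")
print(fixes_path_for(config).name)
```

If the printed name does not match the file you created character for character, the program would not find your fixes.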

Step 2 takes more explanation, so it is broken down by fix type below.

There are three methods (chosen with 'type:' in the YAML) you can use for auto-fixes.

Quick summary:

  • 'strip': remove leading and trailing whitespace
  • 'enforce_string': replace every value with a given string
  • 'regex_replace': select a pattern (part of or a whole value in a cell) and replace it with another pattern

Method 1: strip

This will strip leading and trailing whitespace from all values in the column. All you need to do is enter the type and column as shown below, and it will be applied to every value under the header.

Example: in the uo-athletics collection, the 'institution' column often has an extra space after each value. A value usually looks like "http://id.loc.gov/authorities/names/n80126183 " instead of "http://id.loc.gov/authorities/names/n80126183". The fix for that would be:

fixes:
  # fix institution having extra space at the end
  - type: strip
    column: institution

Method 2: enforce_string

This will replace every value under a header with the same string. You just enter the type and column, as we did with Method 1, and the correct value will be retrieved from your headers config automatically. This can be confusing at first, so think of it this way: enforce_string is used when you know exactly what value you want for a field and it never changes. You would already be checking for that exact value in your headers config, so we just take that value and set all the cells to have it. This is why you don't actually write the value you want in the fixes file: it's already written in your headers config!

Example: in the uo-athletics collection, the 'license' column was often set to the wrong license type, nc instead of nc-nd. It would look like "http://creativecommons.org/licenses/by-nc/4.0/" when it should be "http://creativecommons.org/licenses/by-nc-nd/4.0/". This fix works even if the column was left blank, since it sets the expected value in every cell. Here's what that fix looks like in the YAML file, alongside the Method 1 fix so you can see multiple fixes together:

fixes:
  # fix institution having extra space at the end
  - type: strip
    column: institution
  # fix license being wrong type
  - type: enforce_string
    column: license

Again, note that even though enforce_string is replacing cells with a specific value, we don't write that value here because it's already in the headers config.
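
The idea can be sketched in Python as follows. This is a hedged illustration, not the tool's implementation: the `expected` dictionary is a hypothetical stand-in for whatever the headers config holds, and `enforce_string` here is just a plain function with the same name as the fix type.

```python
# Hypothetical stand-in for the expected value stored in the headers config.
expected = {"license": "http://creativecommons.org/licenses/by-nc-nd/4.0/"}


def enforce_string(rows, column, expected_values):
    """Return copies of the rows with one column set to its single expected value."""
    value = expected_values[column]
    return [{**row, column: value} for row in rows]


rows = [
    {"license": "http://creativecommons.org/licenses/by-nc/4.0/"},  # wrong type
    {"license": ""},  # left blank
]
fixed = enforce_string(rows, "license", expected)
print([row["license"] for row in fixed])
```

Both the wrong value and the blank cell end up with the same expected string, which is why the fix entry itself needs no value.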

Method 3: regex_replace

regex_replace is the most powerful method and the hardest to use. It can append text to the end of a cell, insert a letter in the middle, and do most other pattern-based edits you can think of. It takes four fields in the YAML, unlike the two that strip and enforce_string take.

'type' and 'column' are the same as before.

'pattern': this tells the program what to select

'replacement': this tells the program what to replace selected values with
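
As a quick reference for which fields each fix type needs, here is a small Python sketch that checks a fix entry for missing fields. The field lists come from this page; the `missing_fields` helper and the idea of validating entries this way are assumptions for illustration, not something the tool is known to do.

```python
# Required fields per fix type, as described on this page.
REQUIRED_FIELDS = {
    "strip": {"type", "column"},
    "enforce_string": {"type", "column"},
    "regex_replace": {"type", "column", "pattern", "replacement"},
}


def missing_fields(fix):
    """Return a sorted list of required fields that a fix entry lacks (hypothetical helper)."""
    fix_type = fix.get("type")
    if fix_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown fix type: {fix_type!r}")
    return sorted(REQUIRED_FIELDS[fix_type] - set(fix))


print(missing_fields({"type": "regex_replace", "column": "file"}))
```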

We'll give two examples here to show the variety of uses you might have for regex_replace.

First, in uo-athletics the 'location' field requires a URI from geonames.org. This data source always uses 'https://', but our data frequently has 'http://' in the URI without the 's'. To fix this, we want to select just that part of the string to edit, so we use Regex. It looks like this in the yaml:

fixes:
  # fix location using http over https
  - type: regex_replace
    column: location
    pattern: '^http://'
    replacement: 'https://'

Here we choose the part of the value to replace (only the 'http://' at the start) and give the replacement, which is just 'https://'. The whole link is longer, something like 'http://sws.geonames.org/1234567', but we only care about the start, so that is all the pattern selects. Only the text matched by the pattern is edited by the replacement. This is why we use regex: we want to select the relevant parts and leave the rest alone, especially parts that differ every time, like IDs.
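
You can try the pattern outside the tool before adding it to your fix file. The sketch below uses Python's `re.sub` and assumes the tool's regex behaves like Python's `re` module, which this page does not confirm.

```python
import re

# Same pattern and replacement as in the YAML above.
pattern = r"^http://"
replacement = "https://"

values = [
    "http://sws.geonames.org/1234567",   # needs the fix
    "https://sws.geonames.org/1234567",  # already correct, pattern does not match
]
fixed = [re.sub(pattern, replacement, value) for value in values]
print(fixed)
```

Because '^' anchors the pattern to the start of the value, a value that already begins with 'https://' is left unchanged.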

Another use of regex_replace is appending a file extension to the end of a cell. In uo-athletics, the 'file' field should always end in .tif, but the data is often missing the extension. We can use regex_replace to append .tif to the end like so:

fixes:
  # fix file column missing file extension (append .tif to the end of the ID)
  - type: regex_replace
    column: file
    pattern: '^(.*?)(?<!\.tif)$'
    replacement: '\1.tif'

WARNING: you must exclude values from the pattern that you don't want to append to. '.tif' is excluded to prevent '.tif.tif' results.

Notice that we are specifically excluding values that already end in '.tif'. Imagine that half of the cells in the column ended in '.tif' and half didn't. If we applied the fix to every cell in the column, we would end up with a bunch of '.tif.tif' double extensions, because it would append every time. So we use a negative lookbehind, '(?<!\.tif)', placed just before the '$' to make sure that the pattern only matches values that don't already end in '.tif'.
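
The behavior of this pattern can be checked with a short Python sketch. It assumes the tool's regex semantics match Python's `re` module (not confirmed on this page), and the sample IDs are made up for illustration.

```python
import re

# Same pattern and replacement as in the YAML above.
pattern = r"^(.*?)(?<!\.tif)$"
replacement = r"\1.tif"

# Hypothetical sample values: one missing the extension, one already correct.
values = ["UA_001", "UA_002.tif"]
fixed = [re.sub(pattern, replacement, value) for value in values]
print(fixed)

# An empty string also matches the pattern under Python's re module.
print(repr(re.sub(pattern, replacement, "")))
```

Under Python's `re`, the empty string matches too and becomes '.tif'. Whether the tool passes blank cells through the pattern is not stated here, so check blank 'file' cells after running the fix.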

Here's an example of a layout with real data and fixes from uo-athletics:

# No need to specify the string when using enforce - it will read the correct value from uo-athletics.yaml, since it 
# expects only one specific, unchanging value
fixes:
  # fix institution having extra space at the end
  - type: strip
    column: institution
  
  # fix license being wrong type (usually nc instead of nc-nd)
  - type: enforce_string
    column: license
  
  # fix location using http over https
  - type: regex_replace
    column: location
    # select the start of the geonames url
    pattern: '^http://'
    # replace http with https, keeping the rest of the url intact
    replacement: 'https://'

  # fix file column missing file extension (append .tif to the end of the ID, which is what's almost always there instead)
  - type: regex_replace
    column: file
    pattern: '^(.*?)(?<!\.tif)$'
    replacement: '\1.tif'

You can use these methods to fix most simple issues, and with some creativity they can do quite a lot. Anecdotally, for some of the more tedious uo-athletics fixes, they made fixing the metadata roughly ten times faster. They can be fiddly to get right, though, so whether to use them is up to you.
