Simple, free resources to help you build a better WordPress website.

Sign up for our newsletter.

Formatting Large Sets of Migration Data with RegEx Matches

Data migration is an essential task when converting static sites into WordPress themes. For simple static content sites, it’s just a few quick copy/pastes into the content area on a post/page and you’re finished. However, if you encounter a more complex site, you might find yourself with a lot of hard-coded data that is not as easily imported.

More than a few times, I’ve encountered oodles of lines of hardcoded values from JavaScript that need to be inserted as a custom post type or an admin option value. One time, I received an email with a copy/paste from an Excel spreadsheet with values to import into WordPress. The original Excel file was no longer in existence and I was left with a blob of text that would take seemingly hours to format into PHP for processing within WP.

It’s embarrassing to admit, but I used to spend hours formatting line-by-line. Then, I learned about regular expressions. I started replacing single expressions like ^\n or \t and that saved a lot of time. But, the lightbulb moment came when I learned about generating matches from within an expression. Instead of doing piecemeal replacements with single expressions, multiple matches in a greater expression could format my entire data set in mere minutes.

In no way do I consider myself to be a regex expert, but I have learned to make my messy expressions work for data migration operations. There are a few steps that you need to take when your screen is filled up with data that is not formatted properly for migration. Certainly, you can process this text block via server-side code, but that’s not optimal. It’s best to format your data right in your text editor. I use Sublime Text 2.

Identify Patterns

The first thing to do when looking at a large set of seemingly insurmountable data as a whole is to find a pattern of delimiters. Typically, there is some sort of consistent separator to target for individual values within a single record. If your data is stored as values in another language, the patterns are generally obvious.

Hardcoded JavaScript values:

var earthquakes = [];

earthquakes[18] = {};
earthquakes[18].Src = 'ci';
earthquakes[18].Equid = '15262513';
earthquakes[18].Version = '0';
earthquakes[18].Datetime = 'Tuesday, December 11, 2012 23:44:08 UTC';
earthquakes[18].Lat = '33.4455';
earthquakes[18].Lon = '-117.4315';
earthquakes[18].Magnitude = '1.9';
earthquakes[18].Depth = '15.20';
earthquakes[18].NST = '18';
earthquakes[18].Region = 'Southern California';

earthquakes[16] = {};
earthquakes[16].Src = 'ak';
earthquakes[16].Equid = '10617595';
earthquakes[16].Version = '1';
earthquakes[16].Datetime = 'Tuesday, December 11, 2012 23:44:04 UTC';
earthquakes[16].Lat = '61.7527';
earthquakes[16].Lon = '-146.9591';
earthquakes[16].Magnitude = '1.6';
earthquakes[16].Depth = '29.60';
earthquakes[16].NST = '16';
earthquakes[16].Region = 'Southern Alaska';

The value delimiters for this example are the single quotes around the JS property values.

Identify Values Within The Patterns

The next step is to identify how many values are in a record and what they represent. I typically keep a separate text file open and notate these values there. For example:

  1. Source
  2. Earthquake ID (unique)
  3. API Version
  4. UNIX Date/Time
  5. Latitude
  6. Longitude
  7. Magnitude (BOOM!)
  8. Depth
  9. Number of Reporting Stations
  10. Region

This list is a valuable legend of your data when building out your replacement template, especially if you have a few dozen values per record.

Create A Replacement Template

One of the less obvious things about Sublime Text 2 is that you can replace more than one line at a time. These multiple lines can be pre-formatted as well. Once I figured this out, I started building out visual replacement templates rather than inserting tab and newline characters in a single string.

Since I’m converting all of this earthquake data to PHP for use in WordPress, I built out the following template. Each earthquake will become an array element that’s primed for iteration on the server-side.

$earthquakes[] = array(
	'source'    => '$1',
	'eq_id'     => '$2',
	'version'   => '$3',
	'date_time' => '$4',
	'latitude'  => '$5',
	'longitude' => '$6',
	'magnitude' => '$7',
	'depth'     => '$8',
	'nst'       => '$9',
	'region'    => '$10'
);

The $N values correspond to the list I created in the previous step. These represent the matches generated by the expression. This preformatted string will serve as our replacement string. The dollar values will be dynamically replaced with the matches from the expression.

Write An Expression

This post is not about how to write regular expressions. There are quite a few great resources out there from which to learn by doing simple searches. However, if you don’t know how to produce specific matches within a greater expression, simply enclose a section of your regex with parentheses. Each parentheses will correlate incrementally with the dollar values in the replacement template.

Here’s the regex I used to parse the earthquake data. This has ten specific value matches using ‘(.+)’. You will notice the matching expressions are nested within single quotes, the delimiters identified in the first step.

\nearthquakes.+\{\};\n.+'(.+)';\n.+'(.+)';\n.+'(.+)';\n.+'(.+)';\n.+'(.+)';\n.+'(.+)';\n.+'(.+)';\n.+'(.+)';\n.+'(.+)';\n.+'(.+)';

Though this is messy and could be optimized a number of ways, it doesn’t matter. As long as it works, it’s fine since we’re not sending this regex to any sort of production code. I could easily match these values with a substantially shorter expression, but most text editors aren’t smart enough to identify complex matches. Hence, there are duplicate expressions for each value.

This regex produces a match for each record of data and nested matches for our parenthetical expressions.

Replace All

Once the data records have been properly matched, it’s time to replace each record with the template we created. I suggest clicking the replace field and selecting all to clear out any hidden multiple lines. Paste in the template. You will only see the last line, but don’t worry, it’s there. Hit Replace to replace one record and QA the result. Once everything looks good, hit Replace All and take five. You deserve a break.

Once I have all of the data formatted in PHP, I can build out a foreach loop that will insert the data into WordPress. This typically resides in a plugin that is only triggered by an obscure query string and disabled after the migration is completed.

Try It Yourself

In conclusion, it’s a good practice to save the regex, replacement template, and values legend in a separate file so you don’t have to redo all of this work later if you find a mistake. If you’re a bit shaky with regular expressions, head over to regexpal or regex fiddle and test your regex skills on-the-fly.

Downloads

2 Comments

  1. Avatar
    Matt says:

    Hey good idea. I made pretty good headway on one of my migration projects with a find and replace function of my text editor, but I hadn’t thought about using RegEx. Thanks for the brief tut.

    – Matt

Leave a Reply

Your email address will not be published. Required fields are marked *