Cleaning Address Data for Legislative District Matching

Cleaning Address Data for Legislative District Matching

Follow this tutorial to get started learning powerful tools to quickly clean and format addresses for precise address to legislative district matching.

Cleaning and organizing addresses is crucial to any organization maintaining location data. Whether you’re a nonprofit maintaining a list of your member’s addresses, or a commercial retail company with hundreds of store locations, your data has to be uniformly formatted.

At Azavea, we maintain a District Match service that is used by nonprofits and commercial companies alike to match their addresses to legislative districts and elected officials. District Match leverages Cicero, a comprehensive database of elected officials in all 50 states and over 300 local jurisdictions that includes official’s contact and social media information. With Cicero’s District Match, organizations can power advocacy campaigns and better understand and empower their constituents.

We frequently hear from customers that they need help cleaning and formatting address data. This tutorial focuses on skills around organizing data using some useful tools. While there are no magic beans to solving every data problem (that we know of!), we hope the skills introduced in the tutorial will save time and leave readers more fluent and efficient working with data. We will try to point out quality resources for deeper dives and further learning along the way.

Getting started with Sublime

You will need two tools to complete this tutorial:

  1. You should have software to work with spreadsheets installed on your computer. MS Excel, Google Sheets, or LibreOffice will all work.
  2. Sublime Text. A text editor that we’ll use to help you clean your address data quickly and efficiently.

Open up Sublime text editor once you have downloaded and installed it.  You can use Sublime to write code in many programming languages and view and edit data in most formats (we’ll get to that). Sublime is easy to learn and good for lots of things, ranging from complex to fairly simple, and it is useful for cleaning data so we will use it here.

Cleaning messy address data

Copy and paste the sample messy address data (generated with mockeroo) into a new file in Sublime. Save it somewhere you can keep track of, like: projects/data-cleaning-blog/address.csv

Pen Howes  72557 Crescent Oaks Parkway week 8 Baltimore,MD 21290
Berti Heindrick auctor gravida sem 0606 Autumn Leaf Center toll 11 Sacramento,CA 94237
Pooh Mitro ultrices posuere cubilia 81823 Maywood Hill also 271 Reno,NV 89595
Gizela Madeley  007 Summit Drive amps 3 New York City,NY 10260
Bord Malthouse pretium nisl ut volutpat 3918 Declaration Trail fend 380 Fargo,ND 58106
Min Darnbrook quam suspendisse potenti 4 Laurel Trail aces 317 Saint Louis,MO 63169

Lines and line numbers are important to computers. When writing in a regular document, the sentences and paragraphs spill over to the next line when they run out of space or the font increases. In a computer program or data file everything belongs on a specific numbered line.

In our messy address data, each unformatted address is already on it’s own line. Open Sublime’s Find & Replace tool found by clicking Menu → Find → Replace or ctrl + h. Try finding a few words and you will see them highlighted in the Sublime editor.

Finding patterns with regular expressions

In Computer Science, regular expressions (also known as Regex) are a powerful tool for matching patterns used in many types of software (like Sublime) and programming languages. Here, we will use them to quickly clean and format your address data. Fundamentally, they involve two steps:

1) Designate a pattern, and

2) Use that pattern to find matching instances in your data

Here is how they work:

1) Some Examples of Patterns
# same as a text match                            /abc/  → matches 'abc'
# + match 1 or more of character   /ab+c/ → match abc, abbc, abbbc, etc.
# * match 0 or more of character   /ab*c/  → matches ac, abc, abbc, etc.
# ? match 0 or 1 instance of character       /ab?c/ → matches ac or abc
# {6} match 6 of character                  /ab{6}c/ → matches abbbbbbc

2) More Common Regex Syntax
\s → match single space
\d → match numeric digit, \D match non-digit
\w → match alphabetic character \W match non-alphabetic character
\d\d\d  → match 3 numeric digits
\d{3}  → match 3 numeric digits
\s{3,5} → match 3-5 spaces
\n, \t → newline, tab

3) Alternative Regex Syntax 
'regex', /regex/, r"regex", r'regex',/regex/g

4) Prepending or Appending to Matched Pattern
Use $0 to return the matched pattern when using Find and Replace in Sublime

Formatting address data with regular expressions

The key to formatting messy text data into a csv file is to insert commas to create columns and newline characters (\n) to create rows. This is how we will transform the messy address data:

Figure 3 -  Transforming Data
Figure 1 –  Transforming Data

First, add a header row to the spreadsheet with target columns.

firstname, lastname, address-prefix, address1, address2, city, state, zipcode

Next, when using District Match and working with data in general, it is important to have a unique ID for each record. You can add this column easily in most spreadsheet software with the autofill tool.

Figure 4 - Header Row and ID Column
Figure 2 – Header Row and ID Column

Now let’s parse the rest of the address by looking for patterns. With the csv file open in Sublime, enable the Regular Expressions tool (press the '.*' button at left of the Find & Replace window). Start with the state abbreviation (2 capital letters) and zipcode. Remember to always preserve leading zeros in zipcodes.

Find: [A-Z]{2}                                        Replace: $0,

Note: As discussed above $0 returns the matched segment, here we append a comma.

To find the name of the city, we look for one or more words followed by a comma and state abbreviation:

Find: [a-zA-Z ]*,(?=[A-Z]{2})                         Replace: ,$0

Note: (?…) denotes a positive lookahead, meaning this segment matches our target but we do not include it in the match. (Read more about lookahead and lookbehind here.)

Figure 5 - Separate Columns for City, State and Zipcode
Figure 3 – Separate Columns for City, State and Zipcode

Next, names in this data consist of first and last names, which are two strings separated by a space that follow the comma after the id field. Capture this pattern and append a comma.

Find: ,(\w+) (\w+)                                    Replace: $0,

Figure 6 - Separate Columns for First and Last Name
Figure 4 – Separate Columns for First and Last Name

Address2 (this portion of the address often represents Apt 4, Suite 1002, etc.) follows the pattern word number. Prepend a comma:

Find: [a-zA-Z]+ \d*,                                  Replace: ,$0

Address1 is the portion of the address that represents the core street address (i.e. 123 Main St) and is the most important for district matching and finding the geographic location of the address. This data has the consistent pattern number word word. Prepend a comma:

Find: \d+ ([a-zA-Z]+) ([a-zA-Z]+)                      Replace: ,$0

Figure 7 - Separate Columns for Address-prefix, Address1 and Address2
Figure 5 – Separate Columns for Address-prefix, Address1 and Address2

Finishing a project with District Match

When your address data is clean, you are ready to use District Match to match it to legislative districts and elected official contact information.

You can upload your CSV directly into the web browser. Choose the legislative districts and elected official data you need, and in a matter of minutes, your project will be accepted and processing will begin.

We hope this tutorial has saved you from manually cleaning hundreds or thousands of addresses! Let us know if you run into any problems. We’re always here to help.

Additional resources:

Tools on regular expressions and pattern matching

Interactively build patterns to match sample data,

Generate of a visual diagram of a regex (try building this sequence of patterns: abc, ab+c, ab*c, ab?c, ab{6}c)

Handy reference for more advanced regex topics like lookaheads and lookbehinds

Most programming languages offer tools for implementing regex, for example here is the Python library:

Unlimited patterns within patterns: recursion

As you work with Regular Expressions, you will find they enable you to match complex patterns that repeat without limit. This is the concept of recursion, which is fundamental both to both computer science and the natural phenomena such as human language. Read more here.