Evolution of GIS Attribute Data from Collection to Cleaning

Now in its second year, Azavea’s Summer of Maps Program has become an important resource for non-profits and student GIS analysts alike. Non-profits receive pro bono spatial analysis work that can enhance their business decision-making processes and programmatic activities, while students benefit from Azavea mentors’ experience and expertise. This year, three fellows worked on projects for six organizations that spanned a variety of topics and geographic regions. This blog series documents some of their accomplishments and challenges during their fellowship. Our 2013 sponsors, Esri and Tri-Co Digital Humanities, helped make this program possible. For more information about the program, please fill out the form on the Summer of Maps page.

Most research involves data, and data is rarely perfect. It must be tailored before significant conclusions can be drawn. This blog post tracks data from its conception to its final, “mappable” form.

This summer, I had the opportunity to engage in the transformational phases that data passes through before becoming a map. In my work with the Pennsylvania Horticultural Society, the data passed through three major turning points:

1. Data Collection: observing and understanding data
2. Transcribing Data: digitizing data
3. Data Cleaning: preparing data for drawing significant conclusions and GIS visualization

My involvement with these transformative phases began with paying a visit to the trees of South Philly to observe factors such as bark damage, litter and overall tree condition.

1) Data Collection: Methods Impact Analysis

The PHS handed out survey sheets to trained Tree Checkers (volunteers trained by PHS as part of the Tree Tenders program) who recorded yes/no observations in a list of categories pertaining to the young trees and their environment. In this project there were two primary types of data collected:

Spatial data: All trees were associated with an address or intersection.
Qualitative data: Observations of non-numerical characteristics pertaining to the young trees.

The scenarios below demonstrate how the collection process provided additional insight and value to the data. They also hint at strategies for converting data so that significant conclusions may be drawn:

I) Location Precision Reduces Spatial Inaccuracy and Redundancy

The location of all trees was recorded by address or intersection. Sometimes multiple trees (5 or 6) were recorded as existing at a single address. Visiting the young street trees of South Philadelphia, it quickly became clear that trees were rarely planted directly in front of a house; rather they were planted in a tree pit. Taking into consideration the radius of trees from their recorded locations, I aggregated the data to Census Block Groups. Using a dot-density symbology, I created a visualization in which the locations of multiple trees did not overlap. There was an average of 6 trees per block group. Aggregating to the block group level allows researchers to conduct exploratory regressions in search of social factors that correlate highly with tree survival, or conversely, tree mortality.
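As a rough illustration of the aggregation step, the sketch below counts trees per block group once each record carries a block group ID. The addresses and the `block_group` IDs are made up for the example; in practice the ID would come from a spatial join in GIS software, which this sketch does not perform.

```python
from collections import Counter

# Hypothetical records: each tree is already tagged with a block group ID.
# The addresses and IDs are invented for illustration only.
trees = [
    {"address": "1200 S 9th St", "block_group": "BG-001"},
    {"address": "1200 S 9th St", "block_group": "BG-001"},
    {"address": "800 Wharton St", "block_group": "BG-002"},
    {"address": "S 10th St & Tasker St", "block_group": "BG-002"},
    {"address": "1500 Passyunk Ave", "block_group": "BG-002"},
]


def trees_per_block_group(records):
    """Count how many trees fall in each block group."""
    return Counter(record["block_group"] for record in records)


counts = trees_per_block_group(trees)
# Average number of trees across the block groups that contain any trees.
average = sum(counts.values()) / len(counts)
print(dict(counts))
print(average)
```

Counting by block group means that several trees sharing one recorded address no longer stack on a single point; they simply add to that block group's total.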

II) Qualitative Data Field Definitions

After initially reviewing the data, one qualitative field in particular posed a problem. The “Tree Not Found” field was intended to refer only to instances in which a dying or dead tree had been requested for removal by the nearest homeowner. However, surveying the trees of South Philly revealed an unexpected meaning for the field.

Occasionally, the PHS’ young street trees were difficult to distinguish from other neighborhood trees. A ‘not found’ mark therefore also covered instances in which the tree could not be located or identified. In other cases, the planting location was known, but the tree had been removed and was missing entirely, with the reason for removal unknown. Since the category had taken on a dual meaning, it became clear that without further details the ‘tree not found’ category (less than 4% of total trees) reflected neither the survival nor the death of a tree. After consulting representatives from the PHS, I removed records marked as ‘not found’ from the mortality analysis.
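The effect of dropping ambiguous records can be sketched in a few lines. This is only an illustration: the field names (`status`, `not_found`) and the sample records are assumptions, not the actual PHS schema or data.

```python
# Hypothetical survey records; field names are illustrative, not the PHS schema.
records = [
    {"tree_id": 1, "status": "alive", "not_found": False},
    {"tree_id": 2, "status": "dead", "not_found": False},
    {"tree_id": 3, "status": None, "not_found": True},  # fate unknown
    {"tree_id": 4, "status": "alive", "not_found": False},
]


def mortality_rate(rows):
    """Share of dead trees among the trees whose fate is actually known."""
    known = [row for row in rows if not row["not_found"]]
    if not known:
        return None  # no basis for a rate
    dead = sum(1 for row in known if row["status"] == "dead")
    return dead / len(known)


rate = mortality_rate(records)
print(rate)
```

Leaving the 'not found' record in the denominator would have counted it as a survivor, while counting it as dead would have inflated mortality; excluding it avoids both assumptions.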

This example demonstrates the importance of recording precise definitions of the different data categories. It also suggests that a category definition may evolve during the data collection process. This too is noteworthy and may affect the analyst’s strategy for manipulating the data and drawing conclusions. The periods just before and just after data collection are opportune times to create and revise a key for the various data categories.

2) Transcribing Data: Consistency is Key

Tree Checkers surveyed the young street trees planted as part of the Plant One Million project in their chosen or local Tree Tending Districts. Many volunteers generously spent their free time recording the tree checking data in a digital table. When more than one person volunteers their time to transcribe data, it is helpful to have a data entry key to minimize variation from record to record. Better yet, a customizable database application would enable the use of drop-downs, radio buttons and date fields instead of free-form text boxes and ensure consistency. This applies to both qualitative and spatial data.

Recording addresses in the appropriate format for geocoding is a great time-saver. Knowledge of the appropriate format may be used to create a geocoding database application. This geocoding key lists incorrect and corrected versions of addresses for geocoding. An additional note not covered in the pamphlet: record the street direction (N, S, E, or W) when applicable.
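A data entry key can also be applied after the fact. The sketch below shows one possible way to standardize addresses before geocoding. The mappings in `DIRECTIONS` and `SUFFIXES` are assumptions made for the example, not the contents of the geocoding key mentioned above, and a real geocoder may expect a different format.

```python
import re

# Hypothetical data entry key: maps common variations to one preferred form.
DIRECTIONS = {"north": "N", "south": "S", "east": "E", "west": "W"}
SUFFIXES = {"street": "St", "st.": "St", "avenue": "Ave", "ave.": "Ave"}
INTERSECTION_WORDS = {"and", "at", "@"}


def normalize_address(raw):
    """Collapse extra whitespace and standardize direction and suffix words."""
    words = re.sub(r"\s+", " ", raw.strip()).split(" ")
    cleaned = []
    for word in words:
        key = word.lower()
        if key in DIRECTIONS:
            cleaned.append(DIRECTIONS[key])
        elif key in SUFFIXES:
            cleaned.append(SUFFIXES[key])
        elif key in INTERSECTION_WORDS:
            cleaned.append("&")  # one consistent marker for intersections
        else:
            cleaned.append(word)
    return " ".join(cleaned)


print(normalize_address("  1200  south 9th street "))
print(normalize_address("Broad Street and Tasker st."))
```

A word-by-word lookup like this is deliberately naive: a street that is actually named "North Street" would be shortened too, so the cleaned values should still be checked against the originals.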

3) Cleaning Data: Communication, Efficiency Tips, and Converting Qualitative Fields to Binomials

Communication may be the most essential part of preparing data for spatial analysis. Just as those dedicated enough to collect and transcribe the data should record precise definitions of the data fields, it is important for the spatial analyst to inquire about the data with its intended use in mind. This communication creates a window of opportunity to correct any misconceptions and reduce the risk of error.

There is an array of different strategies for cleaning data. Some prefer to operate in Excel, others in ArcGIS. DataWrangler, a data-cleaning efficiency tool, is especially helpful for reformatting data. Slides 14 – 16 from the Data Cleaning and Visualization Tools for Nonprofits presentation discuss the pros and cons of three different strategies.

In ArcGIS, the process is simple. Below are steps to eliminate small oddities from the data set.

1) Replicate the category or “field” you plan on cleaning. This allows for double-checking the cleaned field beside the original field.

  - Create a new field of the same type as the original field (e.g. short, long, double, etc.)
  - Using Field Calculator, set the new field equal to the original field

2) Browse the unique values of the new field in the Select by Attribute tool. This will list the different variations in the field box.

3) Decide on the format of the preferred variation. For the remainder of the revisions, use the Select by Attribute and Field Calculator tools (a powerful duo).

  - Using Select by Attribute, select all of the variations that signify a single outcome or observation.
  - Right-click the new field header and open Field Calculator. Type the preferred variation and press OK. A window may pop up. Select OK again.

4) If the field only needs one minor change, it may be made in field calculator using the following command (no “select by…” necessary):

  - replace([FLD_name], "originalcharacter", "NewCharacter")

5) Remember to clear selection before repeating the process!

6) Double-check your new field against the original field to make sure there are no discrepancies.

7) Be aware of nulls. Nulls are often different from zeros and the like. It is important to preserve them.
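For readers who prefer a script, the same cleaning logic can be sketched outside ArcGIS in plain Python. This is not the ArcGIS workflow itself: the `litter` field, the sample values and the `PREFERRED` key are all assumptions made for illustration.

```python
# Hypothetical table: each row is a record and "litter" is the field to clean.
rows = [
    {"litter": "Yes"},
    {"litter": "yes "},
    {"litter": "Y"},
    {"litter": "No"},
    {"litter": "n"},
    {"litter": None},  # null: nothing was recorded
]

# Hypothetical key mapping each known variation to the preferred value.
PREFERRED = {"yes": "Yes", "y": "Yes", "no": "No", "n": "No"}

# Step 1: copy the original field so the two can be compared later.
for row in rows:
    row["litter_clean"] = row["litter"]

# Step 2: list the unique values to see which variations exist.
unique_values = {row["litter_clean"] for row in rows}

# Steps 3 to 5: replace every known variation with its preferred form.
for row in rows:
    value = row["litter_clean"]
    if value is None:
        continue  # Step 7: nulls stay null rather than becoming "No" or 0
    # Unknown values are left unchanged so they show up in later review.
    row["litter_clean"] = PREFERRED.get(value.strip().lower(), value)

# Step 6: collect the (original, cleaned) pairs that differ, for review.
changes = sorted(
    {(r["litter"], r["litter_clean"]) for r in rows if r["litter"] != r["litter_clean"]}
)
print(changes)
```

Keeping the original field untouched makes the final comparison possible, and skipping `None` keeps "nothing was recorded" distinct from a recorded "No".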

Once the fields are clean, data may be further simplified by converting faux-Boolean fields into real Boolean fields. Boolean or binomial fields allow for maximum flexibility when conducting spatial analyses as they can be used for both raster and vector forms.

8) Create a toolbox in your geodatabase. Right-click the toolbox and select make a new model.

9) Open the model’s diagrammatic view by right-clicking the model in the toolbox and selecting edit. Locate the appropriate tools and layers, dragging them into the model as you do so. To open and adjust a tool, double-click the shape. I added the fieldnames as parameters, both because they are subject to change and to simplify the process of double-checking.

10) If there are multiple outcomes or observations of interest in the same layer, select everything following the input feature (Soils). Copy and paste it. Change fieldnames (small light yellow circles, the tool variables) to the new variable of interest from left to right. Start by typing the new field name in the small yellow bubble attached to ‘Add Field.’ Select the same new field name from the dropdown menus. It may be necessary to select a new input feature from the dropdown menu, then back to the intended input to refresh the fields available in the Field Name dropdown menus. Adjust the select by attribute tool to highlight the variable of interest. Finally, connect the last circle from the original model (in this case ‘Selection cleared’) to the first tool of the new row.

11) Press run, and voilà! You have created a new binomial field, ready for mapping and spatial analysis.
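The conversion the model performs can be expressed compactly in code. The sketch below is a plain-Python illustration of turning a cleaned yes/no field into a binomial one, not a reproduction of the ModelBuilder model; the function name `to_binomial` and the sample values are hypothetical.

```python
def to_binomial(value, positive="Yes", negative="No"):
    """Return 1, 0, or None for a cleaned yes/no field value."""
    if value is None:
        return None  # keep nulls distinct from zeros
    if value == positive:
        return 1
    if value == negative:
        return 0
    # Anything else means the field was not fully cleaned beforehand.
    raise ValueError(f"Unexpected value: {value!r}")


cleaned = ["Yes", "No", None, "Yes"]
binomial = [to_binomial(value) for value in cleaned]
print(binomial)
```

Raising an error on unexpected values is a design choice for this sketch: it surfaces leftover variations instead of silently coding them as 0.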

4) Final Recommendations

Improving Qualitative Data Collection: Conduct research in advance to identify best practices in tree-related data collection. Design the data collection process with the future use and utility of data in mind. Consider technical data collection methods, such as a mobile device running an application like OpenTreeMap or FileMaker. Last, create a process that standardizes data entry for use by multiple individuals.
Improving the Collection of Spatial Identifiers: Instead of relying on geocoding, which associates many trees with a single intersection or parcel address, record each tree’s unique set of latitude and longitude coordinates. Most smartphones have the capacity to identify precise latitude and longitude coordinates.
Consider Additional Scientific Methods for Assessing Tree Health: By the time data collection begins most of the decisions about the data have already been made, namely which observations to include in the study. The results of a project or study are most dependent on this delicate process. To this end, I will conclude with one last anecdote:

While joining a representative from PHS in the field survey, I noticed that the over- and underwatering of trees was measured only by observing the present wetness of the soil. While collecting data, we saw that all of the soil was wet, as it had rained much of the previous week. The strategy for measuring the entire category (by wetness of soil) was weather dependent, which rendered the entire category irrelevant.

Leaves dying towards the top of a tree suggest that the tree is likely underwatered, while leaves dying from the bottom suggest that it is overwatered. This observation would prove a more reliable indicator of whether a tree was consistently overwatered or underwatered (as opposed to soil wetness, which only indicates how the tree was watered near the moment of data collection). This experience demonstrated that guessing which factors in the surrounding environment may affect the dependent variable is ineffective. When choosing variables to survey, it’s best to start by observing characteristics directly associated with the dependent variable that display symptoms of particular sources of stress. Trends in these observations will then hint at other explanatory variables and future avenues of study.

Final note: Collecting qualitative data can require a subjective judgment of a characteristic (such as leaf color or texture). All of the data must be judged on the same scale. In the context of this project, it would be practical to bring similar young healthy leaves of each species along for comparison.

In sum, thoughtful consideration and design of collection techniques in advance helps ensure that the lengthy time spent on data collection will not be wasted. These recommendations will help to add value to the data and maximize its utility in future analysis.