As many WPDx users have noted, the WPDx standard includes several open text fields, including #water_tech, #water_source, and #status. Unlike fields that require a choice from a list of values or a specific number format, users can provide any text in these fields to describe a type of water point and its status.

While open text has challenges, keeping these fields open allows WPDx to share an incredible amount of rich and useful information. If we required certain terms, any organization that wasn’t ready to use those terms (or didn’t use them in the past) would be limited in what data it could share, and we would all miss out on great data as a result. Further, generic terms (e.g. “hand pump”) can hide rich information such as the type of hand pump. If you wanted to compare Afridev pumps against India Mark II pumps, that distinction would be lost behind a generic term like “hand pump”. For these reasons, the open text fields are valuable to WPDx, even though they come with difficulties of their own.

A New Tool:

The main challenge of these open text fields is that the values can be difficult to categorize. With thousands of unique values for each of these fields, it can be difficult to compare one type of water point with another, or to see how common a certain problem might be. To address this challenge, WPDx is rolling out a new categorization tool, developed in partnership with HP Enterprise.

This tool allows you to define your own categories and apply them to the WPDx dataset. You will “train” the tool by uploading a dataset with some sample WPDx values and the corresponding categories you want those values to fit into. You can then apply this training document to the entire dataset. The tool is smart enough to make educated guesses on slight variations, whether that be misspellings or different ways of capturing the same information.
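To make the idea of training on sample values concrete, here is a minimal sketch in Python. It is not the tool's actual algorithm, which is not described here; it simply assumes that a new value can be matched to the closest training value by string similarity, using the standard library's difflib. The training pairs, the category names, and the 0.8 cutoff are all made up for illustration.

```python
from difflib import get_close_matches

# Hypothetical training pairs: raw open-text value -> category chosen by the user.
TRAINING = {
    "afridev": "Hand pump",
    "india mark ii": "Hand pump",
    "protected spring": "Spring",
    "borehole": "Borehole",
}


def categorize(value, training=TRAINING, cutoff=0.8):
    """Return the category of the closest training value, or None if nothing is close."""
    key = value.strip().lower()
    # Exact match after normalizing case and surrounding whitespace.
    if key in training:
        return training[key]
    # Otherwise fall back to the single most similar training value, if any
    # reaches the similarity cutoff (an assumed threshold, not the tool's).
    matches = get_close_matches(key, list(training), n=1, cutoff=cutoff)
    return training[matches[0]] if matches else None


print(categorize("Afridv"))          # a misspelling of "Afridev"
print(categorize("India Mark 2"))    # a different way of writing "India Mark II"
print(categorize("rainwater tank"))  # nothing similar in the training pairs
```

Values that are not close to anything in the training pairs come back as None, which is a reasonable signal that the training document needs another example.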

With this tool, users can benefit from the rich and inclusive nature of the open text fields while still being able to categorize the data easily to meet their needs.

Moving Forward:

While this tool helps users organize the diversity of data that currently exists, it would be even more efficient to begin standardizing the actual data in the future. With WPDx’s new process for updating the standard, there is now an opportunity for interested organizations to propose recommended vocabularies or other approaches to help standardize open text fields moving forward. Click here to learn how you can propose changes to the standard and help improve the way that water point data is standardized.

How To Use the Tool:

  1. You can use the template here to define your own categories for #water_source and #water_tech, or here to create categories for #status.
  2. Visit Categories.WaterPointData.org to access the new tool.
  3. Select “Water Source Types” or “Status” from the top black menu depending on what you want to categorize.
  4. Upload your training document in the top box.
  5. Visit data.waterpointdata.org and download the data you want to categorize.
  6. In the downloaded file, keep only the #row_id column and the data you want to categorize:
    • #water_source and #water_tech; or
    • #status
  7. Click “Train and Classify”.
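If you would rather trim the downloaded file with a script than by hand, the column selection in step 6 can be sketched in Python as follows. This assumes the download is a CSV file whose header row uses the hashtag names shown above; the sample rows and the keep_columns helper are invented for illustration and are not part of WPDx.

```python
import csv
import io


def keep_columns(csv_text, columns):
    """Return CSV text containing only the named columns, in the given order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in columns if c not in (reader.fieldnames or [])]
    if missing:
        raise KeyError(f"columns not found in header: {missing}")
    out = io.StringIO()
    # extrasaction="ignore" drops every column that is not listed in fieldnames.
    writer = csv.DictWriter(
        out, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Made-up sample standing in for a file downloaded from data.waterpointdata.org.
sample = (
    "#row_id,#country_name,#water_source,#water_tech,#status\n"
    "1,Kenya,Borehole,Afridev,Functional\n"
    "2,Uganda,Protected spring,,Broken pump handle\n"
)

status_only = keep_columns(sample, ["#row_id", "#status"])
source_and_tech = keep_columns(sample, ["#row_id", "#water_source", "#water_tech"])
print(status_only)
print(source_and_tech)
```

For a real download, read the file's contents into a string (or adapt the helper to take file objects) and write the result to a new file, so the original download stays untouched.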

If you have any questions or feedback, contact us at info@waterpointdata.org.

Credits:

Thank you to HP Enterprise and the Top Coder community for their support in developing the tool, and to Joe Cook and Sheena Lahren for providing outstanding sample data for exploring the status categorization.