Programmatic Retrieval of IRS Form 990 Data - A Browser Automation Approach
Posted on April 8, 2025
Abstract
Access to nonprofit financial data is essential for research, transparency, and sector analysis, yet programmatic retrieval of this data remains challenging. Building upon previous work mapping the IRS Form 990 data repository structure, this paper presents a practical methodology for programmatically retrieving nonprofit financial data using browser automation. The research demonstrates a solution that successfully navigates the complex web interfaces of ProPublica’s Nonprofit Explorer to download XML filings for multiple organizations across multiple years. This approach overcomes common limitations of traditional web scraping methods by simulating human browsing behavior, resulting in successful data acquisition where previous methods failed. This methodology enables researchers and analysts to efficiently collect comprehensive nonprofit financial data for further analysis without manual intervention.
Introduction
The IRS Form 990 is a critical source of information on nonprofit financial performance, governance, and operational activities. While this data is theoretically public, practical access remains challenging due to the structure of data repositories and the interfaces provided to access them. Previous work mapped the structure of the IRS Form 990 data repository, identifying a hierarchical organization with year-based grouping and a two-tier index system of CSV and ZIP files.1
Direct programmatic access to this data remains difficult due to several factors. The large size of index files (hundreds of megabytes) creates bandwidth and storage challenges for researchers. Temporary access tokens required for direct XML retrieval limit the viability of persistent hyperlinks. Complex navigation paths through web interfaces introduce additional barriers to automation. Anti-scraping measures employed by data providers frequently block simple request-based approaches.2
These challenges create substantial barriers to efficient data collection, limiting research capabilities and transparency in the nonprofit sector. While third-party platforms like ProPublica’s Nonprofit Explorer provide more accessible interfaces, programmatically interacting with these platforms presents its own set of challenges.
Previous research on nonprofit financial data access has largely relied on manual collection or specialized APIs. Ottinger highlighted the limitations of third-party data providers, noting incomplete coverage and restricted field availability.2 Mitchell proposed a targeted retrieval approach using index files but acknowledged the challenges of large file sizes and complex repository structures.3
Direct access to primary IRS data sources has been attempted by several researchers. Williams and Akaakar documented the delays and barriers to accessing Form 990 data, describing them as “real risks for nonprofits” and calling for improved access mechanisms.1 Barreto noted the consistent gaps in data availability affecting comprehensive analysis when using third-party sources.4
Web scraping approaches have been employed with varying degrees of success. Mak utilized simple HTTP requests to extract data but reported consistent failures due to access controls and temporary tokens.5 More sophisticated approaches using headless browsers were described by Mitchel, though implementation details were limited.6
Traditional web scraping methods using simple HTTP requests consistently failed to retrieve this data due to several factors. JavaScript-dependent content that simple requests cannot process presents a significant barrier. Authentication requirements that reject automated access prevent direct data collection. Temporary access tokens that expire quickly limit the viability of persistent links. Dynamic content that changes based on user interaction further complicates retrieval attempts.
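To make the contrast concrete, the following minimal sketch shows the request-based pattern these traditional methods rely on; depending on the provider's access controls, such a request may be rejected outright or return only the initial server-rendered HTML, without the links that client-side JavaScript later injects (the URL is shown purely for illustration):

import requests

# A plain HTTP request retrieves only the server-rendered HTML.
# Content injected by JavaScript never appears in resp.text, and
# automated requests may be blocked before any content is returned.
resp = requests.get(
    "https://projects.propublica.org/nonprofits/organizations/13-6213516",
    timeout=30,
)
print(resp.status_code)    # e.g. 403 when automated access is rejected
print("XML" in resp.text)  # XML links may be absent from the raw response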
Browser automation through Selenium offers a promising alternative by addressing these limitations. Unlike simple HTTP requests, Selenium provides a full browser environment that renders JavaScript, maintains session state, and can interact with dynamic content. This approach simulates human browsing behavior more convincingly, potentially bypassing anti-scraping measures.7
The Selenium approach executes in a real browser, rendering pages exactly as a human user would see them and allowing interaction with elements that might be invisible to traditional scrapers. This capability is particularly valuable for navigating the multi-step processes required to access XML filings on platforms like ProPublica’s Nonprofit Explorer.
This study builds upon previous mapping work to develop and demonstrate a practical methodology for programmatically retrieving nonprofit financial data using browser automation techniques. Specifically, the focus is on using Selenium WebDriver to navigate ProPublica’s Nonprofit Explorer and extract XML filings for target organizations across multiple years.
Experimental
The experimental approach employed Selenium WebDriver, a browser automation framework that allows programmatic control of a web browser. This approach was implemented in Python using the Chrome browser. The implementation consisted of several key components designed to navigate the ProPublica Nonprofit Explorer interface and extract XML filings for target organizations.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
import os
from pathlib import Path
import json
Code 1. Python imports required for the Selenium browser automation implementation, including the core Selenium WebDriver package, Chrome driver manager, and supporting utilities for file operations and data serialization.
# Target EINs
TARGET_EINS = {
    "13-6213516": "American Civil Liberties Union Foundation",
    "53-0196605": "American National Red Cross",
    "13-1635294": "United Way Worldwide",
    "13-1644147": "Planned Parenthood Federation of America",
    "53-0242652": "Nature Conservancy"
}

# Create directories for data
DATA_DIR = Path("nonprofit_data")
DATA_DIR.mkdir(exist_ok=True)
Code 2. Definition of target organizations by Employer Identification Number (EIN) and name, representing major nonprofits across different sectors, and creation of the base directory for storing downloaded data.
def setup_driver():
    """Set up the Selenium WebDriver."""
    options = webdriver.ChromeOptions()
    options.add_argument("--window-size=1920,1080")
    options.add_argument("--disable-notifications")
    # Uncomment the next line to run without a visible browser window
    # options.add_argument("--headless")
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    return driver
Code 3. Configuration of the Selenium WebDriver with appropriate options for window size and notification handling, allowing for either visible or headless operation.
def get_propublica_org_url(ein):
    """Convert EIN to ProPublica Nonprofit Explorer URL."""
    # ProPublica formats EINs with a hyphen after the first 2 digits
    if '-' not in ein:
        ein = ein[:2] + '-' + ein[2:]
    return f"https://projects.propublica.org/nonprofits/organizations/{ein}"
Code 4. Function to generate the ProPublica Nonprofit Explorer URL for a given organization based on its EIN, handling the required hyphenation format.
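For illustration, the helper normalizes both hyphenated and bare EINs to the same URL:

# Example usage: both forms resolve to the same organization URL
print(get_propublica_org_url("13-6213516"))
print(get_propublica_org_url("136213516"))
# Both print: https://projects.propublica.org/nonprofits/organizations/13-6213516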
def download_xml_files(driver, ein, org_name):
    """Download XML files for an organization using Selenium."""
    org_url = get_propublica_org_url(ein)

    print(f"\nProcessing {org_name} (EIN: {ein})...")
    print(f"Navigating to: {org_url}")

    # Create a directory for this organization
    clean_ein = ein.replace('-', '')
    org_dir = DATA_DIR / clean_ein
    org_dir.mkdir(exist_ok=True)

    # Navigate to the organization page
    driver.get(org_url)
    time.sleep(3)  # Wait for page to load

    # Look for XML links
    xml_links = driver.find_elements(By.XPATH, "//a[contains(@href, 'xml') or contains(text(), 'XML')]")

    org_results = {}

    if xml_links:
        print(f"Found {len(xml_links)} potential XML links")

        # Open each XML link in a new tab
        for i, link in enumerate(xml_links):
            try:
                # Try to extract year from link text or surrounding elements
                link_text = link.text
                identifier = f"unknown_{i+1}"

                print(f"Processing link: {link_text}")

                # Open link in a new tab
                driver.execute_script("window.open(arguments[0]);", link.get_attribute('href'))

                # Switch to the new tab
                driver.switch_to.window(driver.window_handles[-1])

                # Wait for the page to load
                time.sleep(5)

                # Extract XML content
                xml_content = driver.page_source
                xml_file_path = org_dir / f"{identifier}_filing.xml"

                with open(xml_file_path, 'w', encoding='utf-8') as f:
                    f.write(xml_content)

                org_results[identifier] = {
                    "status": "downloaded",
                    "file_path": str(xml_file_path)
                }
                print(f"Downloaded filing to {xml_file_path}")

                # Close the tab and switch back to the main tab
                driver.close()
                driver.switch_to.window(driver.window_handles[0])

                # Be nice to the server
                time.sleep(2)
            except Exception as e:
                print(f"Error processing link: {e}")
                # Make sure we're back on the main tab
                if len(driver.window_handles) > 1:
                    driver.switch_to.window(driver.window_handles[0])
    else:
        print("No XML links found for this organization")

    return org_results
Code 5. Core function for downloading XML files for a specific organization, handling the navigation to the organization’s page, identification of XML links, and sequential processing of each link in a new browser tab.
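The fixed time.sleep() pauses in Code 5 are the simplest waiting strategy. WebDriverWait and expected_conditions are already imported in Code 1 but left unused; a possible refinement, sketched below as an alternative rather than the method used in this study, is to wait explicitly for the XML links to appear instead of sleeping for a fixed interval:

from selenium.common.exceptions import TimeoutException

def find_xml_links(driver, timeout=10):
    """Wait up to `timeout` seconds for XML links to load; return [] on timeout."""
    try:
        return WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located(
                (By.XPATH, "//a[contains(@href, 'xml') or contains(text(), 'XML')]")
            )
        )
    except TimeoutException:
        return []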
def main():
    driver = setup_driver()
    summary_data = {}

    try:
        for ein, org_name in TARGET_EINS.items():
            org_results = download_xml_files(driver, ein, org_name)

            summary_data[ein] = {
                "name": org_name,
                "results": org_results
            }

            # Be nice to the server between organizations
            time.sleep(3)
    finally:
        # Save the summary
        summary_path = DATA_DIR / "download_summary.json"
        with open(summary_path, "w") as f:
            json.dump(summary_data, f, indent=2)

        print(f"\nDownload summary saved to {summary_path}")

        # Close the browser
        driver.quit()

if __name__ == "__main__":
    main()
Code 6. Main execution function that initializes the browser, processes each target organization sequentially, aggregates the results, and ensures proper cleanup of resources.
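Assuming the components above are combined into a single script (the filename here is arbitrary), a collection run installs the two third-party dependencies and launches from the command line, populating nonprofit_data/ with one subdirectory per organization:

pip install selenium webdriver-manager
python download_990_filings.py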
Results
Table 1. Number of XML filings retrieved for each target organization. This table shows the successful retrieval of multiple filings for each nonprofit organization included in the study, with a total of 56 filings across all organizations.
| Organization | Number of XML Filings Retrieved |
|---|---|
| American Civil Liberties Union Foundation | 12 |
| American National Red Cross | 11 |
| United Way Worldwide | 11 |
| Planned Parenthood Federation of America | 11 |
| Nature Conservancy | 11 |
{
  "13-6213516": {
    "name": "American Civil Liberties Union Foundation",
    "results": {
      "unknown_1": {
        "status": "downloaded",
        "file_path": "nonprofit_data/136213516/unknown_1_filing.xml"
      },
      "unknown_2": {
        "status": "downloaded",
        "file_path": "nonprofit_data/136213516/unknown_2_filing.xml"
      },
      "unknown_3": {
        "status": "downloaded",
        "file_path": "nonprofit_data/136213516/unknown_3_filing.xml"
      }
      // Additional entries omitted for brevity
    }
  }
  // Additional organizations omitted for brevity
}
Code 7. Sample of the download_summary.json file showing the structure of the results documentation for the American Civil Liberties Union Foundation. The summary file documents each downloaded filing with its status and file path, providing a programmatic index to the collected data.
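Because the summary file mirrors the structure shown above, per-organization filing counts (such as those reported in Table 1) can be recomputed directly from it; a minimal sketch:

import json
from pathlib import Path

# Tally successfully downloaded filings per organization from the summary
summary = json.loads(Path("nonprofit_data/download_summary.json").read_text())
for ein, org in summary.items():
    count = sum(1 for r in org["results"].values() if r["status"] == "downloaded")
    print(f"{org['name']} ({ein}): {count} filings")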
Discussion
The browser automation approach proved effective for retrieving nonprofit financial data. As shown in Table 1, the method successfully retrieved XML filings for all five target organizations, 56 filings in total: 12 for the American Civil Liberties Union Foundation and 11 for each of the other four organizations. Because each filing was captured from the browser's rendered page source, the saved files preserve the surrounding markup and remain suitable for subsequent parsing and analysis.
The success of this approach stands in contrast to previous attempts using simple HTTP requests, which consistently failed due to the complex navigation pathways and dynamic content of the ProPublica interface. The Selenium-based approach effectively overcame these limitations by simulating human browsing behavior, navigating through multiple pages and handling JavaScript-generated content.
The sequential processing of each XML link in a new browser tab (Code 5) proved to be a robust strategy for handling the navigation flow. This approach allowed for isolation of each download process, preventing interference between different filings and providing clear error boundaries. The use of appropriate timing delays between actions (time.sleep() calls) helped mitigate potential rate limiting and ensured stable page loading, contributing to the overall reliability of the process.
The implementation of a standardized directory structure and consistent file naming convention, as documented in Code 7, facilitates programmatic access to the collected data. While the current implementation does not identify the tax year associated with each filing (falling back to generic identifiers such as "unknown_1" and "unknown_2"), the structured organization provides a foundation for further metadata enrichment and analysis.
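For reference, the resulting on-disk layout looks roughly like the following, with one directory per hyphen-stripped EIN and the summary file at the root (entries abbreviated):

nonprofit_data/
├── 136213516/
│   ├── unknown_1_filing.xml
│   ├── unknown_2_filing.xml
│   └── ...
├── 530196605/
│   └── ...
└── download_summary.json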
The browser automation approach offers several advantages over traditional web scraping or direct API access. By simulating human browsing behavior, the method successfully navigated access controls and authentication requirements that blocked simpler HTTP request methods. The full browser environment allowed proper rendering and interaction with JavaScript-generated content that would be invisible to basic scraping tools. The approach can adapt to changes in page structure or interface design with minimal modifications, improving long-term viability. Complex multi-step processes like navigating from an organization page to specific XML filings were handled more elegantly than would be possible with traditional scraping.
Despite these advantages, the approach has several limitations that warrant consideration. Browser automation is more resource-intensive than simple HTTP requests, requiring more memory and processing power. The need to simulate realistic browsing behavior introduces delays and increases execution time compared to direct API access. The approach depends on browser drivers and compatible browser versions, increasing the complexity of deployment and maintenance. While more robust than basic scraping, the method still relies on specific page elements and navigation paths that could change, potentially requiring updates to maintain functionality.
Ethical considerations are important in any data collection effort. The implementation includes deliberate measures to respect rate limits and avoid excessive server load, adhering to the spirit of the service’s terms of use. By including appropriate delays between actions and limiting parallel requests, the automation places minimal burden on the hosting infrastructure. The collection of this public data serves legitimate research and transparency purposes, supporting nonprofit sector accountability.
Several promising directions for future work emerge from this study. Enhancing the script to correctly identify the tax year associated with each filing would improve data organization and usability. Developing specialized parsers for Form 990 XML to extract key financial metrics like Program Efficiency (PE) and Fundraising Efficiency (FE) would enable more comprehensive analysis. Scaling the approach to cover a broader range of organizations and years would support more extensive sector studies. Comparing data retrieved through this method with other sources would help evaluate completeness and accuracy.
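As a starting point for such parsers, the sketch below computes Program Efficiency as program service expenses divided by total functional expenses. Two assumptions apply: the saved files may first need the browser's surrounding HTML stripped to recover the raw XML, and the element names used (TotalFunctionalExpensesGrp, TotalAmt, ProgramServicesAmt) follow the modern IRS e-file schema and vary across schema versions, so they are illustrative rather than definitive:

import xml.etree.ElementTree as ET

# IRS e-file XML declares this default namespace
NS = {"irs": "http://www.irs.gov/efile"}

def program_efficiency(xml_path):
    """Return program service expenses / total expenses, or None if unavailable."""
    root = ET.parse(xml_path).getroot()
    # Functional expense totals (element names assume the modern 990 schema)
    grp = root.find(".//irs:TotalFunctionalExpensesGrp", NS)
    if grp is None:
        return None
    total = float(grp.findtext("irs:TotalAmt", default="0", namespaces=NS))
    program = float(grp.findtext("irs:ProgramServicesAmt", default="0", namespaces=NS))
    return program / total if total else None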
The ability to reliably access nonprofit financial data in a programmatic fashion opens new possibilities for research, transparency, and accountability in the nonprofit sector. The methodology provides a foundation for more sophisticated analyses of organizational performance and financial metrics, potentially enabling insights that were previously hindered by data access limitations.