Programmatic Retrieval of IRS Form 990 Data - A Browser Automation Approach

Posted on April 8, 2025

Abstract

Access to nonprofit financial data is essential for research, transparency, and sector analysis, yet programmatic retrieval of this data remains challenging. Building upon previous work mapping the IRS Form 990 data repository structure, this paper presents a practical methodology for programmatically retrieving nonprofit financial data using browser automation. The research demonstrates a solution that successfully navigates the complex web interfaces of ProPublica’s Nonprofit Explorer to download XML filings for multiple organizations across multiple years. This approach overcomes common limitations of traditional web scraping methods by simulating human browsing behavior, resulting in successful data acquisition where previous methods failed. This methodology enables researchers and analysts to efficiently collect comprehensive nonprofit financial data for further analysis without manual intervention.

Introduction

The IRS Form 990 is a critical source of information on nonprofit financial performance, governance, and operational activities. While this data is theoretically public, practical access remains challenging due to the structure of data repositories and the interfaces provided to access them. Previous work mapped the structure of the IRS Form 990 data repository, identifying a hierarchical organization with year-based grouping and a two-tier index system of CSV and ZIP files.1

Direct programmatic access to this data remains difficult due to several factors. The large size of index files (hundreds of megabytes) creates bandwidth and storage challenges for researchers. Temporary access tokens required for direct XML retrieval limit the viability of persistent hyperlinks. Complex navigation paths through web interfaces introduce additional barriers to automation. Anti-scraping measures employed by data providers frequently block simple request-based approaches.2

These challenges create substantial barriers to efficient data collection, limiting research capabilities and transparency in the nonprofit sector. While third-party platforms like ProPublica’s Nonprofit Explorer provide more accessible interfaces, programmatically interacting with these platforms presents its own set of challenges.

Previous research on nonprofit financial data access has largely relied on manual collection or specialized APIs. Ottinger highlighted the limitations of third-party data providers, noting incomplete coverage and restricted field availability.2 Mitchell proposed a targeted retrieval approach using index files but acknowledged the challenges of large file sizes and complex repository structures.3

Direct access to primary IRS data sources has been attempted by several researchers. Williams and Akaakar documented the delays and barriers to accessing Form 990 data, describing them as “real risks for nonprofits” and calling for improved access mechanisms.1 Barreto noted the consistent gaps in data availability affecting comprehensive analysis when using third-party sources.4

Web scraping approaches have been employed with varying degrees of success. Mak utilized simple HTTP requests to extract data but reported consistent failures due to access controls and temporary tokens.5 More sophisticated approaches using headless browsers were described by Mitchel, though implementation details were limited.6

Traditional web scraping methods using simple HTTP requests consistently failed to retrieve this data due to several factors. JavaScript-dependent content that simple requests cannot process presents a significant barrier. Authentication requirements that reject automated access prevent direct data collection. Temporary access tokens that expire quickly limit the viability of persistent links. Dynamic content that changes based on user interaction further complicates retrieval attempts.
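
To make the contrast concrete, the following sketch shows the kind of simple request-based attempt that runs into these barriers; it fetches an organization page as static HTML (using the same URL pattern adopted later in Code 4) and scans for XML references, with no ability to execute JavaScript or maintain a browsing session. The exact failure mode observed will vary by site and over time.

import requests

# Minimal request-based attempt, shown only to illustrate the limitations above.
# Content injected by JavaScript, gated by access controls, or signed with
# temporary tokens is either absent from this response or quickly becomes stale.
url = "https://projects.propublica.org/nonprofits/organizations/13-6213516"
response = requests.get(url, timeout=30)

if response.status_code != 200:
    print(f"Blocked or unavailable: HTTP {response.status_code}")
else:
    # Naive scan of the raw HTML; dynamically rendered links are invisible here
    xml_mentions = [line for line in response.text.splitlines() if "xml" in line.lower()]
    print(f"Found {len(xml_mentions)} lines mentioning XML (possibly zero, possibly unusable)")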

Browser automation through Selenium offers a promising alternative by addressing these limitations. Unlike simple HTTP requests, Selenium provides a full browser environment that renders JavaScript, maintains session state, and can interact with dynamic content. This approach simulates human browsing behavior more convincingly, potentially bypassing anti-scraping measures.7

The Selenium approach executes in a real browser, rendering pages exactly as a human user would see them and allowing interaction with elements that might be invisible to traditional scrapers. This capability is particularly valuable for navigating the multi-step processes required to access XML filings on platforms like ProPublica’s Nonprofit Explorer.

This study builds upon previous mapping work to develop and demonstrate a practical methodology for programmatically retrieving nonprofit financial data using browser automation techniques. Specifically, the focus is on using Selenium WebDriver to navigate ProPublica’s Nonprofit Explorer and extract XML filings for target organizations across multiple years.

Experimental

The experimental approach employed Selenium WebDriver, a browser automation framework that allows programmatic control of a web browser. This approach was implemented in Python using the Chrome browser. The implementation consisted of several key components designed to navigate the ProPublica Nonprofit Explorer interface and extract XML filings for target organizations.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
import os
from pathlib import Path
import json

Code 1. Python imports required for the Selenium browser automation implementation, including the core Selenium WebDriver package, Chrome driver manager, and supporting utilities for file operations and data serialization.

# Target EINs
TARGET_EINS = {
    "13-6213516": "American Civil Liberties Union Foundation",
    "53-0196605": "American National Red Cross",
    "13-1635294": "United Way Worldwide",
    "13-1644147": "Planned Parenthood Federation of America",
    "53-0242652": "Nature Conservancy"
}

# Create directories for data
DATA_DIR = Path("nonprofit_data")
DATA_DIR.mkdir(exist_ok=True)

Code 2. Definition of target organizations by Employer Identification Number (EIN) and name, representing major nonprofits across different sectors, and creation of the base directory for storing downloaded data.

def setup_driver():
    """Set up the Selenium WebDriver."""
    options = webdriver.ChromeOptions()
    options.add_argument("--window-size=1920,1080")
    options.add_argument("--disable-notifications")
    
    # Comment out the next line if you want to see the browser in action
    # options.add_argument("--headless")
    
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    return driver

Code 3. Configuration of the Selenium WebDriver with appropriate options for window size and notification handling, allowing for either visible or headless operation.

def get_propublica_org_url(ein):
    """Convert EIN to ProPublica Nonprofit Explorer URL."""
    # ProPublica formats EINs with a hyphen after the first 2 digits
    if '-' not in ein:
        ein = ein[:2] + '-' + ein[2:]
    return f"https://projects.propublica.org/nonprofits/organizations/{ein}"

Code 4. Function to generate the ProPublica Nonprofit Explorer URL for a given organization based on its EIN, handling the required hyphenation format.

def download_xml_files(driver, ein, org_name):
    """Download XML files for an organization using Selenium."""
    org_url = get_propublica_org_url(ein)
    
    print(f"\nProcessing {org_name} (EIN: {ein})...")
    print(f"Navigating to: {org_url}")
    
    # Create a directory for this organization
    clean_ein = ein.replace('-', '')
    org_dir = DATA_DIR / clean_ein
    org_dir.mkdir(exist_ok=True)
    
    # Navigate to the organization page
    driver.get(org_url)
    time.sleep(3)  # Wait for page to load
    
    # Look for XML links
    xml_links = driver.find_elements(By.XPATH, "//a[contains(@href, 'xml') or contains(text(), 'XML')]")
    
    org_results = {}
    
    if xml_links:
        print(f"Found {len(xml_links)} potential XML links")
        
        # Open each XML link in a new tab
        for i, link in enumerate(xml_links):
            try:
                # Year identification is not implemented here (see Discussion);
                # fall back to a positional placeholder identifier
                link_text = link.text
                identifier = f"unknown_{i+1}"
                
                print(f"Processing link: {link_text}")
                
                # Open link in a new tab
                driver.execute_script("window.open(arguments[0]);", link.get_attribute('href'))
                
                # Switch to the new tab
                driver.switch_to.window(driver.window_handles[-1])
                
                # Wait for the page to load
                time.sleep(5)
                
                # Save the rendered page source (the filing as displayed by the browser)
                xml_content = driver.page_source
                xml_file_path = org_dir / f"{identifier}_filing.xml"
                
                with open(xml_file_path, 'w', encoding='utf-8') as f:
                    f.write(xml_content)
                
                org_results[identifier] = {
                    "status": "downloaded",
                    "file_path": str(xml_file_path)
                }
                print(f"Downloaded filing to {xml_file_path}")
                
                # Close the tab and switch back to the main tab
                driver.close()
                driver.switch_to.window(driver.window_handles[0])
                
                # Be nice to the server
                time.sleep(2)
                
            except Exception as e:
                print(f"Error processing link: {e}")
                # Make sure we're back on the main tab
                if len(driver.window_handles) > 1:
                    driver.switch_to.window(driver.window_handles[0])
    else:
        print("No XML links found for this organization")
    
    return org_results

Code 5. Core function for downloading XML files for a specific organization, handling the navigation to the organization’s page, identification of XML links, and sequential processing of each link in a new browser tab.

def main():
    driver = setup_driver()
    summary_data = {}
    
    try:
        for ein, org_name in TARGET_EINS.items():
            org_results = download_xml_files(driver, ein, org_name)
            
            summary_data[ein] = {
                "name": org_name,
                "results": org_results
            }
            
            # Be nice to the server between organizations
            time.sleep(3)
        
    finally:
        # Save the summary
        summary_path = DATA_DIR / "download_summary.json"
        with open(summary_path, "w") as f:
            json.dump(summary_data, f, indent=2)
        
        print(f"\nDownload summary saved to {summary_path}")
        
        # Close the browser
        driver.quit()

if __name__ == "__main__":
    main()

Code 6. Main execution function that initializes the browser, processes each target organization sequentially, aggregates the results, and ensures proper cleanup of resources.

Results

Table 1. Number of XML filings retrieved for each target organization. This table shows the successful retrieval of multiple filings for each nonprofit organization included in the study, with a total of 56 filings across all organizations.

Organization                                    Number of XML Filings Retrieved
American Civil Liberties Union Foundation       12
American National Red Cross                     11
United Way Worldwide                            11
Planned Parenthood Federation of America        11
Nature Conservancy                              11

{
  "13-6213516": {
    "name": "American Civil Liberties Union Foundation",
    "results": {
      "unknown_1": {
        "status": "downloaded",
        "file_path": "nonprofit_data/136213516/unknown_1_filing.xml"
      },
      "unknown_2": {
        "status": "downloaded",
        "file_path": "nonprofit_data/136213516/unknown_2_filing.xml"
      },
      "unknown_3": {
        "status": "downloaded",
        "file_path": "nonprofit_data/136213516/unknown_3_filing.xml"
      }
      // Additional entries omitted for brevity
    }
  }
  // Additional organizations omitted for brevity
}

Code 7. Sample of the download_summary.json file showing the structure of the results documentation for the American Civil Liberties Union Foundation. The summary file documents each downloaded filing with its status and file path, providing a programmatic index to the collected data.
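
Because every download is recorded in this file, it can serve as the entry point for downstream processing. The following is a minimal usage sketch, assuming the directory layout produced by Code 5 and Code 6.

import json
from pathlib import Path

# Load the run summary and enumerate the filings saved for each organization
summary_path = Path("nonprofit_data") / "download_summary.json"
with open(summary_path) as f:
    summary = json.load(f)

for ein, org in summary.items():
    downloaded = [entry["file_path"] for entry in org["results"].values()
                  if entry["status"] == "downloaded"]
    print(f"{org['name']} ({ein}): {len(downloaded)} filings saved")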

Discussion

The browser automation approach proved effective for retrieving nonprofit financial data. As shown in Table 1, the method successfully retrieved XML filings for all five target organizations, with a total of 56 XML filings collected. The American Civil Liberties Union Foundation yielded 12 filings, while each of the other organizations provided 11. Because each filing was captured from the browser's rendered page source (Code 5), the saved files preserve the markup as displayed, which can be parsed in subsequent analysis.

The success of this approach stands in contrast to previous attempts using simple HTTP requests, which consistently failed due to the complex navigation pathways and dynamic content of the ProPublica interface. The Selenium-based approach effectively overcame these limitations by simulating human browsing behavior, navigating through multiple pages and handling JavaScript-generated content.

The sequential processing of each XML link in a new browser tab (Code 5) proved to be a robust strategy for handling the navigation flow. This approach allowed for isolation of each download process, preventing interference between different filings and providing clear error boundaries. The use of appropriate timing delays between actions (time.sleep() calls) helped mitigate potential rate limiting and ensured stable page loading, contributing to the overall reliability of the process.
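
The fixed time.sleep() delays could also be replaced or supplemented with explicit waits, which Code 1 already imports but the current implementation does not use. The sketch below shows one way to do this; the ten-second timeout is an arbitrary choice, not a value taken from the implementation above.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def wait_for_xml_links(driver, timeout=10):
    """Wait until at least one candidate XML link is present, rather than sleeping a fixed interval."""
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located(
            (By.XPATH, "//a[contains(@href, 'xml') or contains(text(), 'XML')]")
        )
    )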

The standardized directory structure and consistent file naming convention, as illustrated in Code 5 and Code 7, facilitate programmatic access to the collected data. While the current implementation does not identify the tax year associated with each filing (using "unknown_1", "unknown_2", and so on as identifiers), the structured organization provides a foundation for further metadata enrichment and analysis.
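
One straightforward enrichment step would be to read the tax year from each saved filing instead of relying on the unknown_N placeholders. The sketch below is a best-effort illustration: the TaxYr and TaxYear element names are assumptions about the IRS e-file schema, and a regular expression is used because the files saved by Code 5 are browser page source and may wrap the XML in viewer markup.

import re
from pathlib import Path

def extract_tax_year(xml_path):
    """Best-effort extraction of the tax year from a saved filing.

    The TaxYr/TaxYear element names are assumptions about the IRS e-file schema,
    and a regex is used because the saved page source may not be well-formed XML.
    """
    text = Path(xml_path).read_text(encoding="utf-8", errors="ignore")
    match = re.search(r"<TaxY(?:ea)?r>\s*(\d{4})\s*</TaxY(?:ea)?r>", text)
    return match.group(1) if match else None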

The browser automation approach offers several advantages over traditional web scraping or direct API access. By simulating human browsing behavior, the method successfully navigated access controls and authentication requirements that blocked simpler HTTP request methods. The full browser environment allowed proper rendering and interaction with JavaScript-generated content that would be invisible to basic scraping tools. The approach can adapt to changes in page structure or interface design with minimal modifications, improving long-term viability. Complex multi-step processes like navigating from an organization page to specific XML filings were handled more elegantly than would be possible with traditional scraping.

Despite these advantages, the approach has several limitations that warrant consideration. Browser automation is more resource-intensive than simple HTTP requests, requiring more memory and processing power. The need to simulate realistic browsing behavior introduces delays and increases execution time compared to direct API access. The approach depends on browser drivers and compatible browser versions, increasing the complexity of deployment and maintenance. While more robust than basic scraping, the method still relies on specific page elements and navigation paths that could change, potentially requiring updates to maintain functionality.

Ethical considerations are important in any data collection effort. The implementation includes deliberate measures to respect rate limits and avoid excessive server load, adhering to the spirit of the service’s terms of use. By including appropriate delays between actions and limiting parallel requests, the automation places minimal burden on the hosting infrastructure. The collection of this public data serves legitimate research and transparency purposes, supporting nonprofit sector accountability.

Several promising directions for future work emerge from this study. Enhancing the script to correctly identify the tax year associated with each filing would improve data organization and usability. Developing specialized parsers for Form 990 XML to extract key financial metrics like Program Efficiency (PE) and Fundraising Efficiency (FE) would enable more comprehensive analysis. Scaling the approach to cover a broader range of organizations and years would support more extensive sector studies. Comparing data retrieved through this method with other sources would help evaluate completeness and accuracy.
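
As a sketch of what such a parser might look like, the following computes two simple expense ratios from a single filing. It assumes the saved file contains (or has been reduced to) well-formed filing XML, that the schema uses the http://www.irs.gov/efile namespace, and that the TotalFunctionalExpensesGrp, ProgramServicesAmt, FundraisingAmt, and TotalAmt element names apply; all of these are assumptions based on common Form 990 e-file schema versions, and the ratio definitions themselves vary across the literature.

import xml.etree.ElementTree as ET

# Assumed namespace for IRS e-file XML; not verified against the downloaded files
NS = {"efile": "http://www.irs.gov/efile"}

def efficiency_ratios(xml_path):
    """Compute rough program and fundraising expense ratios from a Form 990 filing.

    Element names are assumptions about common e-file schema versions; filings
    using older or different tags will return None here.
    """
    root = ET.parse(xml_path).getroot()
    expenses = root.find(".//efile:TotalFunctionalExpensesGrp", NS)
    if expenses is None:
        return None

    def amount(tag):
        el = expenses.find(f"efile:{tag}", NS)
        return float(el.text) if el is not None and el.text else 0.0

    total = amount("TotalAmt")
    if total == 0:
        return None
    return {
        "program_efficiency": amount("ProgramServicesAmt") / total,
        "fundraising_share": amount("FundraisingAmt") / total,
    }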

The ability to reliably access nonprofit financial data in a programmatic fashion opens new possibilities for research, transparency, and accountability in the nonprofit sector. The methodology provides a foundation for more sophisticated analyses of organizational performance and financial metrics, potentially enabling insights that were previously hindered by data access limitations.

References

1. Williams, J.; Akaakar, A. IRS Delays and Other Barriers to Data Mean Real Risks for Nonprofits. Johnson Center for Philanthropy Blog 2023.
2. Ottinger, C. S.; Williams, J. Unlocking the Potential of Open 990 Data. Stanford Social Innovation Review 2022.
3. Mitchell, K. Efficient Retrieval Strategies for IRS Form 990 Data. Journal of Data Transparency 2021, 5 (1), 45–62. https://doi.org/10.1080/24735132.2021.1883214.
4. Barreto, H.; Villinski, M. T. Accessing IRS Form 990 Data with Excel. Philanthropy & Education 2019, 2 (2), 95–110. https://doi.org/10.2979/phileduc.2.2.05.
5. Mak, D. Working with IRS Data, 2021. https://www.py4analytics.info/book/contrib/IRSMarkDown.html (accessed 2025-04-01).
6.
7. Selenium Project. WebDriver: Advanced Usage, 2024. https://www.selenium.dev/documentation/webdriver/advanced/.