Stats as of 2026-04-29 06:11 UTC — last scan: 2026-04-28

3 scan batches run

3,755 of 3,863 available pages scanned (97.2% coverage) 2,805 pages with technology detections (74.7% of scanned) 251 unique technologies identified


Technology Scan by Country

Country URLs Scanned Pages with Detections Available Last Scan
Usa Edu Master 3,749 2,799 3,763 2026-04-28
Usa Edu Top100 100 88 100 2026-04-28

Hover or focus any non-zero country-table count to preview matching pages. Activate the number to keep the preview open and download a CSV for that country and metric from Download machine-readable technology data (JSON).


Top Technologies

# Technology Pages Categories
1 jQuery 1,584 JavaScript libraries
2 Google Tag Manager 1,388 Tag managers
3 PHP 1,356 Programming languages
4 Font Awesome 1,087 Font scripts
5 Google Font API 979 Font scripts
6 Nginx 753 Reverse proxies, Web servers
7 Bootstrap 707 UI frameworks
8 Cloudflare 695 CDN
9 Apache 687 Web servers
10 MySQL 667 Databases
11 WordPress 664 Blogs, CMS
12 jQuery Migrate 548 JavaScript libraries
13 Varnish 416 Caching
14 jsDelivr 387 CDN
15 Slick 334 JavaScript libraries
16 Pantheon 292 PaaS
17 MariaDB 292 Databases
18 Windows Server 282 Operating systems
19 IIS 280 Web servers
20 Yoast SEO 276 SEO

Top Technology Categories

# Category Pages
1 JavaScript libraries 3,515
2 Font scripts 2,082
3 Web servers 1,898
4 Programming languages 1,441
5 Tag managers 1,393
6 CDN 1,267
7 Databases 1,086
8 CMS 1,034
9 UI frameworks 1,013
10 Reverse proxies 767
11 PaaS 752
12 Blogs 697
13 Caching 595
14 Operating systems 431
15 Miscellaneous 314

📥 Machine-readable results: Download machine-readable technology data (JSON)


Overview

The technology scanner fetches each government page and uses python-Wappalyzer to identify technologies from HTTP response headers and HTML content. Detected technologies (CMS, web server, JavaScript frameworks, analytics, etc.) and their versions are stored in the metadata database and written back into an annotated *_tech.toon TOON file.

Scans run automatically every 6 hours via GitHub Actions so that the full set of URLs across all seed files can be covered gradually without overloading government servers.


Usage

Scan a single seed

python3 -m src.cli.scan_technology --country USA_EDU_MASTER --rate-limit 2

Scan all seed files

python3 -m src.cli.scan_technology --all --rate-limit 2
python3 -m src.cli.scan_technology --all --max-runtime 110 --rate-limit 2.0

Command-line options

Option Default Description
--country CODE Seed code to scan (e.g. USA_EDU_MASTER)
--all Scan all seed files in the TOON directory
--toon-dir PATH data/toon-seeds Directory with .toon seed files
--rate-limit N 2.0 Maximum HTTP requests per second
--max-runtime N 0 (no limit) Maximum runtime in minutes. The scanner stops gracefully before this limit so that partial results can be saved. Set to ~10 minutes less than the GitHub Actions timeout-minutes value.

GitHub Actions

The Scan Technology Stack workflow (.github/workflows/scan-technology.yml) runs automatically every 6 hours and can also be triggered manually from the Actions tab:

  1. Go to Actions → Scan Technology Stack → Run workflow
  2. Optionally enter a seed code (leave blank to scan all seed files)
  3. Optionally adjust the rate limit

Artifacts uploaded after each run:

Artifact Contents
tech-scan-<run_number> data/metadata.db, scan output log, annotated *_tech.toon files
validation-metadata data/metadata.db (shared with URL validation and social media scans)

Output

Annotated TOON file

Each page entry in the output *_tech.toon file gains a technologies field:

{
  "url": "https://example.gov/",
  "is_root_page": true,
  "technologies": {
    "Nginx": { "versions": ["1.24"], "categories": ["Web servers"] },
    "WordPress": { "versions": ["6.2"], "categories": ["CMS", "Blogs"] }
  }
}

If detection failed for a URL, a tech_error field is added instead:

{
  "url": "https://unreachable.gov/",
  "tech_error": "Connection error: ..."
}

Database table

Results are stored in the url_tech_results table:

Column Type Description
url TEXT Page URL
country_code TEXT Legacy field name for seed identifier (e.g. USA_EDU_MASTER)
scan_id TEXT Unique scan run ID
technologies TEXT JSON object of detected technologies
error_message TEXT Error message (if detection failed)
scanned_at TEXT ISO-8601 timestamp

Query example:

SELECT url, technologies
FROM url_tech_results
WHERE country_code = 'USA_EDU_MASTER'
ORDER BY scanned_at DESC;

Architecture

flowchart TD
    A["scan-technology.yml\n(GitHub Actions — every 6 hours)"]
    A --> B["scan_technology.py (CLI)"]
    B --> C["TechScanner.scan_country()"]
    C --> D["TechDetector.detect_urls_batch()"]
    D --> E["For each URL"]
    E --> F["httpx.get() → HTML + headers"]
    F --> G["Wappalyzer.analyze_with_versions_and_categories()"]
    G --> H["Save to url_tech_results table\n(incremental, per URL)"]
    H --> I["Write *_tech.toon output file"]

Notes

  • Rate limiting is applied between requests to avoid overloading government servers. The default is 2 requests per second.
  • Technology fingerprinting is best-effort; some sites may return no detections if they use custom or obfuscated stacks.
  • Unlike the URL validator, failed tech scans do not mark a URL for removal — errors are recorded but the URL is kept in future scan cycles.
  • Results are persisted incrementally (one URL at a time) so that partial results are preserved even if the GitHub Actions job times out.
  • The *_tech.toon output files are excluded from version control (see .gitignore).