Technology Scanning

Stats as of 2026-04-29 06:11 UTC — last scan: 2026-04-28

3 scan batches run

3,755 of 3,863 available pages scanned (97.2% coverage) 2,805 pages with technology detections (74.7% of scanned) 251 unique technologies identified

Technology Scan by Country

Country	URLs Scanned	Pages with Detections	Available	Last Scan
Usa Edu Master	3,749	2,799	3,763	2026-04-28
Usa Edu Top100	100	88	100	2026-04-28

Hover or focus any non-zero country-table count to preview matching pages. Activate the number to keep the preview open and download a CSV for that country and metric from Download machine-readable technology data (JSON).

Top Technologies

#	Technology	Pages	Categories
1	jQuery	1,584	JavaScript libraries
2	Google Tag Manager	1,388	Tag managers
3	PHP	1,356	Programming languages
4	Font Awesome	1,087	Font scripts
5	Google Font API	979	Font scripts
6	Nginx	753	Reverse proxies, Web servers
7	Bootstrap	707	UI frameworks
8	Cloudflare	695	CDN
9	Apache	687	Web servers
10	MySQL	667	Databases
11	WordPress	664	Blogs, CMS
12	jQuery Migrate	548	JavaScript libraries
13	Varnish	416	Caching
14	jsDelivr	387	CDN
15	Slick	334	JavaScript libraries
16	Pantheon	292	PaaS
17	MariaDB	292	Databases
18	Windows Server	282	Operating systems
19	IIS	280	Web servers
20	Yoast SEO	276	SEO

Top Technology Categories

#	Category	Pages
1	JavaScript libraries	3,515
2	Font scripts	2,082
3	Web servers	1,898
4	Programming languages	1,441
5	Tag managers	1,393
6	CDN	1,267
7	Databases	1,086
8	CMS	1,034
9	UI frameworks	1,013
10	Reverse proxies	767
11	PaaS	752
12	Blogs	697
13	Caching	595
14	Operating systems	431
15	Miscellaneous	314

📥 Machine-readable results: Download machine-readable technology data (JSON)

Overview

The technology scanner fetches each government page and uses python-Wappalyzer to identify technologies from HTTP response headers and HTML content. Detected technologies (CMS, web server, JavaScript frameworks, analytics, etc.) and their versions are stored in the metadata database and written back into an annotated *_tech.toon TOON file.

Scans run automatically every 6 hours via GitHub Actions so that the full set of URLs across all seed files can be covered gradually without overloading government servers.

Usage

Scan a single seed

python3 -m src.cli.scan_technology --country USA_EDU_MASTER --rate-limit 2

Scan all seed files

python3 -m src.cli.scan_technology --all --rate-limit 2

Scan all seed files with a runtime cap (recommended for CI)

python3 -m src.cli.scan_technology --all --max-runtime 110 --rate-limit 2.0

Command-line options

Option	Default	Description
`--country CODE`	—	Seed code to scan (e.g. `USA_EDU_MASTER`)
`--all`	—	Scan all seed files in the TOON directory
`--toon-dir PATH`	`data/toon-seeds`	Directory with `.toon` seed files
`--rate-limit N`	`2.0`	Maximum HTTP requests per second
`--max-runtime N`	`0` (no limit)	Maximum runtime in minutes. The scanner stops gracefully before this limit so that partial results can be saved. Set to ~10 minutes less than the GitHub Actions `timeout-minutes` value.

GitHub Actions

The Scan Technology Stack workflow (.github/workflows/scan-technology.yml) runs automatically every 6 hours and can also be triggered manually from the Actions tab:

Go to Actions → Scan Technology Stack → Run workflow
Optionally enter a seed code (leave blank to scan all seed files)
Optionally adjust the rate limit

Artifacts uploaded after each run:

Artifact	Contents
`tech-scan-<run_number>`	`data/metadata.db`, scan output log, annotated `*_tech.toon` files
`validation-metadata`	`data/metadata.db` (shared with URL validation and social media scans)

Output

Annotated TOON file

Each page entry in the output *_tech.toon file gains a technologies field:

{
  "url": "https://example.gov/",
  "is_root_page": true,
  "technologies": {
    "Nginx": { "versions": ["1.24"], "categories": ["Web servers"] },
    "WordPress": { "versions": ["6.2"], "categories": ["CMS", "Blogs"] }
  }
}

If detection failed for a URL, a tech_error field is added instead:

{
  "url": "https://unreachable.gov/",
  "tech_error": "Connection error: ..."
}

Database table

Results are stored in the url_tech_results table:

Column	Type	Description
`url`	TEXT	Page URL
`country_code`	TEXT	Legacy field name for seed identifier (e.g. `USA_EDU_MASTER`)
`scan_id`	TEXT	Unique scan run ID
`technologies`	TEXT	JSON object of detected technologies
`error_message`	TEXT	Error message (if detection failed)
`scanned_at`	TEXT	ISO-8601 timestamp

Query example:

SELECT url, technologies
FROM url_tech_results
WHERE country_code = 'USA_EDU_MASTER'
ORDER BY scanned_at DESC;

Architecture

flowchart TD
    A["scan-technology.yml\n(GitHub Actions — every 6 hours)"]
    A --> B["scan_technology.py (CLI)"]
    B --> C["TechScanner.scan_country()"]
    C --> D["TechDetector.detect_urls_batch()"]
    D --> E["For each URL"]
    E --> F["httpx.get() → HTML + headers"]
    F --> G["Wappalyzer.analyze_with_versions_and_categories()"]
    G --> H["Save to url_tech_results table\n(incremental, per URL)"]
    H --> I["Write *_tech.toon output file"]

Notes

Rate limiting is applied between requests to avoid overloading government servers. The default is 2 requests per second.
Technology fingerprinting is best-effort; some sites may return no detections if they use custom or obfuscated stacks.
Unlike the URL validator, failed tech scans do not mark a URL for removal — errors are recorded but the URL is kept in future scan cycles.
Results are persisted incrementally (one URL at a time) so that partial results are preserved even if the GitHub Actions job times out.
The *_tech.toon output files are excluded from version control (see .gitignore).