Technology Scanning
Stats as of 2026-04-29 06:11 UTC — last scan: 2026-04-28
3 scan batches run
3,755 of 3,863 available pages scanned (97.2% coverage) 2,805 pages with technology detections (74.7% of scanned) 251 unique technologies identified
Technology Scan by Country
| Country | URLs Scanned | Pages with Detections | Available | Last Scan |
|---|---|---|---|---|
| Usa Edu Master | 3,749 | 2,799 | 3,763 | 2026-04-28 |
| Usa Edu Top100 | 100 | 88 | 100 | 2026-04-28 |
Hover or focus any non-zero country-table count to preview matching pages. Activate the number to keep the preview open and download a CSV for that country and metric from Download machine-readable technology data (JSON).
Top Technologies
| # | Technology | Pages | Categories |
|---|---|---|---|
| 1 | jQuery | 1,584 | JavaScript libraries |
| 2 | Google Tag Manager | 1,388 | Tag managers |
| 3 | PHP | 1,356 | Programming languages |
| 4 | Font Awesome | 1,087 | Font scripts |
| 5 | Google Font API | 979 | Font scripts |
| 6 | Nginx | 753 | Reverse proxies, Web servers |
| 7 | Bootstrap | 707 | UI frameworks |
| 8 | Cloudflare | 695 | CDN |
| 9 | Apache | 687 | Web servers |
| 10 | MySQL | 667 | Databases |
| 11 | WordPress | 664 | Blogs, CMS |
| 12 | jQuery Migrate | 548 | JavaScript libraries |
| 13 | Varnish | 416 | Caching |
| 14 | jsDelivr | 387 | CDN |
| 15 | Slick | 334 | JavaScript libraries |
| 16 | Pantheon | 292 | PaaS |
| 17 | MariaDB | 292 | Databases |
| 18 | Windows Server | 282 | Operating systems |
| 19 | IIS | 280 | Web servers |
| 20 | Yoast SEO | 276 | SEO |
Top Technology Categories
| # | Category | Pages |
|---|---|---|
| 1 | JavaScript libraries | 3,515 |
| 2 | Font scripts | 2,082 |
| 3 | Web servers | 1,898 |
| 4 | Programming languages | 1,441 |
| 5 | Tag managers | 1,393 |
| 6 | CDN | 1,267 |
| 7 | Databases | 1,086 |
| 8 | CMS | 1,034 |
| 9 | UI frameworks | 1,013 |
| 10 | Reverse proxies | 767 |
| 11 | PaaS | 752 |
| 12 | Blogs | 697 |
| 13 | Caching | 595 |
| 14 | Operating systems | 431 |
| 15 | Miscellaneous | 314 |
📥 Machine-readable results: Download machine-readable technology data (JSON)
Overview
The technology scanner fetches each government page and uses
python-Wappalyzer to identify
technologies from HTTP response headers and HTML content. Detected
technologies (CMS, web server, JavaScript frameworks, analytics, etc.) and
their versions are stored in the metadata database and written back into an
annotated *_tech.toon TOON file.
Scans run automatically every 6 hours via GitHub Actions so that the full set of URLs across all seed files can be covered gradually without overloading government servers.
Usage
Scan a single seed
python3 -m src.cli.scan_technology --country USA_EDU_MASTER --rate-limit 2
Scan all seed files
python3 -m src.cli.scan_technology --all --rate-limit 2
Scan all seed files with a runtime cap (recommended for CI)
python3 -m src.cli.scan_technology --all --max-runtime 110 --rate-limit 2.0
Command-line options
| Option | Default | Description |
|---|---|---|
--country CODE |
— | Seed code to scan (e.g. USA_EDU_MASTER) |
--all |
— | Scan all seed files in the TOON directory |
--toon-dir PATH |
data/toon-seeds |
Directory with .toon seed files |
--rate-limit N |
2.0 |
Maximum HTTP requests per second |
--max-runtime N |
0 (no limit) |
Maximum runtime in minutes. The scanner stops gracefully before this limit so that partial results can be saved. Set to ~10 minutes less than the GitHub Actions timeout-minutes value. |
GitHub Actions
The Scan Technology Stack workflow (.github/workflows/scan-technology.yml)
runs automatically every 6 hours and can also be triggered manually from the
Actions tab:
- Go to Actions → Scan Technology Stack → Run workflow
- Optionally enter a seed code (leave blank to scan all seed files)
- Optionally adjust the rate limit
Artifacts uploaded after each run:
| Artifact | Contents |
|---|---|
tech-scan-<run_number> |
data/metadata.db, scan output log, annotated *_tech.toon files |
validation-metadata |
data/metadata.db (shared with URL validation and social media scans) |
Output
Annotated TOON file
Each page entry in the output *_tech.toon file gains a technologies field:
{
"url": "https://example.gov/",
"is_root_page": true,
"technologies": {
"Nginx": { "versions": ["1.24"], "categories": ["Web servers"] },
"WordPress": { "versions": ["6.2"], "categories": ["CMS", "Blogs"] }
}
}
If detection failed for a URL, a tech_error field is added instead:
{
"url": "https://unreachable.gov/",
"tech_error": "Connection error: ..."
}
Database table
Results are stored in the url_tech_results table:
| Column | Type | Description |
|---|---|---|
url |
TEXT | Page URL |
country_code |
TEXT | Legacy field name for seed identifier (e.g. USA_EDU_MASTER) |
scan_id |
TEXT | Unique scan run ID |
technologies |
TEXT | JSON object of detected technologies |
error_message |
TEXT | Error message (if detection failed) |
scanned_at |
TEXT | ISO-8601 timestamp |
Query example:
SELECT url, technologies
FROM url_tech_results
WHERE country_code = 'USA_EDU_MASTER'
ORDER BY scanned_at DESC;
Architecture
flowchart TD
A["scan-technology.yml\n(GitHub Actions — every 6 hours)"]
A --> B["scan_technology.py (CLI)"]
B --> C["TechScanner.scan_country()"]
C --> D["TechDetector.detect_urls_batch()"]
D --> E["For each URL"]
E --> F["httpx.get() → HTML + headers"]
F --> G["Wappalyzer.analyze_with_versions_and_categories()"]
G --> H["Save to url_tech_results table\n(incremental, per URL)"]
H --> I["Write *_tech.toon output file"]
Notes
- Rate limiting is applied between requests to avoid overloading government servers. The default is 2 requests per second.
- Technology fingerprinting is best-effort; some sites may return no detections if they use custom or obfuscated stacks.
- Unlike the URL validator, failed tech scans do not mark a URL for removal — errors are recorded but the URL is kept in future scan cycles.
- Results are persisted incrementally (one URL at a time) so that partial results are preserved even if the GitHub Actions job times out.
- The
*_tech.toonoutput files are excluded from version control (see.gitignore).