Tasks: Drupal Documentation Upgrade System
This document outlines the work packages (WPs) and subtasks required to implement the automated documentation upgrade system.
Foundational Phase
WP01: Foundation - Crawler Engine (Priority: P1)
Goal: Build the core Python-based crawling engine to mirror drupal.org (d.o) content.
- T001: Initialize Python Scrapy project in /crawler
- T002: Implement DocumentationSpider for d.o docs
- T003: Implement AssetPipeline for images/PDFs
- T004: Implement persistent SyncSession tracking
- T005: Add Scrapy throttle/rate-limiting configuration
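One way the persistent SyncSession tracking from T004 could look is sketched below, using only the standard library. The JSON state file, the url-to-hash mapping, and every method name are assumptions made for illustration; the task list only names the class.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SyncSession:
    """Hypothetical record of which URLs a crawl has already mirrored."""

    state_file: Path
    seen: dict = field(default_factory=dict)  # url -> content hash

    @classmethod
    def load(cls, state_file):
        """Restore a previous session, or start empty if no state file exists."""
        path = Path(state_file)
        seen = json.loads(path.read_text()) if path.exists() else {}
        return cls(state_file=path, seen=seen)

    def needs_fetch(self, url, content_hash):
        """True when the URL is new or its content hash has changed."""
        return self.seen.get(url) != content_hash

    def record(self, url, content_hash):
        """Remember the hash of the version that was just mirrored."""
        self.seen[url] = content_hash

    def save(self):
        """Write the session to disk so the next run can resume from it."""
        self.state_file.write_text(json.dumps(self.seen, indent=2, sort_keys=True))
```

A crawl would call `SyncSession.load(...)` at start-up, check `needs_fetch` before writing each page, and call `save()` when the run ends.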
Summary: Sets up the base infrastructure for mirroring content.
Success Criteria: Running the crawler mirrors text and media assets to the local content/ folder without triggering d.o rate limits.
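The rate-limit criterion could be approached with a settings.py fragment along these lines. The setting names are Scrapy's own; every value is an illustrative assumption and would need tuning against real crawl behaviour.

```python
# settings.py (sketch): polite-crawling defaults. All values are assumptions.

# Respect the robots.txt rules published by the target site.
ROBOTSTXT_OBEY = True

# Baseline delay, in seconds, between requests to the same domain.
DOWNLOAD_DELAY = 2.0

# Keep per-domain concurrency low so the mirror never bursts.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let the AutoThrottle extension adapt the delay to observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Retry on 429 (Too Many Requests) as well as common transient errors.
RETRY_ENABLED = True
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]
```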
WP02: Conversion - Semantic Markdown Transformer (Priority: P1)
Goal: Transform mirrored HTML into structured Markdown with semantic metadata.
- T006: Implement MarkdownTransformer (HTML to MD)
- T007: Implement Metadata Extractor (RDFa/Microformats to Frontmatter)
- T008: Implement main.py CLI orchestrator
- T009: Implement Contributor Extractor for reviewer suggestions
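To make the HTML-to-Markdown step concrete, here is a minimal standard-library sketch covering a small tag subset. The class name `MiniMarkdownTransformer` and the helper `html_to_markdown` are hypothetical; the real MarkdownTransformer may rely on a dedicated conversion library and handle far more markup.

```python
import re
from html.parser import HTMLParser


class MiniMarkdownTransformer(HTMLParser):
    """Converts a small HTML subset (h1-h3, p, ul/ol, li, a, strong, em, code)."""

    INLINE = {"strong": "**", "em": "*", "code": "`"}
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__()
        self.out = []     # Markdown fragments, joined at the end
        self.href = None  # target of the <a> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.out.append(self.HEADINGS[tag])
        elif tag == "li":
            self.out.append("- ")
        elif tag == "a":
            self.href = dict(attrs).get("href", "")
            self.out.append("[")
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])

    def handle_endtag(self, tag):
        if tag in self.HEADINGS or tag == "p":
            self.out.append("\n\n")
        elif tag in ("li", "ul", "ol"):
            self.out.append("\n")
        elif tag == "a":
            self.out.append(f"]({self.href})")
            self.href = None
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])

    def handle_data(self, data):
        # Collapse source whitespace; drop it entirely at the start of a line.
        text = re.sub(r"\s+", " ", data)
        if not self.out or self.out[-1].endswith("\n"):
            text = text.lstrip()
        if text:
            self.out.append(text)


def html_to_markdown(html):
    """Run the transformer over an HTML fragment and return Markdown text."""
    transformer = MiniMarkdownTransformer()
    transformer.feed(html)
    transformer.close()
    return "".join(transformer.out).strip() + "\n"
```

Tags outside the subset are ignored and their text passes through unchanged, so unknown markup degrades to plain prose rather than raising an error.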
Summary: Handles the data transformation layer using a hybrid AI approach (Ollama for routine cleanup/extraction).
Success Criteria: Mirrored HTML is converted to .md files with rich YAML frontmatter.
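The frontmatter criterion can be illustrated with a small extractor that reads `<meta property="..." content="...">` pairs, a narrow subset of RDFa, and emits a YAML block. The class and function names and the sample property names are assumptions; the real extractor would also need to cover Microformats and RDFa attributes on other elements.

```python
from html.parser import HTMLParser


class RdfaMetaParser(HTMLParser):
    """Collects <meta property="..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("property") and attr.get("content") is not None:
            self.metadata[attr["property"]] = attr["content"]


def to_frontmatter(metadata):
    """Render a flat dict as YAML frontmatter with double-quoted scalars."""
    lines = ["---"]
    for key in sorted(metadata):
        # Escape backslashes and quotes so the value stays a valid YAML string.
        value = metadata[key].replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'"{key}": "{value}"')
    lines.append("---")
    return "\n".join(lines) + "\n"


def extract_frontmatter(html):
    """Parse the HTML and return the frontmatter block for its metadata."""
    parser = RdfaMetaParser()
    parser.feed(html)
    parser.close()
    return to_frontmatter(parser.metadata)
```

Keys are quoted because RDFa property names such as `dc:title` contain a colon, which would otherwise be ambiguous in YAML.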
Intelligence & Automation Phase
WP03: Intelligence - AI Merging & Gap Identification (Priority: P2)
Goal: Leverage Gemini to merge content and find documentation gaps.
- T010: Integrate Gemini API Client
- T011: Implement “AI Merger” for d.o + Drupal CMS consolidation
- T012: Implement Gap Analysis logic (detecting missing/EOL info)
Summary: Adds the AI layer for “smart” documentation upgrades.
Success Criteria: The AI layer identifies missing Drupal 11 (D11) information and merges overlapping content sources.
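Before any model is involved, a cheap heuristic pass could shortlist candidate gaps. The sketch below is a stand-in for the Gap Analysis logic, not the Gemini-based implementation: it flags pages that mention Drupal versions but never the target one. The function name, the regular expression, and the default set of end-of-life versions are all assumptions.

```python
import re

# Matches "Drupal 10" or the shorthand "D10"; captures the major version.
VERSION_RE = re.compile(r"\b(?:Drupal\s+|D)(\d{1,2})\b")


def find_gaps(docs, target_version=11, eol_versions=frozenset({7, 8, 9})):
    """Return gap records for docs that never mention the target version.

    docs maps a file path to its Markdown text. Pages with no version
    mention at all are treated as version-neutral and skipped.
    """
    gaps = []
    for path, text in sorted(docs.items()):
        versions = {int(v) for v in VERSION_RE.findall(text)}
        if not versions or target_version in versions:
            continue
        # "eol-only" means every version the page mentions is end-of-life.
        reason = "eol-only" if versions <= eol_versions else "missing-target"
        gaps.append({"path": path, "reason": reason, "versions": sorted(versions)})
    return gaps
```

The resulting records could then be passed to the AI layer for confirmation, which keeps model calls limited to pages that are plausibly out of date.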
WP04: Automation - Reporting & GitHub Actions (Priority: P2)
Goal: Automate the sync and reporting workflow via GitHub.
- T013: Implement GitHub Issue Creator for gaps
- T014: Implement Jekyll-compatible Gap Report generator
- T015: Create GitHub Action (Daily Cron) workflow
- T016: Implement Comment-Triggered Workflow (“Confirmed” trigger)
Summary: Connects the local automation to the GitHub ecosystem.
Success Criteria: Daily runs create GitHub issues for gaps and update the web-based report.
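The issue-creation step can be sketched against GitHub's REST endpoint for creating issues (`POST /repos/{owner}/{repo}/issues`). The function below only builds the request, so it can be inspected without network access. The function name, the gap record shape, and the label names are hypothetical.

```python
import json
import urllib.request


def build_issue_request(owner, repo, token, gap):
    """Build (but do not send) a request that opens an issue for one gap."""
    payload = {
        "title": f"Docs gap: {gap['path']} ({gap['reason']})",
        "body": f"Automated gap report for `{gap['path']}`.\n\nReason: {gap['reason']}",
        "labels": ["documentation", "gap"],  # assumed label names
    }
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{owner}/{repo}/issues",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Sending it would be a matter of passing the result to `urllib.request.urlopen`, with the token supplied by the workflow's secrets rather than hard-coded.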
Presentation Phase
WP05: Portal - Jekyll Documentation Site (Priority: P3)
Goal: Serve the upgraded documentation via a modern interface.
- T017: Set up Jekyll site structure and theme
- T018: Implement dynamic navigation from file structure
- T019: Implement version-specific UI callouts
Summary: Provides the final user interface for the documentation.
Success Criteria: Consolidated docs are navigable, searchable, and version-aware on GitHub Pages.
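Navigation derived from the file structure could start from a nested mapping like the one built below, which a later step might serialize into a Jekyll data file. The function name and the URL scheme (path without extension, wrapped in slashes) are assumptions.

```python
from pathlib import PurePosixPath


def build_nav(paths):
    """Turn a flat list of Markdown paths into a nested navigation mapping.

    Folders become nested dicts; pages map their stem to a site URL.
    Assumes no page shares a name with a sibling folder.
    """
    root = {}
    for raw in sorted(paths):
        path = PurePosixPath(raw)
        node = root
        # Walk (and create) one dict level per parent folder.
        for folder in path.parts[:-1]:
            node = node.setdefault(folder, {})
        node[path.stem] = "/" + str(path.with_suffix("")) + "/"
    return root
```

Sorting the input first keeps the output deterministic between runs, which avoids spurious diffs when the navigation is regenerated.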