Research: Drupal Documentation Upgrade System

This document captures the research findings and architectural decisions for the Drupal documentation upgrade automation.

Decision Log

Decision: Python-based Crawler

Rationale: Python offers superior libraries for web scraping (Scrapy), HTML parsing (BeautifulSoup), and data transformation (Pandoc wrappers). It also integrates natively with many AI SDKs (Gemini).
Alternatives Considered: Node.js (Puppeteer) was considered but identified as overkill for the mostly static Drupal.org documentation.

Decision: Hybrid Reporting (Issues + Jekyll)

Rationale: GitHub Issues provide an actionable “in-box” for community members to see gaps, while a Jekyll-rendered report provides a high-level overview of the library’s health. The “Confirmed” comment trigger allows for human-in-the-loop verification before major upgrades are merged.

Decision: AI Merger Strategy

Rationale: Simplifies the documentation experience by presenting a single “best” version of a requirement, rather than forcing users to check multiple disparate sources.

Decision: Local Model Routing (Ollama)

Rationale: Use qwen2.5-coder:7b via http://localhost:11434 for routine Markdown cleanup and metadata extraction to save Gemini for high-level reasoning.

Implementation Details

Metadata Schema (RDFa/Frontmatter)

The system will inject the following mandatory fields into Markdown frontmatter:

source_url: Original d.o or Drupal CMS URL.
drupal_version: Primary version target (e.g., D11).
related_versions: List of other supported versions covered (D10, Drupal CMS).
suggested_reviewers: List of top contributors extracted from d.o issue queues.
last_sync: ISO timestamp.

Throttle Control

To avoid load on drupal.org, the Scrapy engine will be configured with:

DOWNLOAD_DELAY: 2 seconds (minimum).
CONCURRENT_REQUESTS_PER_DOMAIN: 1.
AUTOTHROTTLE_ENABLED: True.

Dependencies & Best Practices

Scrapy: Use FilesPipeline for mirroring pdfs/images.
Pandoc: Use for high-fidelity HTML-to-Markdown conversion.
GitHub Actions: Use workflow_dispatch for manual runs and schedule for daily cron.