Work Package Prompt: WP02 – Conversion - Semantic Markdown Transformer
Objectives & Success Criteria
- Convert raw HTML documentation into structured Markdown.
- Extract semantic metadata (RDFa/Frontmatter) for Jekyll consumption.
- Provide a unified CLI orchestrator for the sync process.
Context & Constraints
- Depends on output from WP01.
- Model Routing: Use Ollama (qwen2.5-coder:7b) at http://localhost:11434 for routine Markdown normalization and metadata extraction.
- Must preserve semantic structure (headings, lists, code blocks).
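The model routing constraint can be sketched as a small client for the local Ollama server. This is a minimal sketch, not a settled design: it assumes Ollama's /api/generate endpoint with streaming disabled, and the function names and prompt wording are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # endpoint named in the constraints
MODEL = "qwen2.5-coder:7b"             # model named in the constraints


def build_generate_request(prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for Ollama's /api/generate."""
    payload = {"model": MODEL, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def normalize_markdown(markdown: str, timeout: float = 60.0) -> str:
    """Ask the local model to normalize Markdown (hypothetical prompt wording)."""
    prompt = (
        "Normalize this Markdown. Keep headings, lists, and code blocks intact.\n\n"
        + markdown
    )
    request = build_generate_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # With "stream": False the generated text arrives in the "response" field.
    return body["response"]
```

Keeping request construction separate from the network call makes the routing logic checkable in CI without a running Ollama instance.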
Subtasks & Detailed Guidance
Subtask T006 – Implement MarkdownTransformer
- Purpose: Transform HTML to high-quality Markdown.
- Steps:
- Use Pandoc or a Python library like markdownify.
- Preserve links to Drupal.org as-is, or rewrite them as relative links if the target exists in content.
- Files: /crawler/transformers/md_transformer.py.
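The link-handling step could look roughly like the following post-processing pass over already-converted Markdown. It is a sketch under assumptions: the local_pages mapping (Drupal.org URL path to local Markdown path) and the function name rewrite_links are hypothetical, and the HTML-to-Markdown conversion itself is left to Pandoc or markdownify.

```python
import re
from urllib.parse import urlparse

DRUPAL_HOSTS = {"drupal.org", "www.drupal.org"}
# Matches inline Markdown links of the form [text](url).
MD_LINK = re.compile(r"\[([^\]]*)\]\(([^)\s]+)\)")


def rewrite_links(markdown: str, local_pages: dict[str, str]) -> str:
    """Rewrite Drupal.org links to relative paths when the target is known.

    local_pages is a hypothetical index mapping a Drupal.org URL path
    (without trailing slash) to the relative path of the converted page.
    """
    def replace(match: re.Match) -> str:
        text, url = match.group(1), match.group(2)
        parsed = urlparse(url)
        if parsed.hostname in DRUPAL_HOSTS:
            target = local_pages.get(parsed.path.rstrip("/"))
            if target is not None:
                suffix = f"#{parsed.fragment}" if parsed.fragment else ""
                return f"[{text}]({target}{suffix})"
        # Unknown targets and non-Drupal links are preserved unchanged.
        return match.group(0)

    return MD_LINK.sub(replace, markdown)
```

Links whose targets are not in the index are left as absolute Drupal.org URLs, which matches the "preserved or relative-linked" requirement.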
Subtask T007 – Implement Metadata Extractor
- Purpose: Extract semantic context into YAML Frontmatter.
- Steps:
- Parse HTML for Microformats/RDFa tags.
- Map these to the schema fields defined in data-model.md.
- Files: /crawler/transformers/metadata_extractor.py.
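The extraction steps above might be sketched with the standard-library HTML parser as follows. The FIELD_MAP entries are placeholders, since the real field names live in data-model.md; the sketch only reads RDFa property attributes, taking either the content attribute or the first non-empty text after the opening tag.

```python
import json
from html.parser import HTMLParser

# Hypothetical mapping; the real schema fields are defined in data-model.md.
FIELD_MAP = {
    "dc:title": "title",
    "dc:date": "date",
    "schema:dateModified": "last_updated",
}


class RdfaCollector(HTMLParser):
    """Collect the first value seen for each RDFa property."""

    def __init__(self):
        super().__init__()
        self.found = {}
        self._pending = []  # properties waiting for their text value

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        props = (attributes.get("property") or "").split()
        if not props:
            return
        content = attributes.get("content")
        if content is not None:
            for prop in props:
                self.found.setdefault(prop, content)
        else:
            self._pending = props

    def handle_data(self, data):
        text = data.strip()
        if self._pending and text:
            for prop in self._pending:
                self.found.setdefault(prop, text)
            self._pending = []


def extract_frontmatter(html: str) -> str:
    """Return a YAML frontmatter block for the mapped properties found."""
    collector = RdfaCollector()
    collector.feed(html)
    collector.close()
    lines = ["---"]
    for prop, field in FIELD_MAP.items():
        if prop in collector.found:
            # JSON string quoting is also valid YAML double-quoted syntax.
            lines.append(f"{field}: {json.dumps(collector.found[prop])}")
    lines.append("---")
    return "\n".join(lines) + "\n"
```

Properties that are absent from the page are simply omitted, so downstream Jekyll templates should treat every mapped field as optional.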
Subtask T008 – Implement main.py CLI orchestrator
- Purpose: Entry point for users and CI.
- Steps:
- Implement commands: crawl, transform, sync (both).
- Use argparse or click.
- Files: /crawler/main.py.
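One possible shape for the orchestrator, using argparse subcommands, is sketched below. The handler bodies are placeholders for the WP01 crawler and the transformers in this package, and the handler and function names are hypothetical.

```python
import argparse


def crawl(args: argparse.Namespace) -> int:
    print("crawl: placeholder for the WP01 crawler")
    return 0


def transform(args: argparse.Namespace) -> int:
    print("transform: placeholder for the Markdown and metadata transformers")
    return 0


def sync(args: argparse.Namespace) -> int:
    # Run both stages in order; stop at the first non-zero exit code.
    return crawl(args) or transform(args)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="main.py", description="Documentation sync orchestrator"
    )
    subcommands = parser.add_subparsers(dest="command", required=True)
    for name, handler, help_text in (
        ("crawl", crawl, "fetch raw HTML"),
        ("transform", transform, "convert fetched HTML to Markdown"),
        ("sync", sync, "crawl, then transform"),
    ):
        sub = subcommands.add_parser(name, help=help_text)
        sub.set_defaults(handler=handler)
    return parser


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    return args.handler(args)

# In main.py the module would typically end with:
#     if __name__ == "__main__":
#         raise SystemExit(main())
```

Returning an integer from each handler lets CI treat the process exit code as the pass or fail signal.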
Subtask T009 – Implement Contributor Extractor
- Purpose: Identify experts for content review.
- Steps:
- If a documentation page links to an issue queue, fetch the top contributors list.
- Append these usernames to the suggested_reviewers frontmatter field.
- Files: /crawler/transformers/contributor_extractor.py.
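The two steps above could be sketched as follows. Several parts are assumptions: issue queues are taken to live under drupal.org/project/issues/, the source of the top contributors list is not specified here, so it is passed in as a caller-supplied function, and all function names are hypothetical.

```python
import re
from typing import Callable

# Assumed URL shape for a project's issue queue on Drupal.org.
ISSUE_QUEUE = re.compile(
    r"https?://(?:www\.)?drupal\.org/project/issues/([a-z0-9_]+)"
)


def find_issue_queues(html: str) -> list[str]:
    """Return linked project machine names, in order, without duplicates."""
    return list(dict.fromkeys(ISSUE_QUEUE.findall(html)))


def add_suggested_reviewers(frontmatter: dict, usernames: list[str]) -> dict:
    """Append usernames to suggested_reviewers, keeping order and uniqueness."""
    existing = list(frontmatter.get("suggested_reviewers", []))
    merged = list(dict.fromkeys(existing + usernames))
    return {**frontmatter, "suggested_reviewers": merged}


def extract_reviewers(
    html: str,
    frontmatter: dict,
    fetch_top_contributors: Callable[[str], list[str]],
) -> dict:
    """Look up contributors for each linked issue queue and merge them in.

    fetch_top_contributors is a hypothetical callable that takes a project
    machine name and returns usernames; how it fetches them is left open.
    """
    usernames: list[str] = []
    for project in find_issue_queues(html):
        usernames.extend(fetch_top_contributors(project))
    return add_suggested_reviewers(frontmatter, usernames)
```

Injecting the fetcher keeps the merge logic testable offline and leaves room for caching or rate limiting in the real lookup.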
Risks & Mitigations
- Risk: Variable HTML structure on Drupal.org (d.o) can produce broken Markdown.
- Mitigation: Implement robust sanitization and fallbacks for common d.o content wrapper patterns.
Activity Log
- 2026-03-04T12:15:00Z – system – lane=planned – Prompt created.