Work Package Prompt: WP01 – Foundation - Crawler Engine
Objectives & Success Criteria
- Initialize a Python Scrapy project capable of mirroring Drupal.org documentation.
- Implement a spider that traverses /docs and /documentation.
- Success: Running the crawler populates content/html and content/media without triggering d.o rate limits.
Context & Constraints
- Required for all downstream transformation and AI tasks.
- Must honor defined throttle control: 2s delay, 1 concurrent request.
- Reference: plan.md, research.md.
Subtasks & Detailed Guidance
Subtask T001 – Initialize Python Scrapy project
- Purpose: Set up the environment and Scrapy boilerplate.
- Steps:
  - Create the /crawler directory.
  - Inside it, run scrapy startproject drupal_crawler . (the trailing dot makes Scrapy generate the project in the current directory).
  - Define a basic item structure for Document (url, title, html, assets).
- Files: /crawler/scrapy.cfg, /crawler/items.py.
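
A minimal sketch of the Document item, assuming it is written as a plain dataclass (recent Scrapy releases accept dataclass objects as items; a scrapy.Item subclass with the same four fields would work equally well). The field defaults and the sample asset URL are illustrative assumptions, not part of the plan.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """One mirrored documentation page (fields taken from the plan)."""

    url: str
    title: str = ""
    html: str = ""
    # URLs of images/PDFs referenced by the page; each instance gets its own list.
    assets: list[str] = field(default_factory=list)


# Hypothetical usage: the asset URL below is a made-up example.
doc = Document(url="https://www.drupal.org/docs", title="Drupal documentation")
doc.assets.append("https://www.drupal.org/files/example.png")
```

Using default_factory keeps the assets list from being shared between items, which matters once a pipeline starts appending to it.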
Subtask T002 – Implement DocumentationSpider
- Purpose: Traverse and fetch d.o documentation pages.
- Steps:
  - Create spider in spiders/doc_spider.py.
  - Target drupal.org/docs and drupal.org/documentation.
  - Extract title and main content area HTML.
- Files: /crawler/spiders/doc_spider.py.
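
The targeting step can be sketched as a small scope check that the spider might call before following a link. This is an illustration only: the function name in_scope, the host set, and the decision to match whole path segments are assumptions, and the spider's allowed_domains attribute would still be needed alongside it.

```python
from urllib.parse import urlsplit

# Assumed scope: both bare and www hosts, and the two path roots named in the plan.
ALLOWED_HOSTS = {"drupal.org", "www.drupal.org"}
ALLOWED_PREFIXES = ("/docs", "/documentation")


def in_scope(url: str) -> bool:
    """Return True if url is a d.o documentation page the spider should visit."""
    parts = urlsplit(url)
    if parts.hostname not in ALLOWED_HOSTS:
        return False
    path = parts.path.rstrip("/")
    # Match whole path segments so that /docsearch is not mistaken for /docs.
    return any(path == p or path.startswith(p + "/") for p in ALLOWED_PREFIXES)
```

Inside a parse callback, links extracted from the page would be passed through in_scope before being handed to response.follow.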
Subtask T003 – Implement AssetPipeline
- Purpose: Download and store images/PDFs.
- Steps:
  - Configure FilesPipeline.
  - Store files in content/media/, preserving d.o relative paths if possible.
- Files: /crawler/pipelines.py, /crawler/settings.py.
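
By default FilesPipeline names stored files after a hash of the URL, so preserving d.o relative paths would need a custom path mapping. The helper below is a sketch of that mapping under stated assumptions: media_relative_path is a hypothetical name, and it would be called from a file_path override in a FilesPipeline subclass, with FILES_STORE pointing at content/media.

```python
from urllib.parse import unquote, urlsplit


def media_relative_path(url: str) -> str:
    """Map an asset URL to a path relative to the media store."""
    raw_path = unquote(urlsplit(url).path)
    # Drop empty, "." and ".." segments so the result cannot escape the store.
    segments = [s for s in raw_path.split("/") if s not in ("", ".", "..")]
    if not segments:
        raise ValueError(f"URL has no usable path: {url!r}")
    return "/".join(segments)
```

The query string is discarded on purpose, so cache-busting parameters on image URLs do not produce duplicate files.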
Subtask T004 – Implement SyncSession tracking
- Purpose: Support incremental crawls.
- Steps:
  - Maintain a JSON file (e.g., sync_state.json) in content/.
  - Track last crawled URL and timestamp per page.
- Files: /crawler/spiders/doc_spider.py.
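
One possible shape for the sync state, sketched with the standard library only. The schema (a JSON object keyed by page URL, each entry holding a last_crawled ISO 8601 timestamp) and the function names are assumptions; the plan only asks for a URL and timestamp per page.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def load_state(path: Path) -> dict:
    """Return the saved state, or an empty dict on the first crawl."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def record_crawl(state: dict, url: str, when: Optional[datetime] = None) -> None:
    """Store the crawl time for url (defaults to the current UTC time)."""
    when = when or datetime.now(timezone.utc)
    state[url] = {"last_crawled": when.isoformat()}


def save_state(path: Path, state: dict) -> None:
    """Write via a temporary file so an interrupted crawl cannot truncate the state."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state, indent=2, sort_keys=True), encoding="utf-8")
    tmp.replace(path)
```

A spider could call load_state when it opens, record_crawl for each page it yields, and save_state when it closes; an incremental run would then skip or re-check pages based on the stored timestamps.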
Subtask T005 – Add Scrapy throttle configuration
- Purpose: Ensure zero negative impact on d.o.
- Steps:
  - Set DOWNLOAD_DELAY = 2.
  - Set CONCURRENT_REQUESTS_PER_DOMAIN = 1.
  - Enable AUTOTHROTTLE_ENABLED = True.
- Files: /crawler/settings.py.
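
Collected into a settings.py fragment, the three required values might look like the sketch below. Everything under the second comment is an assumption added for illustration rather than something the plan specifies; in particular, Scrapy randomizes the delay to between 0.5x and 1.5x of DOWNLOAD_DELAY by default, so a strict 2s floor would likely need that turned off.

```python
# Required by the plan.
DOWNLOAD_DELAY = 2                   # seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # never fetch two d.o pages in parallel
AUTOTHROTTLE_ENABLED = True          # back off further when the server slows down

# Assumptions, not stated in the plan.
RANDOMIZE_DOWNLOAD_DELAY = False     # keep the delay at a fixed 2s instead of 1s to 3s
AUTOTHROTTLE_START_DELAY = 2         # begin at the configured delay
AUTOTHROTTLE_MAX_DELAY = 30          # illustrative upper bound when latency is high
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
ROBOTSTXT_OBEY = True                # respect d.o robots.txt rules
```

AutoThrottle does not set a delay lower than DOWNLOAD_DELAY, so with these values it can only slow the crawl down, not speed it up.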
Risks & Mitigations
- Risk: IP Blocking by Drupal.org.
- Mitigation: Conservative throttling (2s delay, 1 concurrent request per domain, AutoThrottle enabled) and user-agent rotation.
Activity Log
- 2026-03-04T12:15:00Z – system – lane=planned – Prompt created.