Turn any sitemap into clean Markdown.
Built for Crawl4AI workflows.
Crawlboy reads every URL from a sitemap (including nested <sitemapindex> files), renders each page with Crawl4AI, and writes one .md file per URL—mirroring paths under your output directory’s md/ folder. Add raw HTML, downloaded images, and failure logs using the flags in the CLI reference below.
Open source under MIT.
Why teams use Crawlboy
Batch-convert static sites, docs portals, and blogs into a structured Markdown corpus, ready for search, RAG pipelines, mirrors, and offline reading, without building a custom crawler.
Sitemap discovery that “just finds it”
Auto-detect the sitemap from robots.txt or common paths. Follows nested <sitemapindex> entries recursively. Point at a site root or paste a sitemap URL—Crawlboy resolves the rest.
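Auto-discovery like this typically means checking robots.txt for a Sitemap: line, then falling back to well-known paths. A minimal sketch of that idea (not Crawlboy's actual internals; the fallback path list here is an assumption):

```python
from urllib.parse import urljoin

# Common fallback locations tried when robots.txt lists no sitemap (illustrative).
COMMON_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap-index.xml"]

def candidate_sitemaps(site_root: str, robots_txt: str) -> list[str]:
    """Return sitemap URLs to try: robots.txt entries first, then common paths."""
    candidates = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            candidates.append(value.strip())
    candidates += [urljoin(site_root, p) for p in COMMON_PATHS]
    return candidates
```

The crawler would fetch each candidate in order and use the first one that parses as valid sitemap XML.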
Markdown output with stable paths
One .md file per page, mirroring the URL path. Optional raw HTML under html/ with --save-html.
Images without duplicates
Save images to media/, content-addressed and deduplicated. Enable with --download-images.
Prefer prompts to flags? crawlboy -i walks you through the sitemap source, output directory, and options before the crawl starts.
$ crawlboy --interactive
────────────────────────────────────────────────
? How would you like to specify the site?
❯ Provide a sitemap URL directly
  Provide a site root URL (auto-discover sitemap)
? Sitemap URL: https://example.com/sitemap.xml
? Output directory: ./out
? Options (space to select):
◉ Save raw HTML (--save-html)
◯ Download images (--download-images)
◯ Stop on first error (--fail-fast)
────────────────────────────────────────────────
Starting crawl of https://example.com/sitemap.xml
✓ 42 URLs found

Guided setup for your first crawl
Run the guided wizard and Crawlboy will ask for the sitemap source, output directory, and optional settings—no flag memorization required.
Powered by questionary and Rich for a polished terminal experience.
From sitemap URLs to a folder of Markdown
URL paths become file paths
Each URL maps under your output directory’s md/ folder, mirroring the site hierarchy; / becomes index.md. For example, /blog/articles/basic-git-commands/ becomes out/md/blog/articles/basic-git-commands.md.
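The mapping rule described above can be sketched in a few lines of Python; this mirrors the stated examples rather than Crawlboy's exact implementation:

```python
from urllib.parse import urlparse

def md_path(url: str, out_dir: str = "out") -> str:
    """Map a page URL to its Markdown file path under <out_dir>/md/ (illustrative)."""
    path = urlparse(url).path
    if path in ("", "/"):
        return f"{out_dir}/md/index.md"
    return f"{out_dir}/md/{path.strip('/')}.md"
```

Because the mapping depends only on the URL path, re-running a crawl overwrites each page's file in place instead of creating duplicates.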
Optional HTML mirror
Add --save-html to write raw HTML beside Markdown under html/, mirroring the same directory structure.
Content-addressed images
Add --download-images to store assets in media/ with deduplicated content hashes—duplicates across pages are stored only once.
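Content addressing means the filename is derived from the file's bytes, so identical images fetched from different pages collapse to one file. A sketch of the idea (sha256 as the hash is an assumption; the source doesn't name the algorithm):

```python
import hashlib
from pathlib import Path

def save_image(media_dir: Path, data: bytes, ext: str = ".png") -> Path:
    """Store image bytes under a content-hash filename; identical bytes reuse one file."""
    name = hashlib.sha256(data).hexdigest() + ext
    dest = media_dir / name
    if not dest.exists():  # duplicate content across pages maps to the same name
        media_dir.mkdir(parents=True, exist_ok=True)
        dest.write_bytes(data)
    return dest
```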
Failure tracking
Failed pages append to errors.jsonl; add --fail-fast to stop immediately on the first error instead.
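JSONL (one JSON object per line) suits append-only failure logs because each record is written atomically and the file stays parseable mid-crawl. A sketch of such a writer, with hypothetical field names since the source doesn't specify the record schema:

```python
import json
from pathlib import Path

def record_failure(out_dir: Path, url: str, error: str) -> None:
    """Append one JSON object per failed page to errors.jsonl (field names illustrative)."""
    entry = {"url": url, "error": error}
    with (out_dir / "errors.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```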
Install in two commands
$ pip install crawlboy
$ crawl4ai-setup

Then run your first crawl:

$ crawlboy \
    --sitemap-url https://example.com/sitemap.xml \
    --out-dir ./out
CLI flags (complete reference)
Every crawl requires an output directory (--out-dir) and either a direct sitemap (--sitemap-url) or a site root for discovery (--site-url).
| Flag | Description | Default |
|---|---|---|
| --sitemap-url | Direct sitemap URL to crawl | — |
| --site-url | Site root URL; sitemap is auto-discovered from robots.txt or common paths | — |
| --out-dir | Directory to write output files into | — |
| --delay | Seconds to wait between each page crawl | 0 |
| --page-timeout-ms | Navigation timeout in milliseconds per page | 60000 |
| --max-urls | Cap the number of URLs to crawl | unlimited |
| --save-html | Write raw HTML files under html/ in addition to Markdown | off |
| --download-images | Save images to media/, content-addressed and deduplicated | off |
| --no-headless | Show the browser window while crawling (disables headless mode) | off |
| --fail-fast | Stop the entire crawl on the first page error | off |
| --include-offsite-urls | Allow crawling URLs on hosts outside the origin site | off |
| -i, --interactive | Launch the guided wizard instead of using flags | off |
Frequently asked questions
What is Crawlboy?
Crawlboy is a Python command-line tool that crawls every URL listed in a website sitemap—including nested sitemap indexes—and saves each page as Markdown using Crawl4AI.
How do I install Crawlboy?
Run pip install crawlboy, then crawl4ai-setup to install Playwright and Chromium. Use crawl4ai-doctor to verify your browser setup.
How do I crawl a site from its root URL?
Run crawlboy --site-url 'https://example.com' --out-dir ./out. Crawlboy discovers the sitemap from robots.txt or common paths.
How do I crawl a known sitemap URL?
Run crawlboy --sitemap-url 'https://example.com/sitemap.xml' --out-dir ./out.
Does Crawlboy support nested sitemap indexes?
Yes. Crawlboy follows sitemap index files and expands nested sitemapindex entries recursively.
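Recursive expansion of sitemap indexes can be sketched with the standard library's XML parser; this is an illustration of the technique, not Crawlboy's code, and the fetch callback abstracts away the network:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def expand(xml_text: str, fetch) -> list[str]:
    """Return page URLs from a sitemap, recursing into <sitemapindex> children.
    `fetch` maps a sitemap URL to its XML text."""
    root = ET.fromstring(xml_text)
    if root.tag.endswith("sitemapindex"):
        urls = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            urls += expand(fetch(loc.text.strip()), fetch)
        return urls
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]
```

A child sitemap can itself be an index, so the recursion continues until every <urlset> leaf has been flattened into one URL list.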
What files are created in the output directory?
By default you get md/ with one .md file per URL and errors.jsonl for failures. Use --save-html for html/ and --download-images for media/.
Ship a Markdown mirror today
Install from PyPI, run crawl4ai-setup once, then crawl your sitemap with flags or interactive mode.