Python CLI · Crawl4AI · Sitemap → Markdown

Turn any sitemap into clean Markdown.
Built for Crawl4AI workflows.

Crawlboy reads every URL from a sitemap (including nested <sitemapindex> files), renders each page with Crawl4AI, and writes one .md file per URL—mirroring paths under your output directory’s md/ folder. Add raw HTML, downloaded images, and failure logs using the flags in the CLI reference below.

$ pip install crawlboy

Open source under MIT.

What's inside

Why teams use Crawlboy

Batch static sites, docs portals, and blogs into a structured Markdown corpus—ready for search, RAG pipelines, mirrors, and offline reading—without building a custom crawler.


Sitemap discovery that “just finds it”

Auto-detect the sitemap from robots.txt or common paths. Follows nested <sitemapindex> entries recursively. Point at a site root or paste a sitemap URL—Crawlboy resolves the rest.

$ --site-url https://example.com
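The discovery idea itself is simple. Here is a minimal sketch, assuming `Sitemap:` lines in robots.txt with a `/sitemap.xml` fallback; the function name and exact fallback list are illustrative, not Crawlboy's internals:

```python
def discover_sitemaps(robots_txt: str, site_root: str) -> list[str]:
    """Return sitemap URLs from a robots.txt body, or a common default.

    Illustrative sketch only: real discovery may probe several
    well-known paths, not just /sitemap.xml.
    """
    found = [
        line.split(":", 1)[1].strip()
        for line in robots_txt.splitlines()
        if line.lower().startswith("sitemap:")
    ]
    # Fall back to the conventional location when robots.txt is silent.
    return found or [site_root.rstrip("/") + "/sitemap.xml"]
```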

Markdown output with stable paths

One .md file per page, mirroring the URL path. Optional raw HTML under html/ with --save-html.

$ --out-dir ./output

Images without duplicates

Save images to media/, content-addressed and deduplicated. Enable with --download-images.

$ --download-images
Interactive Mode

Prefer prompts to flags? crawlboy -i walks through sitemap source, output directory, and options before the crawl starts.

$ crawlboy --interactive

interactive wizard
$ crawlboy -i

ASCII art banner spelling Crawlboy, followed by an example interactive wizard that asks for a sitemap URL, output directory, and optional flags.

────────────────────────────────────────────────
? How would you like to specify the site?
❯ Provide a sitemap URL directly
  Provide a site root URL (auto-discover sitemap)
? Sitemap URL: https://example.com/sitemap.xml
? Output directory: ./out
? Options (space to select):
  Save raw HTML (--save-html)
  Download images (--download-images)
  Stop on first error (--fail-fast)
────────────────────────────────────────────────
Starting crawl of https://example.com/sitemap.xml
42 URLs found

Guided setup for your first crawl

Run the guided wizard and Crawlboy will ask for the sitemap source, output directory, and optional settings—no flag memorization required.

Powered by questionary and Rich for a polished terminal experience.

$ crawlboy -i
Output Structure

From sitemap URLs to a folder of Markdown

output structure
out/
├── md/
│   ├── index.md                      ← site root /
│   ├── blog/
│   │   └── articles/
│   │       └── basic-git-commands.md
│   └── docs/
│       ├── getting-started.md
│       └── api-reference.md
├── html/                             ← with --save-html
│   └── ... (mirrors md/ structure)
├── media/                            ← with --download-images
│   ├── a3f8c1d2.png
│   └── b9e2a7f1.jpg
└── errors.jsonl                      ← crawl failures
1. URL paths become file paths

Each URL maps under your output directory’s md/ folder, mirroring the site hierarchy; / becomes index.md. For example, /blog/articles/basic-git-commands/ becomes out/md/blog/articles/basic-git-commands.md.
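That mapping can be sketched in a few lines; this is a hypothetical helper for illustration, not Crawlboy's actual code, and real edge cases (query strings, unusual extensions) may be handled differently:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def md_path(url: str, out_dir: str = "out") -> str:
    """Map a page URL to its Markdown file path under out_dir/md/."""
    path = urlparse(url).path.strip("/")
    if not path:
        # The site root "/" becomes index.md.
        return str(PurePosixPath(out_dir, "md", "index.md"))
    return str(PurePosixPath(out_dir, "md", path + ".md"))
```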

2. Optional HTML mirror

Add --save-html to write raw HTML beside Markdown under html/, mirroring the same directory structure.

3. Content-addressed images

Add --download-images to store assets in media/ with deduplicated content hashes—duplicates across pages are stored only once.
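Content addressing means a file's name is derived from its bytes, so identical images collapse to one file. A minimal sketch of the scheme, assuming a short hex digest as the filename (as in `media/a3f8c1d2.png`); the hash choice and truncation length are assumptions, not Crawlboy's documented internals:

```python
import hashlib
from pathlib import Path

def media_name(data: bytes, ext: str) -> str:
    """Derive a content-addressed filename from image bytes."""
    return hashlib.sha256(data).hexdigest()[:8] + ext

def save_image(data: bytes, ext: str, media_dir: Path) -> Path:
    """Write an image once; identical bytes across pages share one file."""
    media_dir.mkdir(parents=True, exist_ok=True)
    dest = media_dir / media_name(data, ext)
    if not dest.exists():
        dest.write_bytes(data)
    return dest
```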


Failure tracking

Failed pages append to errors.jsonl; add --fail-fast to stop immediately on the first error instead.
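The JSON Lines format makes failure logs trivially appendable and greppable: one JSON object per line. A sketch of the idea (field names here are illustrative, not Crawlboy's exact schema):

```python
import json

def record_failure(log_file, url: str, error: object) -> None:
    """Append one failed page as a single JSON line to an open file."""
    log_file.write(json.dumps({"url": url, "error": str(error)}) + "\n")
```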

Get started

Install in two commands

install via pip
$ pip install crawlboy
$ crawl4ai-setup
Run crawl4ai-doctor to verify the browser setup is correct.
# Example usage:
$ crawlboy --sitemap-url 'https://example.com/sitemap.xml' --out-dir ./out
All flags

CLI flags (complete reference)

Every crawl requires an output directory (--out-dir) and either a direct sitemap (--sitemap-url) or a site root for discovery (--site-url).

| Flag | Description | Default |
| --- | --- | --- |
| --sitemap-url | Direct sitemap URL to crawl | |
| --site-url | Site root URL; sitemap is auto-discovered from robots.txt or common paths | |
| --out-dir | Directory to write output files into | |
| --delay | Seconds to wait between each page crawl | 0 |
| --page-timeout-ms | Navigation timeout in milliseconds per page | 60000 |
| --max-urls | Cap the number of URLs to crawl | unlimited |
| --save-html | Write raw HTML files under html/ in addition to Markdown | off |
| --download-images | Save images to media/, content-addressed and deduplicated | off |
| --no-headless | Show the browser window while crawling (disables headless mode) | off |
| --fail-fast | Stop the entire crawl on the first page error | off |
| --include-offsite-urls | Allow crawling URLs on hosts outside the origin site | off |
| -i, --interactive | Launch the guided wizard instead of using flags | off |
FAQ

Frequently asked questions

What is Crawlboy?

Crawlboy is a Python command-line tool that crawls every URL listed in a website sitemap—including nested sitemap indexes—and saves each page as Markdown using Crawl4AI.

How do I install Crawlboy?

Run pip install crawlboy, then crawl4ai-setup to install Playwright and Chromium. Use crawl4ai-doctor to verify your browser setup.

How do I crawl a site from its root URL?

Run crawlboy --site-url 'https://example.com' --out-dir ./out. Crawlboy discovers the sitemap from robots.txt or common paths.

How do I crawl a known sitemap URL?

Run crawlboy --sitemap-url 'https://example.com/sitemap.xml' --out-dir ./out.

Does Crawlboy support nested sitemap indexes?

Yes. Crawlboy follows sitemap index files and expands nested sitemapindex entries recursively.
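Recursive expansion of a sitemap index is straightforward with the standard sitemaps.org XML schema. A minimal sketch, not Crawlboy's internals; the `fetch` callback (mapping a sitemap URL to its XML text, e.g. via an HTTP GET) is a hypothetical parameter:

```python
import xml.etree.ElementTree as ET

# The sitemaps.org namespace used by <urlset> and <sitemapindex>.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def expand_sitemap(xml_text: str, fetch) -> list[str]:
    """Return all page URLs, recursing into nested <sitemapindex> files."""
    root = ET.fromstring(xml_text)
    locs = [el.text.strip() for el in root.iter(f"{NS}loc")]
    if root.tag == f"{NS}sitemapindex":
        urls: list[str] = []
        for child_url in locs:
            urls.extend(expand_sitemap(fetch(child_url), fetch))
        return urls
    return locs  # a plain <urlset> of page URLs
```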

What files are created in the output directory?

By default you get md/ with one .md file per URL and errors.jsonl for failures. Use --save-html for html/ and --download-images for media/.

Ship a Markdown mirror today

Install from PyPI, run crawl4ai-setup once, then crawl your sitemap with flags or interactive mode.

$ pip install crawlboy
View source on GitHub