Crawlboy - Site to Markdown CLI (Python, Crawl4AI) using sitemap

What's inside

Why teams use Crawlboy

Convert static sites, docs portals, and blogs into a structured Markdown corpus that is ready for search, RAG pipelines, mirrors, and offline reading, without building a custom crawler.

Sitemap discovery that “just finds it”

Crawlboy auto-detects the sitemap from robots.txt or common paths and follows nested <sitemapindex> entries recursively. Point it at a site root or paste a sitemap URL, and Crawlboy resolves the rest.

$ crawlboy --site-url 'https://example.com' --out-dir ./out
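As a rough illustration of the discovery step described above, the sketch below reads `Sitemap:` lines out of a robots.txt body and falls back to a short list of guessed locations. It is not Crawlboy's implementation: the function names and the `COMMON_SITEMAP_PATHS` list are assumptions made for this example, and the paths Crawlboy actually probes are not documented here.

```python
from typing import List, Optional
from urllib.parse import urljoin

# Assumed fallback locations, chosen for illustration only.
COMMON_SITEMAP_PATHS = ["/sitemap.xml", "/sitemap_index.xml"]


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Return the URLs declared on 'Sitemap:' lines (field name is case-insensitive)."""
    found = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def candidate_sitemaps(site_url: str, robots_txt: Optional[str]) -> List[str]:
    """Prefer sitemaps declared in robots.txt; otherwise guess common paths."""
    declared = sitemaps_from_robots(robots_txt or "")
    if declared:
        return declared
    return [urljoin(site_url, path) for path in COMMON_SITEMAP_PATHS]


robots = "User-agent: *\nDisallow: /private\nSitemap: https://example.com/sitemap.xml\n"
print(candidate_sitemaps("https://example.com", robots))
```

Fetching robots.txt and checking which guessed path actually responds is left out so the sketch stays network-free.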

Markdown output with stable paths

One .md file per page, mirroring the URL path. Optional raw HTML under html/ with --save-html.

$ crawlboy --sitemap-url 'https://example.com/sitemap.xml' --out-dir ./output

Images without duplicates

Save images to media/, content-addressed and deduplicated. Enable with --download-images.

$ crawlboy --site-url 'https://example.com' --out-dir ./out --download-images

Interactive Mode

Prefer prompts to flags? crawlboy -i walks through sitemap source, output directory, and options before the crawl starts.

$ crawlboy --interactive

interactive wizard

$ crawlboy -i

[Crawlboy ASCII-art banner]

? How would you like to specify the site?
❯ Provide a sitemap URL directly
  Provide a site root URL (auto-discover sitemap)
? Sitemap URL: https://example.com/sitemap.xml
? Output directory: ./out
? Options (space to select):
  ◉ Save raw HTML (--save-html)
  ◯ Download images (--download-images)
  ◯ Stop on first error (--fail-fast)

Starting crawl of https://example.com/sitemap.xml
✓ 42 URLs found

Guided setup for your first crawl

Run the guided wizard and Crawlboy will ask for the sitemap source, output directory, and optional settings. No flag memorization required.

Powered by questionary and Rich for a polished terminal experience.

$ crawlboy -i

Output Structure

From sitemap URLs to a folder of Markdown

output structure

out/
├── md/
│   ├── index.md                        ← site root /
│   ├── blog/
│   │   └── articles/
│   │       └── basic-git-commands.md
│   └── docs/
│       ├── getting-started.md
│       └── api-reference.md
├── html/                               ← with --save-html
│   └── ... (mirrors md/ structure)
├── media/                              ← with --download-images
│   ├── a3f8c1d2.png
│   └── b9e2a7f1.jpg
└── errors.jsonl                        ← crawl failures

1. URL paths become file paths

Each URL maps under your output directory’s md/ folder, mirroring the site hierarchy; / becomes index.md. For example, /blog/articles/basic-git-commands/ becomes out/md/blog/articles/basic-git-commands.md.
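The mapping rule can be sketched in a few lines. The helper below is a hypothetical `url_to_md_path`, not part of Crawlboy, and it only reproduces the two cases stated above (the root becoming index.md and a trailing slash being dropped). How Crawlboy treats query strings or percent-encoded paths is not covered here.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_md_path(url: str, out_dir: str = "out") -> str:
    """Map a page URL to its Markdown path under <out_dir>/md/."""
    path = urlparse(url).path.strip("/")  # "/blog/post/" -> "blog/post", "/" -> ""
    relative = f"{path}.md" if path else "index.md"
    return str(PurePosixPath(out_dir) / "md" / relative)


print(url_to_md_path("https://example.com/"))
print(url_to_md_path("https://example.com/blog/articles/basic-git-commands/"))
```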

2. Optional HTML mirror

Add --save-html to write raw HTML beside Markdown under html/, mirroring the same directory structure.

3. Content-addressed images

Add --download-images to store assets in media/ under content-hash filenames, so an image that appears on several pages is stored only once.
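Content addressing means the filename is derived from the bytes themselves, so identical images collapse to one file. The sketch below shows the idea with a hypothetical `store_image` helper. The hash function (SHA-256) and the 8-character prefix are assumptions chosen to resemble the filenames in the tree above; Crawlboy's actual scheme may differ.

```python
import hashlib
from pathlib import Path


def store_image(data: bytes, extension: str, media_dir: Path) -> Path:
    """Write image bytes under a name derived from their hash; skip known content."""
    # Assumption: SHA-256 truncated to 8 hex characters, for illustration only.
    digest = hashlib.sha256(data).hexdigest()[:8]
    target = media_dir / f"{digest}{extension}"
    if not target.exists():
        media_dir.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
    return target
```

Calling `store_image` twice with the same bytes returns the same path and writes the file once. A short hash prefix keeps names readable at the cost of a small collision risk, which a real tool would need to weigh.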

Failure tracking

Failed pages are appended to errors.jsonl and the crawl continues; add --fail-fast to stop immediately on the first error instead.
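After a crawl, the failure log can be inspected with a few lines of Python. This sketch assumes only what the .jsonl extension implies, namely one JSON object per line. The fields inside each record are not documented here, so the helper (a hypothetical `load_error_records`) returns the raw dictionaries without interpreting them.

```python
import json
from pathlib import Path
from typing import List


def load_error_records(path: Path) -> List[dict]:
    """Parse a JSON Lines file into a list of dicts; a missing file means no failures."""
    if not path.exists():
        return []
    records = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                records.append(json.loads(line))
    return records


failures = load_error_records(Path("out") / "errors.jsonl")
print(f"{len(failures)} failed page(s)")
```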

Get started

Install in two commands

install via pip

$ pip install crawlboy

$ crawl4ai-setup

Run crawl4ai-doctor to verify the browser setup is correct.

# Example usage:

$ crawlboy --sitemap-url 'https://example.com/sitemap.xml' --out-dir ./out

All flags

CLI flags (complete reference)

Every crawl requires an output directory (--out-dir) and either a direct sitemap (--sitemap-url) or a site root for discovery (--site-url).

Flag	Description	Default
`--sitemap-url`	Direct sitemap URL to crawl	`-`
`--site-url`	Site root URL - sitemap is auto-discovered from robots.txt or common paths	`-`
`--out-dir`	Directory to write output files into	`-`
`--delay`	Seconds to wait between each page crawl	`0`
`--page-timeout-ms`	Navigation timeout in milliseconds per page	`60000`
`--max-urls`	Cap the number of URLs to crawl	`unlimited`
`--save-html`	Write raw HTML files under `html/` in addition to Markdown	`off`
`--download-images`	Save images to `media/`, content-addressed and deduplicated	`off`
`--no-headless`	Show the browser window while crawling (disables headless mode)	`off`
`--fail-fast`	Stop the entire crawl on the first page error	`off`
`--include-offsite-urls`	Allow crawling URLs on hosts outside the origin site	`off`
`-i, --interactive`	Launch the guided wizard instead of using flags	`off`
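For scripted crawls, the flags in the table can be assembled into an argument list from Python. The helper below is a hypothetical wrapper, not something Crawlboy ships. It uses only flags listed above and rejects calls that pass both or neither of the two source flags, which is a conservative assumption since the behavior when both are given is not stated here.

```python
import shlex
from typing import List, Optional


def build_crawlboy_command(
    out_dir: str,
    *,
    sitemap_url: Optional[str] = None,
    site_url: Optional[str] = None,
    max_urls: Optional[int] = None,
    delay: Optional[float] = None,
    save_html: bool = False,
    download_images: bool = False,
    fail_fast: bool = False,
) -> List[str]:
    """Return an argv list for the crawlboy CLI."""
    if (sitemap_url is None) == (site_url is None):
        raise ValueError("pass exactly one of sitemap_url or site_url")

    cmd = ["crawlboy"]
    if sitemap_url is not None:
        cmd += ["--sitemap-url", sitemap_url]
    else:
        cmd += ["--site-url", site_url]
    cmd += ["--out-dir", out_dir]
    if max_urls is not None:
        cmd += ["--max-urls", str(max_urls)]
    if delay is not None:
        cmd += ["--delay", str(delay)]
    # Boolean switches are appended only when enabled.
    for enabled, flag in [
        (save_html, "--save-html"),
        (download_images, "--download-images"),
        (fail_fast, "--fail-fast"),
    ]:
        if enabled:
            cmd.append(flag)
    return cmd


command = build_crawlboy_command("./out", site_url="https://example.com", max_urls=5)
print(shlex.join(command))
# To launch the crawl: subprocess.run(command, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting issues with URLs that contain characters such as `&` or `?`.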

Battle-tested

Verified on real sites

Crawlboy is not just a demo: it has been tested against production websites. Here's an actual run against a live site, redirects and all.

Pages crawled: 5/5
Failures: 0
Total time: ~13s
Output size: 12.7 KB

verified run - zsh

$ crawlboy --site-url 'https://aksharahegde.xyz' --out-dir ./test-output --max-urls 5

Discovering sitemap from robots.txt ...
✓ Found sitemap at robots.txt
✓ Redirect handled: aksharahegde.xyz → www.aksharahegde.xyz

Crawling 5 URLs ...
✓ / → index.md 3.1 KB
✓ /blog → blog.md 3.2 KB
✓ /projects → projects.md 5.3 KB
✓ /resources → resources.md 628 B
✓ /shop → shop.md 564 B

Done. 5 pages crawled in ~13s · 0 failures

test-output/md/

index.md - homepage content
blog.md - blog page
projects.md - projects page
resources.md - resources page
shop.md - shop page
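To check a run like this one yourself, a short script can count the Markdown files and total their size. This is a minimal sketch with a hypothetical `summarize_markdown` helper. It assumes only the md/ layout shown above and is not part of Crawlboy.

```python
from pathlib import Path
from typing import Tuple


def summarize_markdown(md_dir: Path) -> Tuple[int, int]:
    """Return (number of .md files, total size in bytes) under md_dir, recursively."""
    files = [path for path in md_dir.rglob("*.md") if path.is_file()]
    return len(files), sum(path.stat().st_size for path in files)


count, total_bytes = summarize_markdown(Path("test-output") / "md")
print(f"{count} pages, {total_bytes / 1024:.1f} KB")
```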

FAQ

Frequently asked questions

What is Crawlboy?

Crawlboy is a Python command-line tool that crawls every URL listed in a website's sitemap, including nested sitemap indexes, and saves each page as Markdown using Crawl4AI.

How do I install Crawlboy?

Run pip install crawlboy, then crawl4ai-setup to install Playwright and Chromium. Use crawl4ai-doctor to verify your browser setup.

How do I crawl a site from its root URL?

Run crawlboy --site-url 'https://example.com' --out-dir ./out. Crawlboy discovers the sitemap from robots.txt or common paths.

How do I crawl a known sitemap URL?

Run crawlboy --sitemap-url 'https://example.com/sitemap.xml' --out-dir ./out.

Does Crawlboy support nested sitemap indexes?

Yes. Crawlboy follows sitemap index files and expands nested sitemapindex entries recursively.
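Recursive expansion can be sketched with the standard library. The example below is an illustration under stated assumptions, not Crawlboy's code: `expand_sitemap` and its `fetch` callback are hypothetical names, and the documents are assumed to use the standard sitemaps.org namespace. The `seen` set guards against index files that reference each other.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Set

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def expand_sitemap(
    url: str, fetch: Callable[[str], str], seen: Optional[Set[str]] = None
) -> List[str]:
    """Return page URLs from a sitemap, following <sitemapindex> entries recursively."""
    seen = set() if seen is None else seen
    if url in seen:  # already visited: avoid loops
        return []
    seen.add(url)

    root = ET.fromstring(fetch(url))
    if root.tag == f"{NS}sitemapindex":
        pages: List[str] = []
        for loc in root.findall(f"{NS}sitemap/{NS}loc"):
            if loc.text:
                pages.extend(expand_sitemap(loc.text.strip(), fetch, seen))
        return pages
    # Otherwise treat the document as a <urlset> of page entries.
    return [loc.text.strip() for loc in root.findall(f"{NS}url/{NS}loc") if loc.text]


# In-memory stand-in for HTTP fetching, so the sketch runs offline.
DOCS = {
    "https://example.com/sitemap.xml": (
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<sitemap><loc>https://example.com/sitemap-blog.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/sitemap-blog.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/blog/a</loc></url>"
        "<url><loc>https://example.com/blog/b</loc></url>"
        "</urlset>"
    ),
}

print(expand_sitemap("https://example.com/sitemap.xml", DOCS.__getitem__))
```

Passing `fetch` as a parameter keeps the traversal logic separate from networking, so the same function can be exercised against canned XML or a real HTTP client.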

What files are created in the output directory?

By default you get md/ with one .md file per URL and errors.jsonl for failures. Use --save-html for html/ and --download-images for media/.

Ship a Markdown mirror today

Install from PyPI, run crawl4ai-setup once, then crawl your sitemap with flags or interactive mode.