Turn any sitemap into clean Markdown.
Built for Crawl4AI workflows.
Crawlboy reads every URL from a sitemap (including nested <sitemapindex> files), renders each page with Crawl4AI, and writes one .md file per URL, mirroring paths under your output directory’s md/ folder. Add raw HTML and downloaded images with the flags in the CLI reference below; failed pages are logged to errors.jsonl.
Open source under MIT.
Why teams use Crawlboy
Batch static sites, docs portals, and blogs into a structured Markdown corpus, ready for search, RAG pipelines, mirrors, and offline reading, without building a custom crawler.
Sitemap discovery that “just finds it”
Auto-detects the sitemap from robots.txt or common paths and follows nested <sitemapindex> entries recursively. Point at a site root or paste a sitemap URL; Crawlboy resolves the rest.
Markdown output with stable paths
One .md file per page, mirroring the URL path. Optional raw HTML under html/ with --save-html.
Images without duplicates
Save images to media/, content-addressed and deduplicated. Enable with --download-images.
Prefer prompts to flags?
crawlboy -i walks through sitemap source, output directory, and options before the crawl starts.
$ crawlboy --interactive
? How would you like to specify the site?
❯ Provide a sitemap URL directly
  Provide a site root URL (auto-discover sitemap)
? Sitemap URL: https://example.com/sitemap.xml
? Output directory: ./out
? Options (space to select):
  ◉ Save raw HTML (--save-html)
  ◯ Download images (--download-images)
  ◯ Stop on first error (--fail-fast)
Starting crawl of https://example.com/sitemap.xml
✓ 42 URLs found
Guided setup for your first crawl
Run the guided wizard and Crawlboy will ask for the sitemap source, output directory, and optional settings, with no flags to memorize.
Powered by questionary and Rich for a polished terminal experience.
From sitemap URLs to a folder of Markdown
URL paths become file paths
Each URL maps under your output directory’s md/ folder, mirroring the site hierarchy; / becomes index.md. For example, /blog/articles/basic-git-commands/ becomes out/md/blog/articles/basic-git-commands.md.
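The mapping described above can be sketched in a few lines of Python. This is an illustration only, not Crawlboy's actual code: the helper name url_to_md_path is hypothetical, and the sketch assumes query strings and fragments are simply ignored.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def url_to_md_path(url: str, out_dir: str = "out") -> str:
    """Map a page URL to a Markdown path under <out_dir>/md/ (hypothetical helper)."""
    # Keep only the path component; scheme, host, query, and fragment are dropped
    # (an assumption of this sketch).
    path = urlsplit(url).path.strip("/")
    # The site root has an empty path and maps to index.md.
    relative = f"{path}.md" if path else "index.md"
    return str(PurePosixPath(out_dir) / "md" / relative)


print(url_to_md_path("https://example.com/"))
print(url_to_md_path("https://example.com/blog/articles/basic-git-commands/"))
```

The two calls reproduce the examples from the text: the root URL lands at out/md/index.md and the article URL at out/md/blog/articles/basic-git-commands.md.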
Optional HTML mirror
Add --save-html to write raw HTML beside Markdown under html/, mirroring the same directory structure.
Content-addressed images
Add --download-images to store assets in media/, named by content hash, so duplicates across pages are stored only once.
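To make "content-addressed" concrete, here is a minimal sketch of the general technique. It assumes SHA-256 and a flat media/ layout; the hash function, file naming, and the store_image helper are all assumptions of this example, not details confirmed for Crawlboy.

```python
import hashlib
import tempfile
from pathlib import Path


def store_image(data: bytes, media_dir: Path, suffix: str) -> Path:
    """Store image bytes under a name derived from their hash (hypothetical helper)."""
    # Identical bytes always produce the same digest, hence the same filename.
    digest = hashlib.sha256(data).hexdigest()
    target = media_dir / f"{digest}{suffix}"
    if not target.exists():
        # Only the first copy of a given image is written to disk.
        media_dir.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
    return target


with tempfile.TemporaryDirectory() as tmp:
    media = Path(tmp) / "media"
    first = store_image(b"same-bytes", media, ".png")
    second = store_image(b"same-bytes", media, ".png")
    # Both calls resolve to one path, and media/ holds a single file.
    print(first == second, len(list(media.iterdir())))
```

Because the filename depends only on the bytes, an image referenced from many pages resolves to the same file each time.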
Failure tracking
Failed pages are appended to errors.jsonl; add --fail-fast to stop immediately on the first error instead.
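The two failure modes can be sketched as follows. The record fields (url, error), the helper names, and the choice to re-raise without logging under fail-fast are assumptions made for illustration; the real errors.jsonl schema may differ.

```python
import json
import tempfile
from pathlib import Path


def record_failure(out_dir: Path, url: str, error: Exception) -> None:
    """Append one JSON object per failed page (field names are assumed)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with (out_dir / "errors.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"url": url, "error": str(error)}) + "\n")


def crawl_all(urls, crawl_page, out_dir: Path, fail_fast: bool = False) -> int:
    """Crawl each URL; log failures and continue, or re-raise on the first one."""
    failed = 0
    for url in urls:
        try:
            crawl_page(url)
        except Exception as exc:
            if fail_fast:
                raise  # stop the whole crawl immediately
            record_failure(out_dir, url, exc)
            failed += 1
    return failed


def flaky_page(url: str) -> None:
    """Stand-in for a page render that fails on some URLs."""
    if "broken" in url:
        raise RuntimeError("render failed")


with tempfile.TemporaryDirectory() as tmp:
    count = crawl_all(
        ["https://example.com/ok", "https://example.com/broken"],
        flaky_page,
        Path(tmp),
    )
    print(count, (Path(tmp) / "errors.jsonl").read_text(encoding="utf-8").strip())
```

JSON Lines suits this job because each failure is a self-contained line, so the log can be appended to during a long crawl and processed line by line afterwards.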
Install in two commands
pip install crawlboy
crawl4ai-setup
crawlboy --sitemap-url https://example.com/sitemap.xml \
  --out-dir ./out
CLI flags (complete reference)
Every crawl requires an output directory (--out-dir) and either a direct sitemap (--sitemap-url) or a site root for discovery (--site-url).
Flag                     Default    Description
--sitemap-url            -          Direct sitemap URL to crawl
--site-url               -          Site root URL; the sitemap is auto-discovered from robots.txt or common paths
--out-dir                -          Directory to write output files into
--delay                  0          Seconds to wait between each page crawl
--page-timeout-ms        60000      Navigation timeout in milliseconds per page
--max-urls               unlimited  Cap the number of URLs to crawl
--save-html              off        Write raw HTML files under html/ in addition to Markdown
--download-images        off        Save images to media/, content-addressed and deduplicated
--no-headless            off        Show the browser window while crawling (disables headless mode)
--fail-fast              off        Stop the entire crawl on the first page error
--include-offsite-urls   off        Allow crawling URLs on hosts outside the origin site
-i, --interactive        off        Launch the guided wizard instead of using flags
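As a rough illustration of what --include-offsite-urls and --max-urls control, the sketch below filters a URL list by host and then caps it. The select_urls helper, the exact-hostname comparison, and the filter-then-cap order are assumptions of this example; Crawlboy's own matching rules (for instance around www redirects) may be more lenient.

```python
from urllib.parse import urlsplit


def select_urls(urls, origin, include_offsite=False, max_urls=None):
    """Drop off-site URLs unless allowed, then cap the list (hypothetical helper)."""
    origin_host = urlsplit(origin).hostname
    selected = [
        url
        for url in urls
        # Exact hostname match is an assumption of this sketch.
        if include_offsite or urlsplit(url).hostname == origin_host
    ]
    # max_urls=None mirrors the "unlimited" default in the table above.
    return selected if max_urls is None else selected[:max_urls]


urls = [
    "https://example.com/",
    "https://cdn.example.net/asset",
    "https://example.com/blog",
]
print(select_urls(urls, "https://example.com"))
print(select_urls(urls, "https://example.com", include_offsite=True, max_urls=2))
```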
Verified on real sites
Not just a demo: Crawlboy has been tested against production websites. Here's an actual run against a live site, redirects and all.
✓ Found sitemap at robots.txt
✓ Redirect handled: aksharahegde.xyz → www.aksharahegde.xyz
✓ / → index.md 3.1 KB
✓ /blog → blog.md 3.2 KB
✓ /projects → projects.md 5.3 KB
✓ /resources → resources.md 628 B
✓ /shop → shop.md 564 B
blog.md - blog page
projects.md - projects page
resources.md - resources page
shop.md - shop page
Frequently asked questions
What is Crawlboy?
Crawlboy is a Python command-line tool that crawls every URL listed in a website sitemap, including nested sitemap indexes, and saves each page as Markdown using Crawl4AI.
How do I install Crawlboy?
Run pip install crawlboy, then crawl4ai-setup to install Playwright and Chromium. Use crawl4ai-doctor to verify your browser setup.
How do I crawl a site from its root URL?
Run crawlboy --site-url 'https://example.com' --out-dir ./out. Crawlboy discovers the sitemap from robots.txt or common paths.
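Discovery of this kind usually means reading Sitemap: lines from robots.txt and falling back to well-known locations. The sketch below shows that general approach on an in-memory robots.txt; the function names and the fallback path list are assumptions, not Crawlboy's actual list.

```python
from urllib.parse import urljoin

# Assumed fallback locations; the paths Crawlboy really probes are not documented here.
COMMON_SITEMAP_PATHS = ("/sitemap.xml", "/sitemap_index.xml")


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Collect the URLs declared on 'Sitemap:' lines of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own colons stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            found.append(value.strip())
    return found


def candidate_sitemaps(site_url: str, robots_txt: str) -> list[str]:
    """Prefer sitemaps declared in robots.txt, else guess common paths."""
    declared = sitemaps_from_robots(robots_txt)
    if declared:
        return declared
    return [urljoin(site_url, path) for path in COMMON_SITEMAP_PATHS]


robots = "User-agent: *\nDisallow: /admin\nSitemap: https://example.com/sitemap.xml\n"
print(candidate_sitemaps("https://example.com", robots))
print(candidate_sitemaps("https://example.com", "User-agent: *\n"))
```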
How do I crawl a known sitemap URL?
Run crawlboy --sitemap-url 'https://example.com/sitemap.xml' --out-dir ./out.
Does Crawlboy support nested sitemap indexes?
Yes. Crawlboy follows sitemap index files and expands nested sitemapindex entries recursively.
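Recursive expansion of a sitemap index can be sketched with the standard library alone. This is an illustration of the technique, not Crawlboy's implementation: expand_sitemap and the injected fetch callable are hypothetical, and the sample documents live in a dictionary so nothing touches the network.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def expand_sitemap(url, fetch, seen=None):
    """Return page URLs, following <sitemapindex> children recursively."""
    seen = set() if seen is None else seen
    if url in seen:
        return []  # guard against index files that reference each other
    seen.add(url)
    root = ET.fromstring(fetch(url))
    if root.tag.endswith("sitemapindex"):
        pages = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            pages.extend(expand_sitemap(loc.text.strip(), fetch, seen))
        return pages
    # A plain <urlset>: each <url><loc> is a page to crawl.
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


XMLNS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
DOCS = {
    "https://example.com/sitemap.xml": (
        f"<sitemapindex {XMLNS}>"
        "<sitemap><loc>https://example.com/blog.xml</loc></sitemap>"
        "<sitemap><loc>https://example.com/docs.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/blog.xml": (
        f"<urlset {XMLNS}><url><loc>https://example.com/blog</loc></url></urlset>"
    ),
    "https://example.com/docs.xml": (
        f"<urlset {XMLNS}><url><loc>https://example.com/docs</loc></url></urlset>"
    ),
}

print(expand_sitemap("https://example.com/sitemap.xml", DOCS.__getitem__))
```

Passing fetch in as a parameter keeps the parsing logic testable; a real crawl would supply a function that downloads the URL.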
What files are created in the output directory?
By default you get md/ with one .md file per URL and errors.jsonl for failures. Use --save-html for html/ and --download-images for media/.
Ship a Markdown mirror today
Install from PyPI, run crawl4ai-setup once, then crawl your sitemap with flags or interactive mode.