Scraper
Convert any webpage into clean, AI-ready content with a single API call.
The Scraper API lets you extract content from any public URL and receive it as Markdown, processed HTML, raw HTML, or a list of links. It automatically handles JavaScript-rendered pages, blocks ads and cookie popups, and extracts just the main content you need. Whether you're building a RAG pipeline, monitoring competitor websites, collecting training data, or archiving web content, the Scraper API provides reliable, structured output ready for downstream AI processing.
Authentication
All requests require an API key passed as a Bearer token in the Authorization header:

Authorization: Bearer YOUR_API_KEY

Output formats
Choose from multiple output formats to get exactly the data you need:
| Format | Description |
|---|---|
| markdown | Clean, LLM-ready text content of the page |
| html | Processed HTML with scripts/styles removed |
| rawHtml | Original unmodified HTML |
| links | Array of all URLs found on the page |
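
The format names in the table above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: `normalize_formats` is a hypothetical helper, and the fallback to `["markdown"]` mirrors the default stated for the `formats` parameter in this document.

```python
# Format names documented in the table above (case-sensitive).
SUPPORTED_FORMATS = ("markdown", "html", "rawHtml", "links")


def normalize_formats(formats=None):
    """Return a de-duplicated list of valid format names (hypothetical helper).

    Falls back to ["markdown"], the documented default, when nothing is given.
    """
    if not formats:
        return ["markdown"]
    result = []
    for name in formats:
        if name not in SUPPORTED_FORMATS:
            raise ValueError(f"unsupported format: {name!r}")
        if name not in result:
            result.append(name)  # keep first occurrence, preserve order
    return result


print(normalize_formats(["markdown", "links", "markdown"]))
```

The returned list can be passed as the `formats` field of the request body.
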
Highlights
- Clean output: extract just the main content, excluding headers, navigation, and footers.
- Multiple formats: get content as Markdown, HTML, or raw HTML, or extract all links.
- Ad blocking: automatically blocks ads and removes cookie popups.
API endpoint
POST https://api.greenpt.ai/v1/tools/crawl/scrape

Basic example
A simple request to scrape a webpage and get markdown content.
curl -X POST https://api.greenpt.ai/v1/tools/crawl/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'

const response = await fetch(
  'https://api.greenpt.ai/v1/tools/crawl/scrape',
  {
    method: 'POST',
    headers: {
      Authorization: 'Bearer YOUR_API_KEY',
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      url: 'https://example.com',
    }),
  },
);
const result = await response.json();
console.log(result);

import requests

url = "https://api.greenpt.ai/v1/tools/crawl/scrape"
payload = {
    "url": "https://example.com"
}
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}
response = requests.post(url, json=payload, headers=headers)
print(response.json())

Advanced example
A more advanced request with multiple formats and configuration options.
curl -X POST https://api.greenpt.ai/v1/tools/crawl/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "links"],
    "onlyMainContent": true,
    "mobile": false,
    "timeout": 60000,
    "blockAds": true
  }'

const response = await fetch(
  'https://api.greenpt.ai/v1/tools/crawl/scrape',
  {
    method: 'POST',
    headers: {
      Authorization: 'Bearer YOUR_API_KEY',
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      url: 'https://example.com',
      formats: ['markdown', 'links'],
      onlyMainContent: true,
      mobile: false,
      timeout: 60000,
      blockAds: true,
    }),
  },
);
const result = await response.json();
console.log(result);

import requests

url = "https://api.greenpt.ai/v1/tools/crawl/scrape"
payload = {
    "url": "https://example.com",
    "formats": ["markdown", "links"],
    "onlyMainContent": True,
    "mobile": False,
    "timeout": 60000,
    "blockAds": True
}
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}
response = requests.post(url, json=payload, headers=headers)
print(response.json())

Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | The URL to scrape (http or https). |
| formats | array | No | Output formats: "markdown", "html", "rawHtml", "links". Default: ["markdown"]. |
| onlyMainContent | boolean | No | Exclude headers, navs, and footers from output. |
| includeTags | array | No | HTML tags to include in output. |
| excludeTags | array | No | HTML tags to exclude from output. |
| waitFor | integer | No | Delay in ms before fetching content. |
| mobile | boolean | No | Emulate mobile device. |
| skipTlsVerification | boolean | No | Skip TLS certificate verification. |
| timeout | integer | No | Request timeout in ms (max: 300000). |
| removeBase64Images | boolean | No | Remove base64 encoded images from output. |
| blockAds | boolean | No | Enable ad-blocking and cookie popup removal. |
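
The constraints in the parameter table (required `url` with an http or https scheme, the listed types, and the 300000 ms cap on `timeout`) can be enforced locally before sending a request. This is a sketch under stated assumptions: `build_scrape_payload` and `MAX_TIMEOUT_MS` are hypothetical names, and how the server itself reacts to invalid input is not documented here.

```python
from urllib.parse import urlparse

MAX_TIMEOUT_MS = 300_000  # documented upper bound for "timeout"

# Optional parameters and their Python types, taken from the table above.
OPTIONAL_PARAMS = {
    "formats": list,
    "onlyMainContent": bool,
    "includeTags": list,
    "excludeTags": list,
    "waitFor": int,
    "mobile": bool,
    "skipTlsVerification": bool,
    "timeout": int,
    "removeBase64Images": bool,
    "blockAds": bool,
}


def build_scrape_payload(url, **options):
    """Build a request body for the scrape endpoint (hypothetical helper)."""
    if urlparse(url).scheme not in ("http", "https"):
        raise ValueError("url must use http or https")

    payload = {"url": url}
    for name, value in options.items():
        if name not in OPTIONAL_PARAMS:
            raise ValueError(f"unknown parameter: {name}")
        expected = OPTIONAL_PARAMS[name]
        # bool is a subclass of int in Python, so reject it for integer fields.
        wrong_type = not isinstance(value, expected)
        bool_as_int = expected is int and isinstance(value, bool)
        if wrong_type or bool_as_int:
            raise TypeError(f"{name} must be of type {expected.__name__}")
        payload[name] = value

    if payload.get("timeout", 0) > MAX_TIMEOUT_MS:
        raise ValueError(f"timeout must not exceed {MAX_TIMEOUT_MS} ms")
    return payload


payload = build_scrape_payload(
    "https://example.com",
    formats=["markdown", "links"],
    onlyMainContent=True,
    timeout=60000,
)
print(payload)
```

The resulting dictionary can be passed as `json=payload` to `requests.post`, as in the Python examples in this document.
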
Response format
{
  "success": true,
  "data": {
    "markdown": "# Page Title\n\nPage content in markdown...",
    "html": "<!DOCTYPE html><html>...</html>",
    "links": [
      "https://example.com/page1",
      "https://example.com/page2"
    ],
    "metadata": {
      "title": "Page Title",
      "description": "Page description",
      "language": "en",
      "sourceURL": "https://example.com",
      "statusCode": 200
    }
  }
}

Use cases
- AI content processing: convert webpages to markdown for LLM training or RAG pipelines.
- Content monitoring: monitor websites for changes and updates.
- Data collection: extract content from web pages for analysis.
- Link discovery: find all links on a page for crawling or SEO analysis.
- Content archiving: save webpage content in clean markdown or HTML format.
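
For the RAG and link-discovery use cases, a parsed response shaped like the one under Response format can be reduced to the Markdown body and the links on the same host as the scraped page. This is an illustrative sketch: `extract_markdown_and_links` is a hypothetical helper, the sample dictionary is hand-written rather than real API output, and treating a false or missing `success` field as a failure is an assumption, since the error response shape is not documented here.

```python
from urllib.parse import urlparse


def extract_markdown_and_links(result, same_host_only=True):
    """Pull markdown and links out of a parsed scrape response (hypothetical helper)."""
    # Assumption: a failed scrape is signalled by a false or missing "success".
    if not result.get("success"):
        raise RuntimeError("scrape request did not succeed")

    data = result.get("data", {})
    markdown = data.get("markdown", "")
    links = data.get("links", [])

    if same_host_only:
        # Keep only links whose host matches the scraped page's sourceURL.
        source = data.get("metadata", {}).get("sourceURL", "")
        host = urlparse(source).netloc
        links = [link for link in links if urlparse(link).netloc == host]
    return markdown, links


# Hand-written sample mirroring the documented response shape.
sample = {
    "success": True,
    "data": {
        "markdown": "# Page Title\n\nPage content in markdown...",
        "links": [
            "https://example.com/page1",
            "https://example.com/page2",
            "https://other.example.org/",
        ],
        "metadata": {"sourceURL": "https://example.com", "statusCode": 200},
    },
}

markdown, links = extract_markdown_and_links(sample)
print(markdown.splitlines()[0])
print(links)
```

Passing `same_host_only=False` returns every link unchanged, which suits SEO analysis where external links matter.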