Scraper

Convert any webpage into clean, AI-ready content with a single API call.

The Scraper API extracts content from any public URL and returns it in several formats, including Markdown, processed HTML, and raw HTML. It automatically handles JavaScript-rendered pages, blocks ads and cookie popups, and extracts just the main content you need. Whether you're building a RAG pipeline, monitoring competitor websites, collecting training data, or archiving web content, the Scraper API provides structured output ready for downstream AI processing.

Authentication

All requests require an API key passed as a Bearer token in the Authorization header:

Authorization: Bearer YOUR_API_KEY
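
As a minimal sketch, the header can be attached with Python's standard library as shown here. The `GREENPT_API_KEY` environment variable and the `build_request` helper are assumptions made for illustration; neither is part of the API.

```python
import json
import os
import urllib.request

SCRAPE_URL = "https://api.greenpt.ai/v1/tools/crawl/scrape"


def build_request(payload, api_key=None):
    """Return an unsent POST request that carries the Bearer token."""
    # Assumption: the key is stored in an environment variable of this name.
    key = api_key or os.environ.get("GREENPT_API_KEY")
    if not key:
        raise ValueError("no API key provided")
    return urllib.request.Request(
        SCRAPE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_request({"url": "https://example.com"}, api_key="YOUR_API_KEY")
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```

Reading the key from the environment rather than hard-coding it keeps it out of source control.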

Output formats

Choose from multiple output formats to get exactly the data you need:

Format	Description
`markdown`	Clean, LLM-ready text content of the page
`html`	Processed HTML with scripts/styles removed
`rawHtml`	Original unmodified HTML
`links`	Array of all URLs found on the page
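
The format names in the table can be checked on the client before a request is sent. The sketch below is one way to do that; `normalize_formats` is a hypothetical helper, and whether the API itself treats names as case-sensitive is not stated here, so the check simply requires the exact spellings from the table.

```python
SUPPORTED_FORMATS = ("markdown", "html", "rawHtml", "links")


def normalize_formats(formats=None):
    """Return a de-duplicated list of format names, rejecting unknown ones."""
    if formats is None:
        return ["markdown"]  # the documented default for `formats`
    unknown = [name for name in formats if name not in SUPPORTED_FORMATS]
    if unknown:
        raise ValueError(f"unsupported formats: {unknown}")
    # dict.fromkeys drops duplicates while keeping the original order.
    return list(dict.fromkeys(formats))


print(normalize_formats(["links", "markdown", "links"]))
# prints ['links', 'markdown']
```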

Highlights

Clean output: extract just the main content, excluding headers, navigation, and footers.
Multiple formats: get content as Markdown, HTML, or raw HTML, or extract all links.
Ad blocking: automatically blocks ads and removes cookie popups.

API endpoint

POST https://api.greenpt.ai/v1/tools/crawl/scrape

Basic example

A minimal request. Only `url` is required; because `formats` defaults to `["markdown"]`, the response contains the page as Markdown.

curl -X POST https://api.greenpt.ai/v1/tools/crawl/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'

const response = await fetch(
  'https://api.greenpt.ai/v1/tools/crawl/scrape',
  {
    method: 'POST',
    headers: {
      Authorization: 'Bearer YOUR_API_KEY',
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      url: 'https://example.com',
    }),
  },
);

const result = await response.json();
console.log(result);

import requests

url = "https://api.greenpt.ai/v1/tools/crawl/scrape"

payload = {
  "url": "https://example.com"
}
headers = {
  "Authorization": "Bearer YOUR_API_KEY",
  "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

print(response.json())

Advanced example

A request that asks for two output formats and sets several options explicitly.

curl -X POST https://api.greenpt.ai/v1/tools/crawl/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "links"],
    "onlyMainContent": true,
    "mobile": false,
    "timeout": 60000,
    "blockAds": true
  }'

const response = await fetch(
  'https://api.greenpt.ai/v1/tools/crawl/scrape',
  {
    method: 'POST',
    headers: {
      Authorization: 'Bearer YOUR_API_KEY',
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      url: 'https://example.com',
      formats: ['markdown', 'links'],
      onlyMainContent: true,
      mobile: false,
      timeout: 60000,
      blockAds: true,
    }),
  },
);

const result = await response.json();
console.log(result);

import requests

url = "https://api.greenpt.ai/v1/tools/crawl/scrape"

payload = {
  "url": "https://example.com",
  "formats": ["markdown", "links"],
  "onlyMainContent": True,
  "mobile": False,
  "timeout": 60000,
  "blockAds": True
}
headers = {
  "Authorization": "Bearer YOUR_API_KEY",
  "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

print(response.json())

Parameters

Parameter	Type	Required	Description
`url`	string	Yes	The URL to scrape (`http` or `https`).
`formats`	array	No	Output formats: `"markdown"`, `"html"`, `"rawHtml"`, `"links"`. Default: `["markdown"]`.
`onlyMainContent`	boolean	No	Exclude headers, navs, and footers from output.
`includeTags`	array	No	HTML tags to include in output.
`excludeTags`	array	No	HTML tags to exclude from output.
`waitFor`	integer	No	Delay in ms before fetching content.
`mobile`	boolean	No	Emulate mobile device.
`skipTlsVerification`	boolean	No	Skip TLS certificate verification.
`timeout`	integer	No	Request timeout in ms (max: `300000`).
`removeBase64Images`	boolean	No	Remove base64 encoded images from output.
`blockAds`	boolean	No	Enable ad-blocking and cookie popup removal.
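
The constraints in the table (an `http` or `https` URL, a `timeout` of at most `300000` ms) can be enforced before the request leaves the client. The following is a rough sketch under stated assumptions: `build_scrape_payload` is a hypothetical helper, the requirement that `timeout` be positive is assumed rather than documented, and the tag names passed to `excludeTags` are illustrative.

```python
from urllib.parse import urlparse

MAX_TIMEOUT_MS = 300_000  # documented maximum for `timeout`


def build_scrape_payload(url, **options):
    """Check the documented constraints and return a JSON-ready dict."""
    if urlparse(url).scheme not in ("http", "https"):
        raise ValueError("url must use http or https")
    timeout = options.get("timeout")
    # Assumption: a timeout must be positive; only the maximum is documented.
    if timeout is not None and not 0 < timeout <= MAX_TIMEOUT_MS:
        raise ValueError(f"timeout must be between 1 and {MAX_TIMEOUT_MS} ms")
    # Options left as None are omitted so the API applies its own defaults.
    payload = {"url": url}
    payload.update({k: v for k, v in options.items() if v is not None})
    return payload


payload = build_scrape_payload(
    "https://example.com",
    formats=["markdown", "links"],
    excludeTags=["nav", "footer"],  # illustrative tag names
    waitFor=2000,
    timeout=60000,
)
```

The resulting dict can be passed as the JSON body of the POST request.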

Response format

{
  "success": true,
  "data": {
    "markdown": "# Page Title\n\nPage content in markdown...",
    "html": "<!DOCTYPE html><html>...</html>",
    "links": [
      "https://example.com/page1",
      "https://example.com/page2"
    ],
    "metadata": {
      "title": "Page Title",
      "description": "Page description",
      "language": "en",
      "sourceURL": "https://example.com",
      "statusCode": 200
    }
  }
}
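
A parsed response of this shape might be unpacked as sketched below. `extract_content` is a hypothetical helper. Two points are assumptions: that a failed scrape sets `success` to `false` (the shape of an error body isn't described here), and that formats which weren't requested are simply absent from `data`.

```python
def extract_content(result):
    """Return (markdown, links, metadata) from a parsed scrape response."""
    # Assumption: failures are signalled by a falsy "success" field.
    if not result.get("success"):
        raise RuntimeError(f"scrape failed: {result}")
    data = result.get("data", {})
    # Assumption: formats that were not requested are missing from "data".
    return data.get("markdown"), data.get("links", []), data.get("metadata", {})


sample = {
    "success": True,
    "data": {
        "markdown": "# Page Title\n\nPage content in markdown...",
        "links": ["https://example.com/page1", "https://example.com/page2"],
        "metadata": {"title": "Page Title", "statusCode": 200},
    },
}

markdown, links, metadata = extract_content(sample)
print(metadata["title"], metadata["statusCode"], len(links))
# prints: Page Title 200 2
```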

Use cases

AI content processing: convert webpages to Markdown for LLM training or RAG pipelines.
Content monitoring: monitor websites for changes and updates.
Data collection: extract content from web pages for analysis.
Link discovery: find all links on a page for crawling or SEO analysis.
Content archiving: save webpage content in clean Markdown or HTML format.
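
For the link discovery use case, the `links` array usually needs filtering before it feeds a crawler. A minimal sketch, assuming the array holds absolute URLs as in the response example; `same_site_links` is a hypothetical helper, not part of the API.

```python
from urllib.parse import urlparse


def same_site_links(links, source_url):
    """Keep unique links whose host matches the scraped page's host."""
    host = urlparse(source_url).netloc.lower()
    kept = []
    for link in links:
        if urlparse(link).netloc.lower() == host and link not in kept:
            kept.append(link)
    return kept


links = [
    "https://example.com/page1",
    "https://other.example.org/about",
    "https://example.com/page1",
    "https://example.com/page2",
]
print(same_site_links(links, "https://example.com"))
# prints ['https://example.com/page1', 'https://example.com/page2']
```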
