GreenPT Docs

Scraper

Convert any webpage into clean, AI-ready content with a single API call.

The Scraper API extracts content from any public URL and returns it in one or more formats: Markdown, processed HTML, raw HTML, or a list of links. It handles JavaScript-rendered pages, blocks ads and cookie popups, and extracts just the main content. Whether you're building a RAG pipeline, monitoring competitor websites, collecting training data, or archiving web content, the Scraper API returns structured output ready for downstream AI processing.

Authentication

All requests require an API key passed as a Bearer token in the Authorization header:

Authorization: Bearer YOUR_API_KEY
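
As a minimal sketch, the header can be built from an environment variable instead of hard-coding the key. The helper name `build_headers` and the variable name `GREENPT_API_KEY` are assumptions made for this illustration; the API itself only requires the Bearer token header shown above.

```python
import os
from typing import Optional


def build_headers(api_key: Optional[str] = None) -> dict:
    """Return the HTTP headers for a Scraper API request.

    GREENPT_API_KEY is a hypothetical environment variable name used only
    in this sketch; pass api_key explicitly to bypass it.
    """
    key = api_key or os.environ.get("GREENPT_API_KEY")
    if not key:
        raise ValueError("No API key provided")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

The returned dictionary can be passed as the `headers` argument of whichever HTTP client you use.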

Output formats

Choose from multiple output formats to get exactly the data you need:

| Format | Description |
| --- | --- |
| markdown | Clean, LLM-ready text content of the page |
| html | Processed HTML with scripts/styles removed |
| rawHtml | Original unmodified HTML |
| links | Array of all URLs found on the page |
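
A client can check format names before sending a request, which catches typos such as `rawhtml` early. The sketch below is a client-side convenience only: `SUPPORTED_FORMATS` and `validate_formats` are hypothetical names, and it assumes the four names in the table above are the complete, case-sensitive set.

```python
# The four format names listed in the table above.
SUPPORTED_FORMATS = ("markdown", "html", "rawHtml", "links")


def validate_formats(formats):
    """Return the formats as a list, rejecting names the table does not list."""
    unknown = [f for f in formats if f not in SUPPORTED_FORMATS]
    if unknown:
        raise ValueError("Unsupported format(s): " + ", ".join(unknown))
    return list(formats)


print(validate_formats(["markdown", "links"]))  # ['markdown', 'links']
```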

Highlights

  • Clean output: extract just the main content, excluding headers, navigation, and footers.
  • Multiple formats: get content as Markdown, HTML, raw HTML, or extract all links.
  • Ad blocking: automatically blocks ads and removes cookie popups.

API endpoint

POST https://api.greenpt.ai/v1/tools/crawl/scrape

Basic example

A minimal request that scrapes a webpage. With no formats specified, the response contains Markdown, the default format.

cURL

curl -X POST https://api.greenpt.ai/v1/tools/crawl/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'

JavaScript

const response = await fetch(
  'https://api.greenpt.ai/v1/tools/crawl/scrape',
  {
    method: 'POST',
    headers: {
      Authorization: 'Bearer YOUR_API_KEY',
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      url: 'https://example.com',
    }),
  },
);

const result = await response.json();
console.log(result);

Python

import requests

url = "https://api.greenpt.ai/v1/tools/crawl/scrape"

payload = {
  "url": "https://example.com"
}
headers = {
  "Authorization": "Bearer YOUR_API_KEY",
  "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

print(response.json())

Advanced example

A more advanced request with multiple formats and configuration options.

cURL

curl -X POST https://api.greenpt.ai/v1/tools/crawl/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "links"],
    "onlyMainContent": true,
    "mobile": false,
    "timeout": 60000,
    "blockAds": true
  }'

JavaScript

const response = await fetch(
  'https://api.greenpt.ai/v1/tools/crawl/scrape',
  {
    method: 'POST',
    headers: {
      Authorization: 'Bearer YOUR_API_KEY',
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      url: 'https://example.com',
      formats: ['markdown', 'links'],
      onlyMainContent: true,
      mobile: false,
      timeout: 60000,
      blockAds: true,
    }),
  },
);

const result = await response.json();
console.log(result);

Python

import requests

url = "https://api.greenpt.ai/v1/tools/crawl/scrape"

payload = {
  "url": "https://example.com",
  "formats": ["markdown", "links"],
  "onlyMainContent": True,
  "mobile": False,
  "timeout": 60000,
  "blockAds": True
}
headers = {
  "Authorization": "Bearer YOUR_API_KEY",
  "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

print(response.json())

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | The URL to scrape (http or https). |
| formats | array | No | Output formats: "markdown", "html", "rawHtml", "links". Default: ["markdown"]. |
| onlyMainContent | boolean | No | Exclude headers, navs, and footers from output. |
| includeTags | array | No | HTML tags to include in output. |
| excludeTags | array | No | HTML tags to exclude from output. |
| waitFor | integer | No | Delay in ms before fetching content. |
| mobile | boolean | No | Emulate mobile device. |
| skipTlsVerification | boolean | No | Skip TLS certificate verification. |
| timeout | integer | No | Request timeout in ms (max: 300000). |
| removeBase64Images | boolean | No | Remove base64 encoded images from output. |
| blockAds | boolean | No | Enable ad-blocking and cookie popup removal. |
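
The parameters above can be assembled with a small helper that sends only the options you set and enforces the documented limits on the client side. This is a sketch under stated assumptions: `build_scrape_payload` and `MAX_TIMEOUT_MS` are hypothetical names, and the tag values passed to `excludeTags` assume that plain tag names are accepted, which the table does not spell out.

```python
MAX_TIMEOUT_MS = 300_000  # documented upper bound for "timeout"


def build_scrape_payload(url, timeout=None, **options):
    """Build the JSON body for a scrape request.

    Only options that are explicitly set are included, so the API's own
    defaults apply to everything else.
    """
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must start with http:// or https://")
    payload = {"url": url}
    if timeout is not None:
        if not 0 < timeout <= MAX_TIMEOUT_MS:
            raise ValueError("timeout must be between 1 and 300000 ms")
        payload["timeout"] = timeout
    # Drop options left as None rather than sending nulls.
    payload.update({k: v for k, v in options.items() if v is not None})
    return payload


payload = build_scrape_payload(
    "https://example.com",
    timeout=60000,
    formats=["markdown"],
    excludeTags=["nav", "footer"],  # assumed value format: plain tag names
    waitFor=2000,
)
print(payload)
```

Because unset options are omitted rather than sent as nulls, the request body stays as small as the basic example when no extras are needed.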

Response format

{
  "success": true,
  "data": {
    "markdown": "# Page Title\n\nPage content in markdown...",
    "html": "<!DOCTYPE html><html>...</html>",
    "links": [
      "https://example.com/page1",
      "https://example.com/page2"
    ],
    "metadata": {
      "title": "Page Title",
      "description": "Page description",
      "language": "en",
      "sourceURL": "https://example.com",
      "statusCode": 200
    }
  }
}
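
A decoded response of this shape can be unpacked as sketched below. The helper name `extract_markdown` is hypothetical, and because the error format is not documented on this page, the sketch treats any body without a truthy `success` field as a failure.

```python
def extract_markdown(result):
    """Return (markdown, metadata) from a decoded response body.

    Assumes the shape shown above. The failure format is not documented
    here, so a missing or false "success" is treated as an error.
    """
    if not result.get("success"):
        raise RuntimeError("Scrape failed: %r" % (result,))
    data = result.get("data", {})
    return data.get("markdown", ""), data.get("metadata", {})


# A trimmed copy of the sample response above, used instead of a live call.
sample = {
    "success": True,
    "data": {
        "markdown": "# Page Title\n\nPage content in markdown...",
        "metadata": {"title": "Page Title", "statusCode": 200},
    },
}
markdown, metadata = extract_markdown(sample)
print(metadata["title"], metadata["statusCode"])  # Page Title 200
```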

Use cases

  • AI content processing: convert webpages to markdown for LLM training or RAG pipelines.
  • Content monitoring: monitor websites for changes and updates.
  • Data collection: extract content from web pages for analysis.
  • Link discovery: find all links on a page for crawling or SEO analysis.
  • Content archiving: save webpage content in clean markdown or HTML format.
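
Two of the use cases above, content monitoring and link discovery, can be sketched on top of the `markdown` and `links` outputs. The helpers `content_fingerprint` and `same_host_links` are hypothetical and operate on already-fetched data; comparing hashes is one simple way to detect changes, not a method prescribed by the API.

```python
import hashlib
from urllib.parse import urlparse


def content_fingerprint(markdown):
    """Hash the markdown so two scrapes can be compared cheaply."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()


def same_host_links(links, source_url):
    """Keep only links on the same host as the scraped page."""
    host = urlparse(source_url).netloc
    return [link for link in links if urlparse(link).netloc == host]


# Content monitoring: compare fingerprints of two scrapes of the same page.
previous = content_fingerprint("# Page Title\n\nOld content")
current = content_fingerprint("# Page Title\n\nNew content")
print("changed" if previous != current else "unchanged")  # changed

# Link discovery: keep only links on the scraped page's own host.
links = ["https://example.com/page1", "https://other.example.org/x"]
print(same_host_links(links, "https://example.com"))  # ['https://example.com/page1']
```

Storing only the fingerprint keeps monitoring state small, at the cost of not knowing what changed; keep the previous Markdown as well if you need a diff.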
