API Scraper

Documentation

Endpoint

POST https://api.i-as.dev/api/scrape

Body Request

The request body must be a JSON object containing the following parameters:

  • baseUrl: The target base URL from which data will be scraped.

    • If the target URL is paginated, put {page} in the URL where the page number belongs.
  • pages: The page number substituted for {page} (only needed when the target URL is paginated).

  • selector: A list of selector objects describing what to extract from the page. Each object has the following sub-parameters:

    • parent: The CSS selector for the parent element that encapsulates the desired data.
    • children: A list of child entries to extract from within the parent. Each child can use:
      • A direct element selector (e.g., h2, span).
      • An attribute marker, @{attr}, to fetch an attribute value rather than the element's text (e.g., @href, @src).
      • A custom name, written as <{custom-name}>, that sets the field name in the result.
      • A .replace operation to rewrite the extracted string (e.g., .replace('/assets/', '/pictures/')).
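To make the child syntax concrete, here is a minimal Python sketch that splits one child entry into its parts. It is only an illustrative client-side model: parse_child and ChildSpec are hypothetical names, the way the service actually interprets these strings is not documented here, and the <{custom-name}> form is not handled.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Matches a trailing .replace('old', 'new') call on a selector string.
_REPLACE_RE = re.compile(r"""\.replace\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)\s*$""")


class ChildSpec(NamedTuple):
    key: str                            # the key as written, e.g. "title" or "@src"
    attribute: Optional[str]            # attribute name for "@attr" keys, else None
    selector: str                       # CSS selector without the .replace(...) suffix
    replace: Optional[Tuple[str, str]]  # (old, new) pair, or None if absent


def parse_child(child: dict) -> ChildSpec:
    """Split a one-entry child object into its parts (hypothetical helper)."""
    if len(child) != 1:
        raise ValueError("each child is expected to have exactly one key")
    key, value = next(iter(child.items()))
    # Assumption: a key starting with "@" names the attribute to read.
    attribute = key[1:] if key.startswith("@") else None
    match = _REPLACE_RE.search(value)
    if match:
        selector = value[: match.start()]
        replace = (match.group(1), match.group(2))
    else:
        selector, replace = value, None
    return ChildSpec(key, attribute, selector, replace)


print(parse_child({"@src": "img.replace('/assets/', '/pictures/')"}))
```

Under these assumptions, a plain entry such as { "title": "h2" } yields a selector of h2 with no attribute and no replacement.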

Request Body Example

{
  "baseUrl": "https://i-as.dev/blog?page={page}",
  "pages": 1,
  "selector": [
    {
      "parent": "article",
      "children": [
        { "title": "h2" },
        { "date": "span.bg-gradient-to-r" },
        { "author": "a.bg-gradient-to-r" },
        { "@src": "img.replace('/assets/', '/pictures/')" },
        { "@href": "a:eq(1).replace('/blog/', 'news/')" }
      ]
    }
  ]
}
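The request body above could be sent from Python roughly as follows. This is a sketch using only the standard library. The Content-Type header is an assumption (the documentation does not state the required headers), and build_scrape_request is a hypothetical helper name.

```python
import json
import urllib.request

ENDPOINT = "https://api.i-as.dev/api/scrape"  # endpoint from this documentation


def build_scrape_request(body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the JSON body."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=data,
        headers={"Content-Type": "application/json"},  # assumed, not documented
        method="POST",
    )


body = {
    "baseUrl": "https://i-as.dev/blog?page={page}",
    "pages": 1,
    "selector": [
        {
            "parent": "article",
            "children": [
                {"title": "h2"},
                {"date": "span.bg-gradient-to-r"},
                {"author": "a.bg-gradient-to-r"},
                {"@src": "img.replace('/assets/', '/pictures/')"},
                {"@href": "a:eq(1).replace('/blog/', 'news/')"},
            ],
        }
    ],
}

request = build_scrape_request(body)

# To actually send the request, uncomment the lines below:
# with urllib.request.urlopen(request, timeout=30) as response:
#     result = json.load(response)
```

Building the request separately from sending it makes it easy to inspect the payload before any network call is made.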

In this example:

  • The baseUrl is the target URL for scraping, with {page} as a placeholder for pagination.
  • The pages parameter is set to 1 to scrape the first page.
  • The selector array contains data extraction instructions:
    • title: Extracted from the h2 element inside the article element.
    • date: Extracted from span.bg-gradient-to-r.
    • author: Extracted from the first a.bg-gradient-to-r inside the article.
    • images: Extracted from the src attribute of the img element (the @src entry), with /assets/ replaced by /pictures/.
    • link: Extracted from the href attribute of the second a element (the @href entry; a:eq(1) is zero-based), with /blog/ replaced by news/.
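The {page} placeholder can be pictured as plain string substitution. The sketch below assumes that is how the service expands baseUrl, which the documentation does not state explicitly. expand_page is a hypothetical helper.

```python
def expand_page(base_url: str, page: int) -> str:
    """Substitute the {page} placeholder with a page number (assumed behavior)."""
    return base_url.replace("{page}", str(page))


print(expand_page("https://i-as.dev/blog?page={page}", 1))
# prints: https://i-as.dev/blog?page=1
```

A URL without the placeholder passes through unchanged in this model.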

Response

The response is a JSON object whose data field holds an array of results. Each item in the array holds the fields extracted from one matched parent element (in the example above, one article).

{
  "data": [
    {
      "title": "string",
      "date": "string",
      "author": "string",
      "images": "string",
      "link": "string"
    }
  ]
}

Each data object will contain:

  • title: The extracted title text.
  • date: The extracted date text.
  • author: The extracted author name.
  • images: The modified image URL after applying the .replace() operation.
  • link: The modified URL after applying the .replace() operation.
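A response of this shape could be consumed as in the sketch below. The sample response text is invented for illustration and is not real output from the service. extract_items is a hypothetical helper.

```python
import json

# Hypothetical response text; every field value is an invented placeholder.
raw = """
{
  "data": [
    {
      "title": "Example title",
      "date": "Example date",
      "author": "Example author",
      "images": "/pictures/example.png",
      "link": "news/example-post"
    }
  ]
}
"""


def extract_items(response_text: str) -> list:
    """Return the list stored under the data field of a response body."""
    payload = json.loads(response_text)
    items = payload.get("data", [])
    if not isinstance(items, list):
        raise ValueError("expected 'data' to be an array")
    return items


for item in extract_items(raw):
    # .get() avoids a KeyError if a field is missing from an item.
    print(item.get("title"), "->", item.get("link"))
```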

Advanced Example Explanation

In the request body example above:

  • The baseUrl is set to https://i-as.dev/blog?page={page} to scrape the blog pages with pagination.
  • The pages parameter is set to 1 to scrape only the first page.
  • The selector defines how to extract the data:
    • parent: "article": Targets each article element as the parent.
    • children: Contains child elements that should be extracted from the parent article:
      • title: Extracted from the h2 tag inside the article.
      • date: Extracted from the span.bg-gradient-to-r element.
      • author: Extracted from the first a.bg-gradient-to-r element.
      • @src: Extracts the src attribute from the img tag and replaces /assets/ with /pictures/ in the image URL.
      • @href: Extracts the href attribute from the second a tag and replaces /blog/ with news/ in the URL.
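The effect of the two .replace operations can be previewed locally. The sketch below models them with Python's str.replace, which is an assumption: the documentation does not say whether the service replaces only the first occurrence or all of them. The input paths are made up, and apply_replace is a hypothetical helper.

```python
def apply_replace(value: str, old: str, new: str) -> str:
    """Model of the .replace step using Python's str.replace (an assumption)."""
    return value.replace(old, new)


# Hypothetical extracted values, for illustration only.
print(apply_replace("/assets/cover.png", "/assets/", "/pictures/"))
# prints: /pictures/cover.png
print(apply_replace("/blog/my-post", "/blog/", "news/"))
# prints: news/my-post
```

Note that replacing /blog/ with news/ removes the leading slash, so the resulting link is a relative path in this model.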

Together, these parameters let you scrape a paginated site and rewrite the extracted values in a single request.