Scraper

Documentation

Endpoint

POST https://api.i-as.dev/api/scrape

Request Body

The request body must be a JSON object containing the following parameters:

baseUrl: The base URL of the page to scrape.
- If the URL takes a page number, put {page} in the URL as a placeholder for it.
pages: The value for the page parameter, used for pagination when the target URL requires it.
selector: A list of selector objects that describe what to extract from the page. Each object has the following sub-parameters:
- parent: The CSS selector for the parent element that wraps the desired data.
- children: A list of child entries to extract from within the parent. Each child can use:
  - A direct element selector (e.g., h2, span).
  - An attribute key of the form @{attr} to fetch an attribute value (e.g., @href, @src).
  - A custom name written as <{custom-name}> to set the result field name.
  - A .replace operation to rewrite the extracted string (e.g., .replace('/assets/', '/pictures/')).
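
As a rough illustration of how these parameters fit together, the Python sketch below assembles a request body and previews the URL for one page. The helper names build_body and expand_page are hypothetical, and the local {page} substitution is only an assumption about how the service fills the placeholder; the documentation does not describe the server-side behaviour.

```python
import json


def expand_page(base_url, page):
    """Preview the URL for one page (assumed substitution, done locally)."""
    return base_url.replace("{page}", str(page))


def build_body(base_url, pages, parent, children):
    """Assemble the JSON-serialisable request body described above."""
    return {
        "baseUrl": base_url,
        "pages": pages,
        "selector": [{"parent": parent, "children": children}],
    }


body = build_body(
    "https://i-as.dev/blog?page={page}",
    1,
    "article",
    [{"title": "h2"}, {"date": "span.bg-gradient-to-r"}],
)
payload = json.dumps(body)  # string to send as the request body
preview = expand_page(body["baseUrl"], body["pages"])
print(preview)  # https://i-as.dev/blog?page=1
```

Building the body as a plain dictionary and serialising it once keeps the selector list easy to extend with more child entries.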

Request Body Example

{
  "baseUrl": "https://i-as.dev/blog?page={page}",
  "pages": 1,
  "selector": [
    {
      "parent": "article",
      "children": [
        { "title": "h2" },
        { "date": "span.bg-gradient-to-r" },
        { "author": "a.bg-gradient-to-r" },
        { "@src": "img.replace('/assets/', '/pictures/')" },
        { "@href": "a:eq(1).replace('/blog/', 'news/')" }
      ]
    }
  ]
}
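
One possible way to send this body from Python, using only the standard library, is sketched below. The Content-Type header, the absence of authentication, and the 30-second timeout are assumptions, since the documentation only states that the body must be JSON. The helper names make_request and scrape are hypothetical.

```python
import json
import urllib.request

ENDPOINT = "https://api.i-as.dev/api/scrape"

body = {
    "baseUrl": "https://i-as.dev/blog?page={page}",
    "pages": 1,
    "selector": [
        {
            "parent": "article",
            "children": [
                {"title": "h2"},
                {"date": "span.bg-gradient-to-r"},
                {"author": "a.bg-gradient-to-r"},
                {"@src": "img.replace('/assets/', '/pictures/')"},
                {"@href": "a:eq(1).replace('/blog/', 'news/')"},
            ],
        }
    ],
}


def make_request(request_body):
    """Build the POST request without sending it."""
    data = json.dumps(request_body).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=data,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


def scrape(request_body, timeout=30.0):
    """Send the request and decode the JSON response (network call)."""
    with urllib.request.urlopen(make_request(request_body), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = make_request(body)
# result = scrape(body)  # uncomment to perform the actual request
```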

In this example:

The baseUrl is the target URL for scraping, with {page} as a placeholder for pagination.
The pages parameter is set to 1 to scrape the first page.
The selector array contains data extraction instructions:
- title: Extracted from the h2 element inside the article element.
- date: Extracted from span.bg-gradient-to-r.
- author: Extracted from the first a.bg-gradient-to-r inside the article.
- images: The src attribute of the img tag (the @src entry), with /assets/ replaced by /pictures/ in the path.
- link: The href attribute of the second a tag (the @href entry, selected with a:eq(1)), with /blog/ replaced by news/ using .replace().
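
To make the effect of the two .replace operations concrete, the sketch below applies the same substitutions locally to made-up attribute values. It assumes the service performs a literal substring replacement; the input strings and the helper name apply_replace are hypothetical.

```python
def apply_replace(value, old, new):
    """Literal substring replacement (assumed to mirror the service's .replace)."""
    return value.replace(old, new)


# Hypothetical raw attribute values as they might appear in the page.
images = apply_replace("/assets/cover.png", "/assets/", "/pictures/")
link = apply_replace("/blog/my-post", "/blog/", "news/")

print(images)  # /pictures/cover.png
print(link)    # news/my-post
```

Note that the second replacement string in the example is news/ without a leading slash, so under this assumption the resulting link is relative.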

Response

The response is a JSON object with the scraped data in an array under the data field. Each item in the array represents one extracted record.

{
  "data": [
    {
      "title": "string",
      "date": "string",
      "author": "string",
      "images": "string",
      "link": "string"
    }
  ]
}

Each data object will contain:

title: The extracted title text.
date: The extracted date text.
author: The extracted author name.
images: The modified image URL after applying the .replace() operation.
link: The modified URL after applying the .replace() operation.
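
A minimal sketch of reading such a response in Python follows. The sample values are invented for illustration, the helper name parse_items is hypothetical, and defaulting missing fields to an empty string is a design choice of this sketch rather than documented behaviour.

```python
import json

FIELDS = ("title", "date", "author", "images", "link")


def parse_items(raw):
    """Return one dict per item in the data array, with every expected field."""
    payload = json.loads(raw)
    items = payload.get("data", [])
    return [{field: item.get(field, "") for field in FIELDS} for item in items]


# Invented sample in the documented response shape.
sample = json.dumps(
    {
        "data": [
            {
                "title": "Hello",
                "date": "2024-01-01",
                "author": "Jane",
                "images": "/pictures/cover.png",
                "link": "news/hello",
            }
        ]
    }
)

rows = parse_items(sample)
for row in rows:
    print(row["title"], row["link"])  # Hello news/hello
```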

Advanced Example Explanation

In the request body example above:

The baseUrl is set to https://i-as.dev/blog?page={page} to scrape the blog pages with pagination.
The pages parameter is set to 1 to scrape only the first page.
The selector defines how to extract the data:
- parent: "article": Targets each article element as the parent.
- children: Contains child elements that should be extracted from the parent article:
  - title: Extracted from the h2 tag inside the article.
  - date: Extracted from the span.bg-gradient-to-r element.
  - author: Extracted from the first a.bg-gradient-to-r element.
  - @src: Extracts the src attribute from the img tag and replaces /assets/ with /pictures/ in the image URL.
  - @href: Extracts the href attribute from the second a tag and replaces /blog/ with news/ in the URL.

This lets you scrape data from a website and transform the extracted strings in a single request.