POST https://api.i-as.dev/api/scrape
The request body must be a JSON object containing the following parameters:
baseUrl: The target base URL from which data will be scraped. If the URL contains a page parameter, use {page} as a placeholder.
pages: The value for the page parameter (used for pagination if the target URL requires it).
selector: A list of selectors used to extract data from the page. It consists of the following sub-parameters:
  parent: A CSS selector for the element that wraps each result (e.g., article).
  children: A list of objects, each describing one value to extract from inside the parent. The following forms are supported:
    A CSS selector for the element to extract (e.g., h2, span).
    @{attr} to fetch the attribute data (e.g., @href, @src).
    <{custom-name}> for result field names.
    A .replace operation to manipulate the string (e.g., .replace('/assets/', '/pictures/')).
{
  "baseUrl": "https://i-as.dev/blog?page={page}",
  "pages": 1,
  "selector": [
    {
      "parent": "article",
      "children": [
        { "title": "h2" },
        { "date": "span.bg-gradient-to-r" },
        { "author": "a.bg-gradient-to-r" },
        { "@src": "img.replace('/assets/', '/pictures/')" },
        { "@href": "a:eq(1).replace('/blog/', 'news/')" }
      ]
    }
  ]
}
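As a rough illustration, the request body above could be assembled and wrapped in a POST request with Python's standard library. This is only a sketch: the helper name build_scrape_request is hypothetical, and the Content-Type header is an assumption (the text only states that the body must be JSON). The request is built but not sent.

```python
import json
import urllib.request

API_URL = "https://api.i-as.dev/api/scrape"


def build_scrape_request(base_url, pages, selector):
    """Build (but do not send) a POST request for the scrape endpoint.

    Hypothetical helper; the Content-Type header is an assumption.
    """
    body = json.dumps(
        {"baseUrl": base_url, "pages": pages, "selector": selector}
    ).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_scrape_request(
    "https://i-as.dev/blog?page={page}",
    1,
    [
        {
            "parent": "article",
            "children": [
                {"title": "h2"},
                {"date": "span.bg-gradient-to-r"},
                {"author": "a.bg-gradient-to-r"},
                {"@src": "img.replace('/assets/', '/pictures/')"},
                {"@href": "a:eq(1).replace('/blog/', 'news/')"},
            ],
        }
    ],
)

# Sending requires network access:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Keeping the construction separate from the network call makes it easy to inspect the payload before anything is sent.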
In this example:
baseUrl is the target URL for scraping, with {page} as a placeholder for pagination.
The pages parameter is set to 1 to scrape the first page.
The selector array contains the data extraction instructions:
  title: Taken from the h2 element inside the article element.
  date: Taken from span.bg-gradient-to-r.
  author: Taken from a.bg-gradient-to-r inside the article.
  @src: Extracts the src attribute from the img tag and applies a string replace to modify the path.
  @href: Extracts the href attribute from the second a tag and modifies the URL using .replace().
The response is a JSON object with the scraped data in an array under the data field. Each item in the array represents one piece of extracted data.
{
  "data": [
    {
      "title": "string",
      "date": "string",
      "author": "string",
      "images": "string",
      "link": "string"
    },
    // More data
  ]
}
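A response of this shape could be read as follows. This is a minimal sketch: the helper name extract_items is hypothetical, and the sample values are made-up placeholders, not real output from the service.

```python
import json


def extract_items(response_text):
    """Return the list stored under the "data" field of a response body."""
    payload = json.loads(response_text)
    items = payload.get("data", [])
    if not isinstance(items, list):
        raise ValueError('expected "data" to be a list')
    return items


# Placeholder response body with invented values, for illustration only.
sample = json.dumps(
    {
        "data": [
            {
                "title": "Example post",
                "date": "2024-01-01",
                "author": "Jane Doe",
                "images": "/pictures/example.png",
                "link": "news/example-post",
            }
        ]
    }
)

for item in extract_items(sample):
    print(item["title"], item["link"])
```

Note that the "// More data" line in the response listing is illustrative; a strict JSON parser such as json.loads would reject a body that actually contained it.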
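Each data object will contain:
  title: The extracted title.
  date: The extracted date.
  author: The extracted author.
  images: The image src value after the .replace() operation.
  link: The href value after the .replace() operation.
In the advanced example provided: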
baseUrl is set to https://i-as.dev/blog?page={page} to scrape the blog pages with pagination.
The pages parameter is set to 1 to scrape only the first page.
selector defines how to extract the data:
  parent: "article": Targets each article element as the parent.
  children: Contains child elements that should be extracted from the parent article:
    title: Extracted from the h2 tag inside the article.
    date: Extracted from the span.bg-gradient-to-r element.
    author: Extracted from the first a.bg-gradient-to-r element.
    @src: Extracts the src attribute from the img tag and replaces /assets/ with /pictures/ in the image URL.
    @href: Extracts the href attribute from the second a tag and replaces /blog/ with news/ in the URL.
This allows you to dynamically scrape and manipulate the data from a website.
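
To make the effect of the {page} placeholder and the .replace operation concrete, the sketch below mimics them locally. It does not show how the service is implemented: the function names, the regular expression, and the choice to replace only the first occurrence are all assumptions made for illustration.

```python
import re

# Hypothetical pattern for "<css selector>.replace('<old>', '<new>')".
_REPLACE = re.compile(
    r"^(?P<sel>.+?)\.replace\('(?P<old>[^']*)',\s*'(?P<new>[^']*)'\)$"
)


def split_child(spec):
    """Split a child value into (css_selector, (old, new) or None)."""
    match = _REPLACE.match(spec)
    if match is None:
        return spec, None
    return match.group("sel"), (match.group("old"), match.group("new"))


def apply_replace(value, replace):
    """Apply the optional replace pair; assumes first occurrence only."""
    if replace is None:
        return value
    old, new = replace
    return value.replace(old, new, 1)


def expand_page(base_url, page):
    """Substitute the {page} placeholder with a concrete page value."""
    return base_url.replace("{page}", str(page))


selector, replace = split_child("a:eq(1).replace('/blog/', 'news/')")
print(selector)                                # a:eq(1)
print(apply_replace("/blog/post-1", replace))  # news/post-1
print(expand_page("https://i-as.dev/blog?page={page}", 1))
```

Because the replacement string in the example is news/ without a leading slash, a path such as /blog/post-1 becomes news/post-1, which is a relative path; use '/news/' in the request if an absolute path is wanted.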