The scrape configuration defines how PageSieve should interact with a webpage to extract data. It is validated using Zod schemas defined in packages/core/src/schema.ts.
ScrapeConfig

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `id` | `string` | Yes | `""` | Auto-generated ID. |
| `name` | `string` | No | - | Human-readable name for the config. |
| `schemaVersion` | `string` | Yes | `"2.0.0"` | Schema version. Incremented when breaking changes are made to the format. |
| `revision` | `number` | Yes | `1` | Revision number. Increment whenever the config is modified. |
| `url` | `string` | Yes | - | URL at which extraction starts. |
| `urlPattern` | `string` | No | - | Glob/pattern of URLs this config applies to. When omitted, the config applies only to the exact `url`. |
| `createdAt` | `string` | Yes | - | When the config was created. |
| `updatedAt` | `string` | Yes | - | When the config was last updated. Should change alongside `revision`. |
| `description` | `string` | No | - | Description of the config, e.g. what type of data it extracts. |
| `author` | `string` | No | - | Who made this config, typically a username. |
| `tags` | `string[]` | No | - | Content tags for easier categorization. |
| `selectors` | `SelectorGroup[]` | Yes | - | Defines what to extract from the page. |
| `options` | `ExtractionOptions` | Yes | - | Options that control how extraction is done. |
| `pagination` | `PaginationConfig` | Yes | - | Controls how to move to the next page. |
| `variables` | `VariableConfig` | No | - | - |
| `output` | `OutputConfig` | No | - | - |
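The relationship between `revision`, `updatedAt`, and `urlPattern` can be illustrated with a minimal sketch. The helpers `bumpRevision` and `appliesTo` are hypothetical and not part of PageSieve. The sketch also assumes that timestamps are ISO 8601 strings and that `*` is the only wildcard in `urlPattern`; neither is confirmed by the schema description above.

```typescript
// Subset of the ScrapeConfig fields; selectors, options, pagination,
// and the other fields are omitted for brevity.
interface ScrapeConfigMeta {
  id: string;
  schemaVersion: string;
  revision: number;
  url: string;
  urlPattern?: string;
  createdAt: string; // assumed to be an ISO 8601 timestamp
  updatedAt: string; // assumed to be an ISO 8601 timestamp
}

// Hypothetical helper: returns a copy with revision incremented and
// updatedAt refreshed, so the two fields always change together.
function bumpRevision<T extends ScrapeConfigMeta>(config: T, now: Date = new Date()): T {
  return { ...config, revision: config.revision + 1, updatedAt: now.toISOString() };
}

// Hypothetical helper: without urlPattern only the exact url matches.
// Assumes "*" is the only wildcard and matches any run of characters.
function appliesTo(config: ScrapeConfigMeta, pageUrl: string): boolean {
  if (config.urlPattern === undefined) {
    return pageUrl === config.url;
  }
  const source = config.urlPattern
    .split("*")
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")) // escape regex metacharacters
    .join(".*");
  return new RegExp(`^${source}$`).test(pageUrl);
}

const original: ScrapeConfigMeta = {
  id: "cfg_example",
  schemaVersion: "2.0.0",
  revision: 1,
  url: "https://example.com/products",
  urlPattern: "https://example.com/products*",
  createdAt: "2024-01-01T00:00:00.000Z",
  updatedAt: "2024-01-01T00:00:00.000Z",
};

// Same config without urlPattern: applies to the exact url only.
const exactOnly: ScrapeConfigMeta = {
  id: "cfg_exact",
  schemaVersion: "2.0.0",
  revision: 1,
  url: "https://example.com/products",
  createdAt: "2024-01-01T00:00:00.000Z",
  updatedAt: "2024-01-01T00:00:00.000Z",
};

const edited = bumpRevision(original, new Date("2024-02-01T00:00:00.000Z"));
```

Returning a copy rather than mutating in place keeps the previous revision available for comparison, which is a design choice of this sketch rather than a documented requirement.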
SelectorGroup

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `id` | `string` | Yes | `"g_${nodeid(6)}"` | Random, stable ID. Used to group results when multiple groups exist. |
| `name` | `string` | Yes | `"Group 1"` | - |
| `container` | `string` | No | - | Selector for the item container. When omitted, the group is page-level and is scraped once. |
| `fields` | `Field[]` | Yes | - | - |
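The difference between a container-based group and a page-level group can be sketched as follows. `createSelectorGroup`, `isPageLevel`, and `randomSuffix` are hypothetical names. `randomSuffix` is only a stand-in for the ID generator referenced in the default value, whose actual behavior is not described here, and numbering default names as "Group N" is an assumption based on the "Group 1" default.

```typescript
// Minimal stand-in for Field; only what this sketch needs.
interface FieldStub {
  name: string;
  selector: string;
}

interface SelectorGroup {
  id: string;
  name: string;
  container?: string;
  fields: FieldStub[];
}

// Hypothetical stand-in for the ID generator used in the default value.
function randomSuffix(length: number): string {
  const alphabet = "abcdefghijklmnopqrstuvwxyz0123456789";
  let out = "";
  for (let i = 0; i < length; i++) {
    out += alphabet.charAt(Math.floor(Math.random() * alphabet.length));
  }
  return out;
}

// Hypothetical factory: fills in the documented defaults for id and name.
// The id is generated once and then stored, which keeps it stable.
function createSelectorGroup(position: number, fields: FieldStub[], container?: string): SelectorGroup {
  const group: SelectorGroup = {
    id: `g_${randomSuffix(6)}`,
    name: `Group ${position}`, // assumption: names are numbered by position
    fields,
  };
  if (container !== undefined) {
    group.container = container;
  }
  return group;
}

// A group without a container is page-level and is scraped once.
function isPageLevel(group: SelectorGroup): boolean {
  return group.container === undefined;
}

// One result per ".product-card" element on the page.
const listing = createSelectorGroup(
  1,
  [{ name: "title", selector: "h2" }, { name: "price", selector: ".price" }],
  ".product-card",
);

// A single result for the whole page.
const pageInfo = createSelectorGroup(2, [{ name: "heading", selector: "h1" }]);
```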
Field

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `id` | `string` | Yes | `"f_${nodeid(6)}"` | Random, stable ID. |
| `name` | `string` | Yes | `""` | Field name, used as the column key in results. |
| `selector` | `string` | Yes | `""` | Selector for the data point. |
| `type` | `"single"`, `"multiple"`, or `"count"` | Yes | `"single"` | How to treat the selector: as an individual element, a collection, or a count. |
| `extract` | `"text"`, `"property"`, or `"attribute"` | Yes | `"text"` | Which part of the element to extract. |
| `attribute` | `string` | No | - | Which attribute to extract. Required when `extract === "attribute"`. |
| `property` | `"innerHTML"`, `"outerHTML"`, `"innerText"`, or `"textContent"` | No | - | Which property to extract. Required when `extract === "property"`. |
| `required` | `boolean` | Yes | `false` | Whether the absence of the element should be treated as an error. |
| `default` | `string` | No | - | Default value for this element when it cannot be extracted. |
| `datatype` | `"string"`, `"number"`, `"boolean"`, or `"date"` | No | - | Datatype to cast the result to. |
| `fields` | `Field[]` | No | - | Recursive sub-fields. Only valid when `type === "multiple"`. |
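The conditional rules in the Field descriptions (`attribute` with `extract === "attribute"`, `property` with `extract === "property"`, and sub-fields only with `type === "multiple"`) can be checked with a small recursive function. This is a plain TypeScript sketch for illustration; `validateField` is a hypothetical name, and the actual validation lives in the Zod schemas, which may express these rules differently.

```typescript
type FieldType = "single" | "multiple" | "count";
type ExtractKind = "text" | "property" | "attribute";
type DomProperty = "innerHTML" | "outerHTML" | "innerText" | "textContent";
type DataType = "string" | "number" | "boolean" | "date";

interface Field {
  id: string;
  name: string;
  selector: string;
  type: FieldType;
  extract: ExtractKind;
  attribute?: string;
  property?: DomProperty;
  required: boolean;
  default?: string;
  datatype?: DataType;
  fields?: Field[];
}

// Hypothetical validator: returns one message per violated rule,
// descending into sub-fields. An empty array means the field is valid.
function validateField(field: Field, path: string = field.name): string[] {
  const errors: string[] = [];
  if (field.extract === "attribute" && !field.attribute) {
    errors.push(`${path}: "attribute" is required when extract is "attribute"`);
  }
  if (field.extract === "property" && !field.property) {
    errors.push(`${path}: "property" is required when extract is "property"`);
  }
  if (field.fields !== undefined && field.type !== "multiple") {
    errors.push(`${path}: sub-fields are only valid when type is "multiple"`);
  }
  for (const child of field.fields || []) {
    errors.push(...validateField(child, `${path}.${child.name}`));
  }
  return errors;
}

// Valid: extracts the href attribute of a single link.
const linkField: Field = {
  id: "f_abc123",
  name: "link",
  selector: "a.details",
  type: "single",
  extract: "attribute",
  attribute: "href",
  required: true,
};

// Invalid twice over: no attribute name, and sub-fields on a "single" field.
const brokenField: Field = {
  id: "f_def456",
  name: "image",
  selector: "img",
  type: "single",
  extract: "attribute",
  required: false,
  fields: [
    { id: "f_ghi789", name: "alt", selector: "img", type: "single", extract: "text", required: false },
  ],
};

const linkErrors = validateField(linkField);
const brokenErrors = validateField(brokenField);
```

Collecting all messages instead of throwing on the first one lets a config editor show every problem at once.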
VariableConfig
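A variable is either a string, whose value is embedded in the config, or a secret, which suggests to the client that it should prompt the user for the value.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `type` | `"secret"` or `"string"` | Yes | - | Type of variable. |
| `value` | `string` | No | - | Value of a string variable. |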
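A client might resolve the two variable types along the following lines. This is a sketch under several assumptions: `resolveVariable` and `PromptFn` are hypothetical, the prompt is synchronous for simplicity (a real client would likely prompt asynchronously), and treating a string variable without a `value` as an error is a choice made here, not documented behavior.

```typescript
interface VariableConfig {
  type: "secret" | "string";
  value?: string;
}

// Hypothetical callback the client supplies to ask the user for a value.
type PromptFn = (variableName: string) => string;

// Hypothetical resolver: embedded strings are returned as-is,
// secrets are never read from the config and always prompted for.
function resolveVariable(name: string, variable: VariableConfig, prompt: PromptFn): string {
  if (variable.type === "secret") {
    return prompt(name);
  }
  if (variable.value === undefined) {
    // Assumption of this sketch: a string variable must carry a value.
    throw new Error(`Variable "${name}" has type "string" but no value`);
  }
  return variable.value;
}

// Stand-in for a real user prompt, so the example runs unattended.
const fakePrompt: PromptFn = (name) => `prompted:${name}`;

const region = resolveVariable("region", { type: "string", value: "eu" }, fakePrompt);
const apiKey = resolveVariable("apiKey", { type: "secret" }, fakePrompt);
```

The variable names `region` and `apiKey` are illustrative only.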
OutputConfig
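Hints for how results should be output.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `format` | `string` | Yes | `"json"` | Data format to use for results. |
| `mergeStrategy` | `string` | Yes | `"join"` | How to combine results from different field groups. |
| `flatten` | `boolean` | Yes | `true` | Whether to flatten nested columns. |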
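As an illustration of what flattening nested columns could look like, the sketch below collapses nested objects into dotted column names. The exact flattening rules are not specified here, so the dot separator, the handling of arrays (left untouched), and the names `flattenRow` and `applyOutput` are all assumptions.

```typescript
interface OutputConfig {
  format: string;
  mergeStrategy: string;
  flatten: boolean;
}

type Row = { [column: string]: unknown };

function isPlainObject(value: unknown): value is Row {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Hypothetical flattening: { price: { amount: 10 } } becomes { "price.amount": 10 }.
// Arrays and primitive values are copied through unchanged.
function flattenRow(row: Row, prefix: string = "", separator: string = "."): Row {
  const flat: Row = {};
  for (const key of Object.keys(row)) {
    const value = row[key];
    const column = prefix === "" ? key : `${prefix}${separator}${key}`;
    if (isPlainObject(value)) {
      const nested = flattenRow(value, column, separator);
      for (const nestedKey of Object.keys(nested)) {
        flat[nestedKey] = nested[nestedKey];
      }
    } else {
      flat[column] = value;
    }
  }
  return flat;
}

// Hypothetical helper: applies only the flatten hint; format and
// mergeStrategy are not interpreted in this sketch.
function applyOutput(row: Row, output: OutputConfig): Row {
  return output.flatten ? flattenRow(row) : row;
}

// Defaults as listed in the OutputConfig table.
const defaultOutput: OutputConfig = { format: "json", mergeStrategy: "join", flatten: true };

const nestedRow: Row = { title: "Widget", price: { amount: 10, currency: "USD" }, tags: ["a", "b"] };
const flatRow = applyOutput(nestedRow, defaultOutput);
```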