Scrape Configuration Reference

The scrape configuration defines how PageSieve should interact with a webpage to extract data. It is validated using Zod schemas defined in src/types.ts.

Metadata

The metadata section provides information about the configuration itself:

  • id: Unique identifier for the configuration.
  • description: Optional description of what the configuration does.
  • url: The target URL for this configuration.
  • version: Version number for the configuration.
  • author: Optional author name.
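
As an illustration of the fields above, the following is a minimal sketch of a metadata object in plain TypeScript. The authoritative definition is the Zod schema in src/types.ts; the interface name ScrapeMetadata, the helper missingMetadataFields, the string type chosen for version, and the example values are all assumptions made for this sketch.

```typescript
// Hypothetical shape mirroring the metadata fields listed above.
// The real schema lives in src/types.ts and may differ in detail.
interface ScrapeMetadata {
  id: string;
  description?: string;
  url: string;
  version: string; // assumed to be a string such as "1.0.0"
  author?: string;
}

// Example values are invented for illustration.
const exampleMetadata: ScrapeMetadata = {
  id: "books-catalog",
  description: "Extracts titles and prices from a book catalog.",
  url: "https://example.com/books",
  version: "1.0.0",
};

// Returns the names of required fields that are missing or empty.
function missingMetadataFields(meta: Partial<ScrapeMetadata>): string[] {
  const required: (keyof ScrapeMetadata)[] = ["id", "url", "version"];
  return required.filter((key) => !meta[key]);
}
```

Calling missingMetadataFields on a partial object gives a quick pre-check before handing the configuration to the real validator.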

Extraction Options

Customize how the extraction process behaves:

  • waitForNetworkIdle: Whether to wait for network activity to stop before starting extraction.
  • scrollToBottom: Whether to scroll to the bottom of the page before extraction.
  • runJavaScript: Whether to execute JavaScript on the page.
  • delayMs: Delay in milliseconds before extraction starts.
  • timeoutMs: Maximum time in milliseconds to wait for extraction to finish.
  • appendData: Whether to append new data to existing results or start fresh.
  • maxRetries: How many times to retry when a recoverable error is encountered.
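
To make the maxRetries option concrete, here is a minimal sketch of one plausible reading: the extraction runs once and is then retried up to maxRetries more times, but only for recoverable errors. The names RetryOptions, isRecoverable, and runWithRetries are hypothetical, and marking errors with a recoverable flag is an assumption of this sketch, not PageSieve's documented behavior.

```typescript
// Subset of the extraction options; the field name comes from the list above.
interface RetryOptions {
  maxRetries: number;
}

// Assumption: recoverable errors carry a `recoverable: true` flag.
// PageSieve's actual error classification may differ.
function isRecoverable(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { recoverable?: boolean }).recoverable === true
  );
}

// Runs `attempt` once, then up to `maxRetries` more times on recoverable errors.
function runWithRetries<T>(attempt: () => T, options: RetryOptions): T {
  let lastError: unknown;
  for (let tries = 0; tries <= options.maxRetries; tries++) {
    try {
      return attempt();
    } catch (err) {
      if (!isRecoverable(err)) throw err; // non-recoverable: fail immediately
      lastError = err;
    }
  }
  // Every allowed attempt failed with a recoverable error.
  throw lastError;
}
```

Under this reading, maxRetries: 2 allows at most three attempts in total. A real implementation would be asynchronous and would also honor delayMs and timeoutMs, which this sketch leaves out.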

Selectors

Selectors define the data fields to extract:

  • id: Unique ID for the selector.
  • name: The field name used in the exported data.
  • selector: The CSS or XPath selector for the element.
  • type: Either single (one match) or array (multiple matches).
  • description: Optional field description.

Selectors can be grouped using SelectorGroup, which can also specify a container selector to scope its fields.
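
The selector fields and grouping described above could be modeled roughly as follows. This is a sketch under stated assumptions: the property names container and fields on SelectorGroup, the helper findDuplicateSelectorIds, and the example selectors are hypothetical, and the real definitions are the Zod schemas in src/types.ts.

```typescript
type SelectorType = "single" | "array";

// Mirrors the selector fields listed above.
interface Selector {
  id: string;
  name: string;
  selector: string;
  type: SelectorType;
  description?: string;
}

// Hypothetical group shape: `container` and `fields` are assumed property names.
interface SelectorGroup {
  container?: string;
  fields: Selector[];
}

// Example group scoped to a product card; the selectors are invented.
const productGroup: SelectorGroup = {
  container: "div.product-card",
  fields: [
    { id: "title", name: "Title", selector: "h2.title", type: "single" },
    { id: "tags", name: "Tags", selector: "span.tag", type: "array" },
  ],
};

// Reports selector IDs that appear more than once across all groups,
// since each selector ID is expected to be unique.
function findDuplicateSelectorIds(groups: SelectorGroup[]): string[] {
  const counts: { [id: string]: number } = {};
  const duplicates: string[] = [];
  for (const group of groups) {
    for (const field of group.fields) {
      counts[field.id] = (counts[field.id] || 0) + 1;
      if (counts[field.id] === 2) duplicates.push(field.id);
    }
  }
  return duplicates;
}
```

A check like this can catch ID collisions between groups before a configuration is saved.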

Pagination

Configure how PageSieve navigates between pages:

  • none: No pagination.
  • next: Navigate using a “Next” button selector.
    • nextSelector: CSS or XPath selector for the next button.
    • maxPages: Optional limit for the number of pages to navigate.
  • links: Extract data from a pre-defined list of links.
    • pageLinks: A list of URLs to visit sequentially.
  • template: Generate page URLs based on a template.
    • urlTemplate: URL containing a {page} placeholder.
    • startPage: The starting page number (default: 1).
    • increment: The amount added to the page number at each step (default: 1).
    • maxPages: Maximum number of pages to scrape.
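
The four pagination modes map naturally onto a discriminated union. The sketch below assumes a discriminant property named type, treats maxPages as required for template pagination, and uses a hypothetical helper, listKnownUrls, to show how a urlTemplate might expand. None of these names are confirmed by the schema in src/types.ts.

```typescript
// Assumed discriminant: `type`. The real schema may name it differently.
type PaginationConfig =
  | { type: "none" }
  | { type: "next"; nextSelector: string; maxPages?: number }
  | { type: "links"; pageLinks: string[] }
  | {
      type: "template";
      urlTemplate: string;
      startPage?: number; // default: 1
      increment?: number; // default: 1
      maxPages: number;
    };

// Lists the page URLs that can be computed before any browsing starts.
// For "none" and "next" the result is empty: "next" pages are only
// discovered by following the button at run time.
function listKnownUrls(config: PaginationConfig): string[] {
  switch (config.type) {
    case "links":
      return config.pageLinks.slice();
    case "template": {
      const start = config.startPage ?? 1;
      const step = config.increment ?? 1;
      const urls: string[] = [];
      for (let i = 0; i < config.maxPages; i++) {
        const page = start + i * step;
        // split/join replaces every {page} occurrence, not just the first.
        urls.push(config.urlTemplate.split("{page}").join(String(page)));
      }
      return urls;
    }
    default:
      return [];
  }
}
```

For example, a template of https://example.com/list?page={page} with maxPages set to 3 and the defaults for startPage and increment expands to pages 1, 2, and 3. Setting startPage to 0 and increment to 10 would instead produce offsets 0, 10, and 20.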