Scraping Engine

The main workflow of the web extension is controlled by a state machine implemented in `src/scrapeMachine.ts`. The diagram below, created with Excalidraw, shows the states and transitions of that state machine. View SVG

[Figure: Scraping Engine Workflow state diagram, with transitions labeled 1 to 31]

Scraping Engine Workflow
  1. `START` event
  2. No pagination, complete.
  3. Dispatch pagination.
  4. Max pages reached, wait for `DELAY_MS` and move to completed.
  5. Pagination not complete, wait and move to the next page.
  6. Dispatch to the correct pagination handler.
  7. Template URL based pagination.
  8. Pagination succeeded, move to extraction.
  9. Navigation is in test mode or has completed.
  10. Failed to paginate, retry.
  11. Pagination failed completely.
  12. Link list based pagination.
  13. Navigation is in test mode or has completed.
  14. Pagination succeeded, move to extraction.
  15. Failed to paginate, retry.
  16. Pagination failed completely.
  17. Next button based pagination.
  18. Compute the hash of the current page.
  19. Try to click the next selector element to trigger navigation.
  20. Next page is in Single Page Application (SPA) mode, so compute the page hash again and compare it with the earlier one.
  21. Pagination succeeded, move to extraction.
  22. Navigation is in test mode or has completed.
  23. Failed to paginate, retry.
  24. Pagination failed completely.
  25. Navigation is in test mode or has completed.
  26. Pagination succeeded, move to extraction.
  27. Pagination failed completely.
  28. Retry pagination.
  29. Extraction failed, retry.
  30. Try extraction after waiting for delay.
  31. Extraction failed completely.
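
The retry and completion rules in the list above can be sketched as a small pure transition function. This is only an illustration, not the code in `src/scrapeMachine.ts`: every state name, event name, and context field below is hypothetical. The topology is also simplified: it assumes `START` leads straight to extraction of the first page, and the three pagination handlers are collapsed into a single `paginating` state. The `DELAY_MS` waits are left out; a caller would presumably wait before sending the next event.

```typescript
// Hypothetical names throughout; a simplified sketch of the workflow above.
type PaginationType = "none" | "templateUrl" | "linkList" | "nextButton";

type State = "idle" | "extracting" | "paginating" | "completed" | "failed";

type MachineEvent =
  | { type: "START" }
  | { type: "EXTRACT_OK" }
  | { type: "EXTRACT_FAIL" }
  | { type: "PAGINATE_OK" }
  | { type: "PAGINATE_FAIL" };

interface Context {
  pagination: PaginationType;
  page: number; // pages extracted so far
  maxPages: number;
  retries: number; // consecutive failures in the current state
  maxRetries: number;
}

interface Snapshot {
  state: State;
  ctx: Context;
}

// Stay in the same state while retries remain, otherwise fail completely.
function retryOrFail(snap: Snapshot): Snapshot {
  const { state, ctx } = snap;
  if (ctx.retries < ctx.maxRetries) {
    return { state, ctx: { ...ctx, retries: ctx.retries + 1 } };
  }
  return { state: "failed", ctx };
}

function transition(snap: Snapshot, ev: MachineEvent): Snapshot {
  const { state, ctx } = snap;
  if (state === "idle" && ev.type === "START") {
    return { state: "extracting", ctx };
  }
  if (state === "extracting") {
    if (ev.type === "EXTRACT_OK") {
      const page = ctx.page + 1;
      // Complete when there is no pagination or the page limit is reached.
      const done = ctx.pagination === "none" || page >= ctx.maxPages;
      return {
        state: done ? "completed" : "paginating",
        ctx: { ...ctx, page, retries: 0 },
      };
    }
    if (ev.type === "EXTRACT_FAIL") return retryOrFail(snap);
  }
  if (state === "paginating") {
    if (ev.type === "PAGINATE_OK") {
      return { state: "extracting", ctx: { ...ctx, retries: 0 } };
    }
    if (ev.type === "PAGINATE_FAIL") return retryOrFail(snap);
  }
  return snap; // events that do not apply to the current state are ignored
}

// Feed a sequence of events through the machine.
function run(start: Snapshot, events: MachineEvent[]): Snapshot {
  return events.reduce(transition, start);
}
```

For example, with `maxPages: 2` and `maxRetries: 1`, the sequence `START`, `EXTRACT_OK`, `PAGINATE_FAIL`, `PAGINATE_OK`, `EXTRACT_OK` ends in `completed` after one pagination retry. Keeping the function pure makes each numbered transition testable without a browser.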