Isoxya 3 deep dive: API

2022-02-07 · Isoxya

This is the first of a series of deep dives into Isoxya 3, the scalable web crawler and scraper released earlier this year. By giving a tour of the main components of both the free open-source and the commercial Pro editions, I hope to explain a little about the main concepts of the web crawling engine, to inspire ideas about its many possibilities, and to help people get started building their own apps or services on top of it. This first post is about Isoxya 3 API, which is open-source under the BSD-3 licence, and freely available on GitHub and Docker Hub.

Overview

Isoxya API is the central program within the Isoxya ecosystem. It interfaces with other components to provide centralised setup and control. Unlike some other crawling services, Isoxya is designed to be API-first, meaning it’s possible to control everything through well-defined API calls. The API principally follows REST, and uses JSON for easy interaction with other programming languages and services. It’s intended to be a backend service sitting behind a firewall, so doesn’t have the concept of users or authentication.

The API uses a SQLite embedded database for storage, and runs as a single process with its own embedded queues. Whilst this enables it to be installed with just a couple of commands and no external dependencies, it also limits scalability. Isoxya Pro provides an alternative API and other components, switching out SQLite for PostgreSQL and the embedded queues for RabbitMQ, and adding Redis as an in-memory cache. So, it’s possible to get started quickly on Isoxya, integrate with other products, and even develop custom extensions, but then add horizontal scaling and high availability with Isoxya Pro if needed.

Isoxya API contains embedded crawlers, processors, and streamers, to crawl sites, process their data through plugins, and stream the results to third-party services. Although most of Isoxya is written in Haskell, there’s no need to know Haskell to use Isoxya, or even to extend its capabilities when writing your own plugins; all that’s needed is to conform to the relevant interfaces.

Installation

Since pretty much everything is embedded, installation is usually as simple as downloading the container images from Docker Hub and running them. Database migrations are handled automatically, after which the API should boot and listen for requests. Isoxya is developed in multiple streams (a little like Debian), and typically you’ll want to use stable. That image tag is provided as a convenience, but in production you should normally pin it to a specific version (e.g. 3.0.2). Docker Compose files are provided so you can get up and running quickly; you can edit or extend these as desired.

cd misc/streams/stable/
docker-compose up
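
If you’ve pinned to a specific version for production, as suggested above, you can check that the tag pulls cleanly first; something like this should work (3.0.2 being the release current at the time of writing):

docker pull docker.io/isoxya/isoxya-api:3.0.2
docker pull docker.io/isoxya/isoxya-plugin-crawler-html:3.0.2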

This should create the Docker networks (change COMPOSE_PROJECT_NAME in .env if you want a different namespace), and pull the images for api, plugin_crawler_html, and plugin_nginx (more about the plugins in a minute):

Creating network "isoxya_api" with the default driver
Creating network "isoxya_processor" with the default driver
Creating network "isoxya_streamer" with the default driver
Pulling api (docker.io/isoxya/isoxya-api:stable)...
stable: Pulling from isoxya/isoxya-api
a024302f8a01: Already exists
0d69bb0a0de2: Pull complete
af5fc677fe4a: Pull complete
47835794a942: Pull complete
Digest: sha256:39d4bf2fd35705672689b786744b0244023d4d707af21b430cc3f2762a1eebbe
Status: Downloaded newer image for isoxya/isoxya-api:stable
Pulling plugin_crawler_html (docker.io/isoxya/isoxya-plugin-crawler-html:stable)...
stable: Pulling from isoxya/isoxya-plugin-crawler-html
a024302f8a01: Already exists
cce986a9951a: Pull complete
9e491dbb669c: Pull complete
4b02f8c26832: Pull complete
Digest: sha256:8924f35bfd685bddbb3f93818f0a42f821f9f4545bc7c4cfd075318921350314
Status: Downloaded newer image for isoxya/isoxya-plugin-crawler-html:stable
Creating isoxya_api_1                 ... done
Creating isoxya_plugin_crawler_html_1 ... done
Creating isoxya_plugin_nginx_1        ... done
Attaching to isoxya_plugin_crawler_html_1, isoxya_plugin_nginx_1, isoxya_api_1

Finally, Isoxya API and the example plugins should boot:

api_1                  | Isoxya API 3.0.2
plugin_crawler_html_1  | Isoxya plugin Crawler HTML 3.0.2
plugin_crawler_html_1  | Initializing App @ /
plugin_nginx_1         | /docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
plugin_nginx_1         | /docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
plugin_nginx_1         | /docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
plugin_crawler_html_1  | Initializing CrawlerHTML @ /
plugin_crawler_html_1  | 
plugin_nginx_1         | 10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf
plugin_nginx_1         | 10-listen-on-ipv6-by-default.sh: info: /etc/nginx/conf.d/default.conf differs from the packaged version
plugin_crawler_html_1  | Listening on http://0.0.0.0:80
plugin_nginx_1         | /docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
plugin_nginx_1         | /docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
plugin_nginx_1         | /docker-entrypoint.sh: Configuration complete; ready for start up
api_1                  | Initializing App @ /
api_1                  | Initializing API @ /
api_1                  | 
api_1                  | Listening on http://0.0.0.0:80

Initialisation

Although at this point you can make your own JSON requests to the API using curl or an alternative, you might want to use the example scripts until you’re comfortable with basic operations. These scripts are written in Bash, use curl and jq, and are open-source, so you can use them to learn about Isoxya or adjust them as needed.
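
The scripts are thin wrappers around plain HTTP, though; for instance, the apex call made below is equivalent to a single GET against the root (assuming the stack from earlier is listening on localhost):

curl -s http://localhost/ | jq .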

isoxya-api-init
endpoint [http://localhost]: 
{
  "time": "2022-02-07T12:36:05.722005922Z",
  "version": "3.0.2"
}

This simply initialises the .isoxya state directory in the current working directory, and calls the API apex. If you want to keep states separate whilst interacting with multiple stacks (e.g. development and production), simply call isoxya-api-init from multiple directories. Optionally, you can set ISOXYA_DEBUG=1 as an environment variable, and you’ll get much more information, such as which API calls were made; this can be useful when read alongside the routes documentation. Doing so and calling the API apex again:

export ISOXYA_DEBUG=1
isoxya-api-apex
* Rebuilt URL to: http://localhost/
*   Trying ::1...
* TCP_NODELAY set
* connect to ::1 port 80 failed: Connection refused
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 80 (#0)
> GET / HTTP/1.1
> Host: localhost
> User-Agent: curl/7.61.1
> Accept: */*
> 
< HTTP/1.1 200 OK
< server: Snap/1.1.2.0
< content-type: application/json
< date: Mon, 07 Feb 2022 12:40:48 GMT
< transfer-encoding: chunked
< 
{ [72 bytes data]
* Connection #0 to host localhost left intact
{
  "time": "2022-02-07T12:40:48.008083202Z",
  "version": "3.0.2"
}

To be quieter again, unset ISOXYA_DEBUG.

Processors

Isoxya API contains less crawling logic than you might think; in fact, it’s industry- and product-agnostic! Whereas many crawlers embed assumptions such as what’s necessary for SEO or some other purpose, Isoxya API focusses only on the interfaces and tooling necessary for running a crawl. Interacting with this core engine, plugins allow functionality to be defined and extended. There are two types of plugins: processors, which define how to parse and extract a page, and streamers, which define how to send data to a third-party API, data warehouse, or BI system. In the example above, you’ll see that the plugin_crawler_html service runs by default. Isoxya plugin Crawler HTML is open-source and extensible, and provides a core run loop for processing pages as HTML, gathering simple metadata such as HTTP response codes and which pages to crawl next; it merits an extended discussion in another post. All you need to know for now is that you’ll need to run it alongside the API and register it there. Fields marked (Pro) can be ignored, since they configure Isoxya API Pro instead.

isoxya-api-create-processor 
channels (Pro) [null]: 
tag [crawler-html]: 
url [http://isoxya-plugin-crawler-html.localhost/data]: 
{
  "href": "/processor/777577e4-4d36-4990-9d17-c5889cbd3652",
  "tag": "crawler-html",
  "url": "http://isoxya-plugin-crawler-html.localhost/data"
}
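
Again, the script is only wrapping an HTTP call. A raw equivalent might look something like the following sketch, where the collection route is inferred from the returned href (check the routes documentation if your version differs):

curl -s -X POST http://localhost/processor \
    -H 'Content-Type: application/json' \
    -d '{"tag": "crawler-html", "url": "http://isoxya-plugin-crawler-html.localhost/data"}' | jq .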

Streamers

As mentioned above, streamers define where and how to send processed data. If you have a third-party API, you can simply define an endpoint there, and avoid running an additional service. For this post, however, we’ll use the default plugin_nginx service, which is a simple instance of NGINX with a configuration to log payloads. As well as helping to provide an end-to-end example, this is also very useful when developing your own services. Isoxya plugin NGINX is also open-source, and requires registering in the API in the same way as processors.

isoxya-api-create-streamer 
channels (Pro) [null]: 
tag [nginx]: 
url [http://isoxya-plugin-nginx.localhost]: 
{
  "href": "/streamer/b8373c9a-56b8-465c-9cad-53fb92da2d3c",
  "tag": "nginx",
  "url": "http://isoxya-plugin-nginx.localhost"
}
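
Later, once a crawl is running, you can watch the payloads arrive simply by following the container logs of the service defined in the provided Compose file:

docker-compose logs -f plugin_nginx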

Sites

Now that the API, processors, and streamers have been installed and registered, it’s time to introduce the concept of sites. Sites are the actual websites being crawled or scraped. Unlike many crawlers, Isoxya is designed to crawl only a single site per crawl. This provides a number of advantages, such as well-defined boundaries for a site graph, and the possibility of crawling or checking other sites in parallel. Isoxya Pro takes this concept further, and allows for setting maximum rate limits and concurrent channels per site.

It’s worth bearing in mind that a site is considered unique according to its scheme, hostname (including any subdomain), and port. This means that https://example.com and http://example.com are considered separate sites, as are http://www.example.com and http://example.com. If redirects are set correctly, this shouldn’t be an issue (and if they’re not, that could be having an effect on your SEO anyway). In order to crawl a site, it needs to be registered first. This doesn’t actually do much within the API except normalise the URL, but within the helper script it also sets the current site in the state directory, for ease of operations.

isoxya-api-create-site 
channels (Pro) [null]: 
rate_limit (Pro) [null]: 
url [http://example.com]: 
{
  "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw",
  "url": "http://example.com:80"
}
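
Incidentally, the returned href appears simply to be the normalised URL encoded as Base64, which you can verify from the shell:

echo 'aHR0cDovL2V4YW1wbGUuY29tOjgw' | base64 -d
http://example.com:80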

Crawls

Now, we are finally ready to do our first crawl! In Isoxya, crawls are modelled as snapshots of a site. That means we just need to POST against a crawl with the desired settings, and the site will be crawled, the raw data sent to the processors, and the processed data broadcast through the streamers to the third-party API or system. Again, an example script takes care of this request for us (remember you can set the ISOXYA_DEBUG environment variable if you want to see how).

isoxya-api-create-crawl 
site.href [/site/aHR0cDovL2V4YW1wbGUuY29tOjgw]: 
agent (Pro) [null]: 
depth_max (Pro) [null]: 
list.href (Pro):
    0: null
    1: 
  [0]: 
  null
pages_max (Pro) [null]: 
processor_config [null]: 
processors.hrefs [/processor/777577e4-4d36-4990-9d17-c5889cbd3652]: 
streamers.hrefs [/streamer/b8373c9a-56b8-465c-9cad-53fb92da2d3c]: 
validate (Pro) [null]: 
{
  "began": "2022-02-07T13:12:58.104457547Z",
  "duration": null,
  "ended": null,
  "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crawl/2022-02-07T13:12:58.104457547Z",
  "pages": null,
  "processor_config": null,
  "processors": [
    {
      "href": "/processor/777577e4-4d36-4990-9d17-c5889cbd3652"
    }
  ],
  "progress": null,
  "site": {
    "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw",
    "url": "http://example.com:80"
  },
  "speed": null,
  "status": "pending",
  "streamers": [
    {
      "href": "/streamer/b8373c9a-56b8-465c-9cad-53fb92da2d3c"
    }
  ]
}

That might seem like a lot, but most of the fields are optional and require Isoxya Pro. The main thing of importance is that a JSON object is returned giving an href; this can be used in subsequent calls to determine the status of the crawl. Here we check it using an example script:

isoxya-api-read 
href:
    0: /processor/777577e4-4d36-4990-9d17-c5889cbd3652
    1: /streamer/b8373c9a-56b8-465c-9cad-53fb92da2d3c
    2: /site/aHR0cDovL2V4YW1wbGUuY29tOjgw
    3: 
    4: /site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crawl/2022-02-07T13:12:58.104457547Z
  [4]: 
  /site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crawl/2022-02-07T13:12:58.104457547Z
{
  "began": "2022-02-07T13:12:58.104457547Z",
  "duration": 0.606542453,
  "ended": "2022-02-07T13:12:58.711Z",
  "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crawl/2022-02-07T13:12:58.104457547Z",
  "pages": 1,
  "processor_config": null,
  "processors": [
    {
      "href": "/processor/777577e4-4d36-4990-9d17-c5889cbd3652"
    }
  ],
  "progress": 100,
  "site": {
    "href": "/site/aHR0cDovL2V4YW1wbGUuY29tOjgw",
    "url": "http://example.com:80"
  },
  "speed": 1.648689213844,
  "status": "completed",
  "streamers": [
    {
      "href": "/streamer/b8373c9a-56b8-465c-9cad-53fb92da2d3c"
    }
  ]
}

We can see here that the crawl has made 100% progress and in fact has already completed, that 1 page was detected in the site, and a speed of around 1.6 pages per second was reached. We can also see which processors and streamers were used. This on its own might not be too exciting, though; what about the actual processed data? For that, we turn to the plugin_nginx service, which has by now received data from the streamer:

plugin_nginx_1         | {"url":"http://example.com:80/","retrieved":"2022-02-07T13:12:58.201079239Z","processor":{"tag":"crawler-html","href":"/processor/777577e4-4d36-4990-9d17-c5889cbd3652"},"crawl":{"began":"2022-02-07T13:12:58.104457547Z","href":"/site/aHR0cDovL2V4YW1wbGUuY29tOjgw/crawl/2022-02-07T13:12:58.104457547Z"},"data":{"header":{"Content-Encoding":"gzip","Content-Type":"text/html; charset=UTF-8","Date":"Mon, 07 Feb 2022 13:12:58 GMT","Last-Modified":"Thu, 17 Oct 2019 07:18:26 GMT","Vary":"Accept-Encoding","Cache-Control":"max-age=604800","Server":"ECS (bsa/EB24)","X-Cache":"HIT","Expires":"Mon, 14 Feb 2022 13:12:58 GMT","Content-Length":"648","Etag":"\"3147526947\"","Accept-Ranges":"bytes","Age":"442987"},"method":"GET","status":200,"duration":0.5050255,"error":null},"site":{"url":"http://example.com:80","href":"/site/aHR0cDovL2V4YW1wbGUuY29tOjgw"}}
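
Since each payload is a single JSON document, it’s easy to slice with jq. For example, assuming you’ve captured the payload lines into a file (payloads.jsonl here is a hypothetical name), you could tabulate the URL, status, and fetch duration of each page:

jq -r '[.url, .data.status, .data.duration] | @tsv' payloads.jsonl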

Crawls revisited

The above is a small example, since http://example.com is a single-page site. To see something more interesting, we turn to https://www.tiredpixel.com (feel free to crawl this yourself if you’re experimenting). All of the processors and streamers can be kept the same, so we just need to register the new site, and create a new crawl.

isoxya-api-create-site 
channels (Pro) [null]: 
rate_limit (Pro) [null]: 
url [http://example.com]: https://www.tiredpixel.com
{
  "href": "/site/aHR0cHM6Ly93d3cudGlyZWRwaXhlbC5jb206NDQz",
  "url": "https://www.tiredpixel.com:443"
}
isoxya-api-create-crawl 
site.href [/site/aHR0cHM6Ly93d3cudGlyZWRwaXhlbC5jb206NDQz]: 
agent (Pro) [null]: 
depth_max (Pro) [null]: 
list.href (Pro):
    0: null
    1: 
  [0]: 
  null
pages_max (Pro) [null]: 
processor_config [null]: 
processors.hrefs [/processor/777577e4-4d36-4990-9d17-c5889cbd3652]: 
streamers.hrefs [/streamer/b8373c9a-56b8-465c-9cad-53fb92da2d3c]: 
validate (Pro) [null]: 
{
  "began": "2022-02-07T13:27:20.814211594Z",
  "duration": null,
  "ended": null,
  "href": "/site/aHR0cHM6Ly93d3cudGlyZWRwaXhlbC5jb206NDQz/crawl/2022-02-07T13:27:20.814211594Z",
  "pages": null,
  "processor_config": null,
  "processors": [
    {
      "href": "/processor/777577e4-4d36-4990-9d17-c5889cbd3652"
    }
  ],
  "progress": null,
  "site": {
    "href": "/site/aHR0cHM6Ly93d3cudGlyZWRwaXhlbC5jb206NDQz",
    "url": "https://www.tiredpixel.com:443"
  },
  "speed": null,
  "status": "pending",
  "streamers": [
    {
      "href": "/streamer/b8373c9a-56b8-465c-9cad-53fb92da2d3c"
    }
  ]
}

If you’re wondering why processors and streamers are arrays, that’s because it’s possible to process and stream crawls in multiple ways simultaneously. For example, it’s possible to use one plugin to crawl a site, another to spellcheck each page, and then a third to stream the results to Elasticsearch.
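
Requesting that is just a matter of passing multiple hrefs when creating the crawl. As a rough sketch (the second processor href here is hypothetical, standing in for a spellchecker you’d have registered yourself, and the route and body shape are inferred from the crawl hrefs and prompts shown above):

curl -s -X POST http://localhost/site/aHR0cHM6Ly93d3cudGlyZWRwaXhlbC5jb206NDQz/crawl \
    -H 'Content-Type: application/json' \
    -d '{
          "processors": [
            {"href": "/processor/777577e4-4d36-4990-9d17-c5889cbd3652"},
            {"href": "/processor/00000000-0000-0000-0000-000000000000"}
          ],
          "streamers": [
            {"href": "/streamer/b8373c9a-56b8-465c-9cad-53fb92da2d3c"}
          ]
        }' | jq .

If we look at Isoxya API, we can see the spiders at work: one line per page crawled, processed, and streamed: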

api_1                  | [Info] /site/aHR0cHM6Ly93d3cudGlyZWRwaXhlbC5jb206NDQz/crawl/2022-02-07T13:27:20.814211594Z CRL GET https://www.tiredpixel.com:443/2020/09/29/isoxya-plugin-crawler-html-20-open-source-preview/ 200
api_1                  | [Info] /site/aHR0cHM6Ly93d3cudGlyZWRwaXhlbC5jb206NDQz/crawl/2022-02-07T13:27:20.814211594Z PRO 777577e4-4d36-4990-9d17-c5889cbd3652 https://www.tiredpixel.com:443/2020/09/29/isoxya-plugin-crawler-html-20-open-source-preview/ 200
api_1                  | [Info] /site/aHR0cHM6Ly93d3cudGlyZWRwaXhlbC5jb206NDQz/crawl/2022-02-07T13:27:20.814211594Z STR b8373c9a-56b8-465c-9cad-53fb92da2d3c https://www.tiredpixel.com:443/2020/09/29/isoxya-plugin-crawler-html-20-open-source-preview/ 202
api_1                  | [Info] /site/aHR0cHM6Ly93d3cudGlyZWRwaXhlbC5jb206NDQz/crawl/2022-02-07T13:27:20.814211594Z CRL GET https://www.tiredpixel.com:443/2018/04/03/skimming-across-the-waters/ 200
api_1                  | [Info] /site/aHR0cHM6Ly93d3cudGlyZWRwaXhlbC5jb206NDQz/crawl/2022-02-07T13:27:20.814211594Z PRO 777577e4-4d36-4990-9d17-c5889cbd3652 https://www.tiredpixel.com:443/2018/04/03/skimming-across-the-waters/ 200
api_1                  | [Info] /site/aHR0cHM6Ly93d3cudGlyZWRwaXhlbC5jb206NDQz/crawl/2022-02-07T13:27:20.814211594Z STR b8373c9a-56b8-465c-9cad-53fb92da2d3c https://www.tiredpixel.com:443/2018/04/03/skimming-across-the-waters/ 202

We can also use the API to see the progress:

isoxya-api-read 
href:
    0: /processor/777577e4-4d36-4990-9d17-c5889cbd3652
    1: /streamer/b8373c9a-56b8-465c-9cad-53fb92da2d3c
    2: /site/aHR0cHM6Ly93d3cudGlyZWRwaXhlbC5jb206NDQz
    3: 
    4: /site/aHR0cHM6Ly93d3cudGlyZWRwaXhlbC5jb206NDQz/crawl/2022-02-07T13:27:20.814211594Z
  [4]: 
  /site/aHR0cHM6Ly93d3cudGlyZWRwaXhlbC5jb206NDQz/crawl/2022-02-07T13:27:20.814211594Z
{
  "began": "2022-02-07T13:27:20.814211594Z",
  "duration": 129.846788406,
  "ended": "2022-02-07T13:29:30.661Z",
  "href": "/site/aHR0cHM6Ly93d3cudGlyZWRwaXhlbC5jb206NDQz/crawl/2022-02-07T13:27:20.814211594Z",
  "pages": 151,
  "processor_config": null,
  "processors": [
    {
      "href": "/processor/777577e4-4d36-4990-9d17-c5889cbd3652"
    }
  ],
  "progress": 82,
  "site": {
    "href": "/site/aHR0cHM6Ly93d3cudGlyZWRwaXhlbC5jb206NDQz",
    "url": "https://www.tiredpixel.com:443"
  },
  "speed": 1.162909008791,
  "status": "pending",
  "streamers": [
    {
      "href": "/streamer/b8373c9a-56b8-465c-9cad-53fb92da2d3c"
    }
  ]
}

This shows us that the crawl is currently 82% complete, and that 151 pages have been discovered so far. As the crawl progresses, these figures will change to reflect additional pages. The speed is settling at a little over 1 page/s, which is a rate limit applied for safety (this limit is configurable per site in Isoxya Pro). Finally, we can also see some lines of data being streamed:

plugin_nginx_1         | 172.21.0.3 -  [07/Feb/2022:13:32:08 +0000] "POST / HTTP/1.1" 202 0 "" "" ""
plugin_nginx_1         | {"url":"https://www.tiredpixel.com:443/collection/posts/page/10/","retrieved":"2022-02-07T13:32:08.125682658Z","processor":{"tag":"crawler-html","href":"/processor/777577e4-4d36-4990-9d17-c5889cbd3652"},"crawl":{"began":"2022-02-07T13:27:20.814211594Z","href":"/site/aHR0cHM6Ly93d3cudGlyZWRwaXhlbC5jb206NDQz/crawl/2022-02-07T13:27:20.814211594Z"},"data":{"header":{"Content-Type":"text/html","Date":"Mon, 07 Feb 2022 13:32:08 GMT","Last-Modified":"Sun, 06 Feb 2022 16:36:39 GMT","Strict-Transport-Security":"max-age=15724800; includeSubDomains","Content-Length":"27779","ETag":"\"61fff917-6c83\"","Connection":"keep-alive","Accept-Ranges":"bytes"},"method":"GET","status":200,"duration":4.5672815e-2,"error":null},"site":{"url":"https://www.tiredpixel.com:443","href":"/site/aHR0cHM6Ly93d3cudGlyZWRwaXhlbC5jb206NDQz"}}
plugin_nginx_1         | 172.21.0.3 -  [07/Feb/2022:13:32:09 +0000] "POST / HTTP/1.1" 202 0 "" "" ""
plugin_nginx_1         | {"url":"https://www.tiredpixel.com:443/2012/11/14/constant-hum/","retrieved":"2022-02-07T13:32:09.17470377Z","processor":{"tag":"crawler-html","href":"/processor/777577e4-4d36-4990-9d17-c5889cbd3652"},"crawl":{"began":"2022-02-07T13:27:20.814211594Z","href":"/site/aHR0cHM6Ly93d3cudGlyZWRwaXhlbC5jb206NDQz/crawl/2022-02-07T13:27:20.814211594Z"},"data":{"header":{"Content-Type":"text/html","Date":"Mon, 07 Feb 2022 13:32:09 GMT","Last-Modified":"Sun, 06 Feb 2022 16:36:39 GMT","Strict-Transport-Security":"max-age=15724800; includeSubDomains","Content-Length":"30263","ETag":"\"61fff917-7637\"","Connection":"keep-alive","Accept-Ranges":"bytes"},"method":"GET","status":200,"duration":4.5782555e-2,"error":null},"site":{"url":"https://www.tiredpixel.com:443","href":"/site/aHR0cHM6Ly93d3cudGlyZWRwaXhlbC5jb206NDQz"}}
plugin_nginx_1         | 172.21.0.3 -  [07/Feb/2022:13:32:10 +0000] "POST / HTTP/1.1" 202 0 "" "" ""
plugin_nginx_1         | {"url":"https://www.tiredpixel.com:443/2012/11/26/you-play-jazz/","retrieved":"2022-02-07T13:32:10.223593991Z","processor":{"tag":"crawler-html","href":"/processor/777577e4-4d36-4990-9d17-c5889cbd3652"},"crawl":{"began":"2022-02-07T13:27:20.814211594Z","href":"/site/aHR0cHM6Ly93d3cudGlyZWRwaXhlbC5jb206NDQz/crawl/2022-02-07T13:27:20.814211594Z"},"data":{"header":{"Content-Type":"text/html","Date":"Mon, 07 Feb 2022 13:32:10 GMT","Last-Modified":"Sun, 06 Feb 2022 16:36:39 GMT","Strict-Transport-Security":"max-age=15724800; includeSubDomains","Content-Length":"30296","ETag":"\"61fff917-7658\"","Connection":"keep-alive","Accept-Ranges":"bytes"},"method":"GET","status":200,"duration":4.4647877e-2,"error":null},"site":{"url":"https://www.tiredpixel.com:443","href":"/site/aHR0cHM6Ly93d3cudGlyZWRwaXhlbC5jb206NDQz"}}

Unlike some other crawlers, with Isoxya results arrive almost immediately; there is no ‘crawl finalisation’ stage. That means that even for very large sites, you can typically start consuming the data within about a second, and each page will flow through the engine and plugins and arrive through the streamers until the crawl completes.
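
If you do want to block until everything has finished, a small polling loop suffices; a minimal sketch, assuming (as the read script suggests) that a crawl href answers a plain GET:

CRAWL=/site/aHR0cHM6Ly93d3cudGlyZWRwaXhlbC5jb206NDQz/crawl/2022-02-07T13:27:20.814211594Z
until [ "$(curl -s "http://localhost$CRAWL" | jq -r .status)" = "completed" ]; do sleep 5; done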

Summary

Isoxya API redefines flexibility when it comes to crawling and scraping. Instead of being single-purpose, it supports an ecosystem of plugins to define and extend core functionality, which can be written in any of a number of programming languages. The open-source nature of both Isoxya API and various example plugins allows you to use, examine, adapt, or replace individual components according to need or whim. Since both the API and the plugins operate over well-defined JSON interfaces, it’s possible to start off using Isoxya and later scale using Isoxya Pro if needed.