This post was originally published on the website of Pavouk OÜ (Estonia). On 2020-06-12, I announced that Pavouk OÜ was closing. The posts I wrote have been moved here.
We’re pleased to announce the release of Isoxya 1.2—the high-performance internet data processor and web crawler. This release introduces partial snapshots limited by maximum pages or maximum depth, user agent whitelists, a more efficient approach to multi-snapshot settings, and numerous optimisations and fixes.
Maximum Pages
In the interests of data completeness, we encourage taking full snapshots wherever possible. Sometimes, however, you only need a small preview, or want to test just a few pages. To support this, this release introduces limiting by maximum pages, set per snapshot. Where this setting is used, site crawling will stop soon after the limit is reached, data processors will complete their pending tasks, and the snapshot will be marked as limited, so it's clear that it doesn't show the full picture.
{
    …
    "pages_max": 10000,
    …
}
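To make the usage concrete, here is a minimal sketch of requesting a limited snapshot over the JSON API. The base URL, endpoint path, and surrounding payload shape are assumptions made purely for illustration; only the pages_max field comes from the example above.

    # Minimal sketch: request a snapshot limited to 10,000 pages.
    # NOTE: the API base URL, endpoint path, and payload shape are assumptions
    # for illustration; only pages_max comes from the example above.
    import requests

    API = "http://localhost:8000"   # assumed API base URL
    payload = {"pages_max": 10000}  # stop crawling soon after 10,000 pages

    resp = requests.post(API + "/snapshot", json=payload)  # hypothetical endpoint
    resp.raise_for_status()
    print(resp.json())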
Maximum Depth
Similar to maximum pages, this release introduces the concept of maximum depth, limiting how many hops away from the start page links will be followed. When a boundary page is processed, any subsequent links from that page won't be followed. This is not necessarily the same as a shortest-path graph calculation, and by default it is also not necessarily the same as the number of clicks, since browsers usually squash redirects silently. It can be very useful for reconnaissance before a complete snapshot.
{
    …
    "depth_max": 10,
    …
}
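To make the hop semantics concrete, here is a small Python sketch of a depth-limited, breadth-first crawl frontier. It illustrates the general idea only, not Isoxya's internals; the fetch_links and process functions are stubs supplied by the caller.

    from collections import deque

    def crawl_depth_limited(start_url, depth_max, fetch_links, process):
        # Breadth-first traversal with a depth cap: a page at depth == depth_max
        # is still fetched and processed, but links found on it are not followed.
        seen = {start_url}
        queue = deque([(start_url, 0)])
        while queue:
            url, depth = queue.popleft()
            process(url)                   # hand the page to data processors (stub)
            if depth >= depth_max:
                continue                   # boundary page: don't follow its links
            for link in fetch_links(url):  # links discovered on the page (stub)
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))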
User Agents
User agents are now fully supported at the organisation level. Unlike many other crawlers, Isoxya always requires a user agent, in the interests of respecting site owners and following robots.txt directives. Rather than having to set the user agent on every snapshot, user agents are now whitelisted per organisation, and each snapshot now requires a user agent to be specified.
{
    "href": "/org_user_agent/33d29d16-8e01-11e9-8001-0242ac180006",
    "org": {
        "href": "/org/c9fcdeab-8787-11e9-8001-0242ac150002"
    },
    "user_agent": "Isoxya (bot@pavouk.tech)"
}
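As an illustration, whitelisting a user agent for an organisation over the JSON API might look something like the sketch below. The base URL, endpoint path, and request shape are assumptions inferred from the response shown above, not documented calls; the org href and user agent string are those from the example.

    # Sketch: whitelist a user agent for an organisation.
    # NOTE: the base URL, endpoint path, and request shape are assumptions
    # inferred from the response above; check the API docs for the real call.
    import requests

    API = "http://localhost:8000"  # assumed API base URL
    payload = {
        "org": {"href": "/org/c9fcdeab-8787-11e9-8001-0242ac150002"},
        "user_agent": "Isoxya (bot@pavouk.tech)",
    }

    resp = requests.post(API + "/org_user_agent", json=payload)
    resp.raise_for_status()
    print(resp.json()["href"])  # e.g. "/org_user_agent/33d29d16-…"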
Maximum Cache Age
The maximum cache age setting controls the recency of data crawled and processed in a snapshot. At first sight, you might think you would always want the absolute latest data, but this is not always the case: for large crawls, the time between starting and completing a crawl can be very long, and sometimes it's desirable to reprocess data efficiently without fetching the pages from the site multiple times. This setting is now set once at the organisation level, but it's worth keeping in mind that if you crawl the same site twice within the same interval, the site won't be accessed by the duplicate crawl. This means we're able to remove the rate-limit entirely for such snapshots, as well as support advanced features like continuous crawling or predictive fetching of scheduled data.
{
    "cache_age_max": 7776000,
    …
}
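For reference, the value in the example above, read as seconds, works out to 7776000 / 86400 = 90 days; within that window, a duplicate crawl can reprocess pages from the cache rather than fetch them from the site again.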
Minimum Response Duration
Our experience is that many crawlers go far too fast, especially for small sites, and that the choice of which rate-limit is appropriate can be befuddling—or, perhaps, the temptation to get the data quicker, too great. Isoxya takes the unprecedented approach of not allowing rate-limits to be set directly; instead, we apply what we consider to be sensible defaults, and try to take into account any reasonable requests made to us by site owners. These settings apply across our entire network, not just for a single crawl, and also allow us to set different rate-limits for different related sites (e.g. for CDNs).
{
    "chans": 1,
    …
    "res_duration_min": {
        "denominator": 1,
        "numerator": 1
    },
    …
    "url": "http://example.com:80"
}
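As an illustration of how this appears in the API, res_duration_min in the example above is a fraction; read as numerator/denominator seconds, 1/1 gives a minimum response duration of 1 second, which, with chans set to 1 (presumably a single fetch channel), works out to at most roughly one page per second from http://example.com:80.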