This post was originally published on the website of Pavouk OÜ (Estonia). On 2020-06-12, I announced that Pavouk OÜ was closing. The posts I wrote have been moved here.
We’re pleased to announce the very first release of the open-source Isoxya plugin: Crawler HTML 1.0. It provides static HTML crawling for SEO and other internet data-processing activities. The plugin discovers the site graph as quickly as possible, leaving data extraction and other more costly operations to other plugins working in parallel. Used in combination with the proprietary Isoxya engine, it makes it possible to crawl entire websites, even those with millions of pages, and process them in myriad ways. Docker images are available and, as with Isoxya plugin: Spellchecker, released a few days ago, we’ve decided to release the plugin as open source (BSD-3 licence).
Lightweight crawler
When building crawlers, it can be difficult to find the right balance between complexity and simplicity. Without a certain base set of features, a crawler is of minimal use: either the data extracted is too limited, or the engine is not advanced enough to come close to how a search engine works ‘in the wild’. But every additional feature makes it harder to verify that the crawler operates correctly, and brings the usual increase in maintenance cost. Isoxya solves these problems by separating the core engine from how pages are processed, and by further separating operations used to control the crawl from operations used to extract or analyse data. This plugin provides a lightweight set of features allowing many of the pages of a site to be detected and traversed, without industry-specific data-extraction bloat. For example, this plugin allows the Isoxya Pickax Spellchecker to crawl millions of pages without any SEO-specific features in the loop.
Features
Link extraction
Links from the page are extracted, deduplicated, and fed back into Isoxya to control the crawl.
<a href="http://www.iana.org/domains/example">More information...</a>
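As a rough illustration of the idea (not the plugin’s actual implementation), link extraction with deduplication can be sketched in Python using only the standard library. The class and function names here are hypothetical, and only `<a href>` links are considered.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags, keeping first-seen order."""

    def __init__(self):
        super().__init__()
        self.links = []      # deduplicated links, in document order
        self._seen = set()   # fast membership check for deduplication

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors without an href, and links already recorded.
        if href and href not in self._seen:
            self._seen.add(href)
            self.links.append(href)


def extract_links(html):
    """Return the unique links found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Deduplicating before handing links back means the engine is not asked to consider the same URL many times for a single page, for example when a navigation menu repeats a link in the header and the footer.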
Header redirects
Header redirects are extracted, allowing Isoxya to resolve redirects which happen within a site. These are not flattened automatically, allowing for more complex types of analysis such as checking how many hops a browser has to make in order to load a page.
HTTP/1.1 301 Moved Permanently
Location: https://www.pavouk.tech/
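To make the hop-counting idea concrete, here is a minimal sketch that follows `Location` headers one hop at a time and keeps the whole chain rather than only the final URL. It assumes responses are supplied as an in-memory table (a hypothetical stand-in for real HTTP requests), and the function names and the hop limit are illustrative choices, not part of Isoxya.

```python
from urllib.parse import urljoin

# Status codes treated as redirects in this sketch.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def redirect_target(url, status, headers):
    """Return the absolute redirect target, or None if this is not a redirect."""
    if status not in REDIRECT_STATUSES:
        return None
    # Header names are case-insensitive, so normalise before the lookup.
    location = {k.lower(): v for k, v in headers.items()}.get("location")
    if location is None:
        return None
    # A Location value may be relative; resolve it against the current URL.
    return urljoin(url, location)


def redirect_chain(start, responses, limit=10):
    """Follow redirects hop by hop, returning every URL visited.

    `responses` maps URL -> (status, headers) and stands in for the network.
    `limit` caps the number of hops so that redirect loops terminate.
    """
    chain = [start]
    while len(chain) <= limit:
        status, headers = responses[chain[-1]]
        target = redirect_target(chain[-1], status, headers)
        if target is None:
            break
        chain.append(target)
    return chain
```

The number of hops is then `len(chain) - 1`, and because each intermediate URL is retained, it remains possible to report which specific redirect in the chain is responsible for an extra round trip.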
No-follow links
No-follow directives within page links are respected, similar to how Google discounts such links when calculating PageRank.
<a href="http://www.example.com/" rel="nofollow">Link text</a>
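One way this check could look, as a sketch only: the `rel` attribute is a space-separated list of tokens, so a link counts as no-follow here if any token equals `nofollow`, ignoring case. The class name and the two output lists are hypothetical; how the plugin itself reports no-followed links is not described in this post.

```python
from html.parser import HTMLParser


class FollowSplitter(HTMLParser):
    """Separates <a href> links into followed and no-followed lists."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollowed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        # rel holds space-separated tokens, e.g. rel="noopener nofollow".
        rel_tokens = (attrs.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollowed.append(href)
        else:
            self.followed.append(href)


def split_links(html):
    """Return (followed, nofollowed) link lists for an HTML string."""
    parser = FollowSplitter()
    parser.feed(html)
    parser.close()
    return parser.followed, parser.nofollowed
```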
Base tags
Base tags are read, allowing the base URL against which relative URLs are resolved to be overridden. The resolved links are then sent to Isoxya as absolute URLs.
<base href="http://www.example.com/">
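The effect of a base tag on relative links can be sketched with Python’s `urllib.parse.urljoin`. This is an illustration under a few assumptions: only the first `<base>` with an `href` is used (as in browsers), only `<a href>` links are collected, and the page URL `http://crawl.example/dir/page.html` in the usage note is made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseAndLinks(HTMLParser):
    """Records the first <base href> and all raw <a href> values."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if not href:
            return
        if tag == "base" and self.base_href is None:
            self.base_href = href  # later <base> tags are ignored
        elif tag == "a":
            self.hrefs.append(href)


def absolute_links(page_url, html):
    """Resolve every link on the page to an absolute URL."""
    parser = BaseAndLinks()
    parser.feed(html)
    parser.close()
    # Without a base tag, relative URLs resolve against the page's own URL.
    base = urljoin(page_url, parser.base_href) if parser.base_href else page_url
    return [urljoin(base, href) for href in parser.hrefs]
```

For a page fetched from `http://crawl.example/dir/page.html` containing `<a href="about">`, the link resolves to `http://crawl.example/dir/about`; with `<base href="http://www.example.com/">` present, the same link resolves to `http://www.example.com/about` instead.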
Meta robots no-follow tags
Meta tags giving no-follow directives are respected, resulting in every link on the page being no-followed.
<meta name="robots" content="nofollow">
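A page-level check of this kind might be sketched as follows. The `content` attribute is treated as a comma-separated, case-insensitive list of directives, and only the literal `nofollow` directive is recognised; other values, such as `none`, are deliberately left out of this sketch. The names are hypothetical.

```python
from html.parser import HTMLParser


class MetaRobots(HTMLParser):
    """Gathers directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").strip().lower() != "robots":
            return
        content = attrs.get("content") or ""
        # Directives are comma-separated, e.g. content="noindex, nofollow".
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def page_is_nofollow(html):
    """Return True if a meta robots tag asks for links not to be followed."""
    parser = MetaRobots()
    parser.feed(html)
    parser.close()
    return "nofollow" in parser.directives
```

When this returns `True`, every link extracted from the page would be treated as no-followed, regardless of its own `rel` attribute.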
Header X-Robots-Tag no-follow
Headers giving no-follow directives are also respected, resulting in every link on the page being no-followed.
X-Robots-Tag: nofollow
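The header variant can be sketched in the same spirit. This assumes response headers arrive as a list of `(name, value)` pairs, that the value is a comma-separated list of directives, and that values scoped to a specific user agent (for example `googlebot: nofollow`) are ignored; all of these are simplifications made for illustration, and the function name is hypothetical.

```python
def header_is_nofollow(headers):
    """Return True if any X-Robots-Tag header carries a nofollow directive.

    `headers` is an iterable of (name, value) pairs, so that repeated
    headers with the same name are all inspected.
    """
    for name, value in headers:
        # Header names are case-insensitive.
        if name.strip().lower() != "x-robots-tag":
            continue
        directives = [d.strip().lower() for d in value.split(",")]
        if "nofollow" in directives:
            return True
    return False
```

A crawler could combine this with the meta-tag check, treating the page as no-followed if either source says so.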