Isoxya plugin: Crawler HTML 2.0 open-source preview

2020-09-29 · Isoxya

A couple of weeks ago, I announced my plans for Isoxya 2, including a split between the existing closed-source ‘enterprise’ edition and a new open-source ‘community’ edition. I intend the community edition to make small single-node crawls possible, and to make it easier to develop Isoxya plugins in your programming language of choice. The JSON interfaces those plugins use will also be compatible with the large-scale, distributed, high-availability enterprise edition. True to that goal, I’m pleased to announce an open-source (BSD 3-Clause) preview of Isoxya plugin: Crawler HTML 2.0.

what is it?

Isoxya plugin: Crawler HTML uses Isoxya 2 JSON interfaces to provide a core run loop for the crawling engine, receiving data for each page post-request, parsing it as static HTML, constructing URL metadata, and responding with a set of outbound URLs.
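
As a rough illustration of that receive, parse, respond cycle, here is a minimal sketch in Python. The field names (`url`, `body`, `urls`) and the function `process_page` are made up for this example; they are not the Isoxya 2 JSON interface, and this is not how the plugin itself is written.

```python
import json
import re
from urllib.parse import urljoin

# Deliberately naive href matcher, to keep the sketch short; a real
# crawler would use a proper HTML parser.
HREF = re.compile(r'<a\s[^>]*href="([^"]*)"', re.IGNORECASE)


def process_page(request_json: str) -> str:
    """Take one fetched page as JSON and return its outbound URLs as JSON.

    Hypothetical shape: {"url": ..., "body": ...} in, {"urls": [...]} out.
    """
    page = json.loads(request_json)
    links = HREF.findall(page["body"])
    # Resolve each link against the page URL, dropping duplicates
    # while preserving order.
    urls = list(dict.fromkeys(urljoin(page["url"], href) for href in links))
    return json.dumps({"urls": urls})
```

The crawling engine would call something like this once per page, then queue whatever URLs come back.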

However, since Isoxya supports both processor and streamer plugins using the Isoxya interfaces, using this plugin isn’t actually required. That opens up more complex uses, such as extracting data from individual pages rather than crawling, or writing an alternative run loop altogether.

what does it support?

  • links parsed: <a href="http://example.com">link</a>

  • header redirects extracted: Location; HTTP status 301, 302, 303, 307, 308

  • no-follow links respected: <a href="http://www.iana.org/domains/example" rel="nofollow">

  • base tags used for relative links: <base href="http://www.example.com/">

  • meta robots no-follow tags respected: <meta name="robots" content="nofollow">

  • header X-Robots-Tag no-follow respected: X-Robots-Tag: nofollow
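
To make the list above concrete, here is a minimal sketch of those rules in Python, using only the standard library. It is not the plugin's code: `LinkParser`, `outbound_urls`, and the order in which the checks run are my own assumptions for illustration, and real-world handling (for example, user-agent prefixes in X-Robots-Tag) is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def _tokens(value, sep):
    """Split an attribute or header value into a set of lowercase tokens."""
    return {t.strip().lower() for t in (value or "").split(sep) if t.strip()}


class LinkParser(HTMLParser):
    """Collects followable links, the base href, and meta robots no-follow."""

    def __init__(self):
        super().__init__()
        self.base = None       # first <base href> seen, if any
        self.hrefs = []        # hrefs of <a> tags without rel="nofollow"
        self.nofollow = False  # set by <meta name="robots" content="nofollow">

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "base" and self.base is None and a.get("href"):
            self.base = a["href"]
        elif tag == "meta" and (a.get("name") or "").lower() == "robots":
            if "nofollow" in _tokens(a.get("content"), ","):
                self.nofollow = True
        elif tag == "a" and a.get("href"):
            # rel is a whitespace-separated token list
            if "nofollow" not in _tokens(a.get("rel"), None):
                self.hrefs.append(a["href"])


def outbound_urls(url, status, headers, body):
    """Return the absolute URLs a crawler would follow from one response."""
    headers = {k.lower(): v for k, v in headers.items()}
    # Redirects: follow the Location header instead of parsing the body.
    if status in REDIRECT_STATUSES and "location" in headers:
        return [urljoin(url, headers["location"])]
    # Page-level no-follow from the response header.
    if "nofollow" in _tokens(headers.get("x-robots-tag"), ","):
        return []
    parser = LinkParser()
    parser.feed(body)
    parser.close()
    # Page-level no-follow from the meta robots tag.
    if parser.nofollow:
        return []
    # Relative links resolve against <base href> when present.
    base = urljoin(url, parser.base) if parser.base else url
    return [urljoin(base, href) for href in parser.hrefs]
```

For a page served with status 200 and no special headers, `outbound_urls` returns every link except those marked rel="nofollow", resolved against the base tag if there is one; for a 301 with a Location header, it returns just the redirect target.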

is it ready?

Not quite yet, no; this is a preview of what is planned for this plugin in Isoxya 2. However, it’s based on stable Isoxya 1.5, which has reached a consistent interface, and which has already been used for crawling many live sites on the internet. There might be a couple of small changes to the JSON interfaces as work on Isoxya 2 progresses (likely just additional keys in JSON objects or small renames), but overall the preview code released today should give a pretty good idea of what it’s going to look like.

It’s worth mentioning that the other type of plugin, for streaming data in Isoxya, is separate from this one. An open-source preview of a plugin for streaming data into Elasticsearch is planned to follow shortly. I have loads of code left to work through, which will be updated, documented, and partly open-sourced as I go. Stay tuned!