I’m pleased to announce the release of Isoxya web crawler 2.1, an internet data processing system representing years of research into building next-generation crawlers and scrapers. Back in September 2020, I announced that a major new version of Isoxya was coming, and that it would be splitting off part of the commercial edition into a new open-source edition. Isoxya 2.0 was released in January 2021, along with 3 open-source plugins. Today’s release of Isoxya 2.1 completes delivery of that milestone, and for the very first time, introduces an open-source mini crawler with minimal dependencies—and usable absolutely free!
Community and Pro Editions
Isoxya now comes in two editions: Community Edition (CE), a free and open-source (BSD 3-Clause) mini crawler, suitable for small crawls on a single computer; and Pro Edition (PE), a commercial and closed-source distributed crawler, suitable for small, large, and humongous crawls on high-availability clusters of multiple computers. Both editions utilise flexible plugins, allowing numerous programming languages to be used to extend the core engine via JSON interfaces. Plugins written for Isoxya CE should typically scale to Isoxya PE with minimal or no changes. More details and licences are available on request.
Features
Feature | Community Edition (CE) | Pro Edition (PE) |
---|---|---|
Licence | open-source | commercial |
API | ✓ | ✓ |
CLI scripts | ✓ | ✓ |
Plugins | 3+ | 3+ |
· | ||
Authentication | ✗ | Tigrosa |
Database | SQLite | PostgreSQL |
Cache | ✗ | Redis |
Message broker | ✗ | RabbitMQ |
· | ||
High-availability | ✗ | ✓ |
Horizontal scaling | ✗ | ✓ |
Error recovery | ✗ | ✓ |
Resource management | ✗ | ✓ |
· | ||
Concurrent crawls | 1 | ∞¹ |
Pages/crawl | ∞²⁺³ | ∞¹ |
User-agents | 1³ | ∞¹ |
Rate-limit (reqs/s) | 1/10³ | ∞¹⁺⁴ |
· | ||
Robots.txt | ✗ | ✓ |
Crawl max pages | ✗ | ✓ |
Crawl max depth | ✗ | ✓ |
List crawls | ✗ | ✓ |
External link check | ✗ | ✓ |
Crawl cancellation | ✗ | ✓ |
Organisations | ✗ | ✓ |
· | ||
Crawler channels | 1 | ∞¹ |
Processor channels | 1 | ∞¹ |
Streamer channels | 1 | ∞¹ |
· | ||
OS variant | Linux | Linux |
Packaging | container | container |
Support | community | direct |
· | ||
Price | free | on request |
Features and limits are indicative only, not guarantees. ∞ indicates many, not infinite! ¹ depending on licence and infrastructure. ² no hard-limit, but small as single-process. ³ not configurable. ⁴ set globally per-site; configurable for on-prem only.
Availability
Open-source programs have been published to GitHub and Docker Hub: Isoxya CE (code, download), and the Crawler HTML (code, download), Elasticsearch (code, download), and Spellchecker (code, download) plugins. In addition, there are example scripts available, either as a demo, or for use on a CLI. Documentation for the Isoxya API and interfaces is also available. There’s a hands-on walkthrough, and a short video. If you’re interested in Isoxya PE, more details and licences are available on request.