I’m pleased to announce Isoxya 1.5, a web crawler and data processing system representing years of research into building a next-generation crawler. This release focusses on performance: it implements a new resource management system and moves most disk operations into memory. I’m also seeking industry partners to develop plugins and products on top of the Isoxya API.
Isoxya can process websites with tens of millions of pages, and extract and transform that data in myriad ways, including streaming it into Elasticsearch. Taking web crawling back to first principles, I’ve iteratively built what I believe to be one of the fastest and most scalable crawlers around, with a flexible plugin system able to interface with pretty much any modern programming language. It supports SEO, scientific literature search, spellchecking, data mining, machine learning, and much more; I’m not aware of anything quite like it on the market.
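To give a flavour of the streaming use case, here’s a minimal sketch of the idea in Python, assuming a local Elasticsearch instance; the index name and document shape are my own illustrations, not part of Isoxya’s plugin interface.

```python
import requests

# A sketch of the streaming idea: each page emitted by a crawl is indexed
# into Elasticsearch as it arrives. The URL, index name, and document shape
# are illustrative assumptions, not Isoxya specifics.
ES = "http://localhost:9200"

def stream_page(page: dict) -> None:
    # Index a processed page as a document via Elasticsearch's REST API.
    resp = requests.post(f"{ES}/crawl-data/_doc", json=page)
    resp.raise_for_status()

stream_page({"url": "https://example.com/", "status": 200, "title": "Example Domain"})
```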
expanded API
Isoxya is controlled via a JSON REST API, which is able to start crawls and monitor their progress. Account management is built on top of Tigrosa, a high-performance authentication and authorisation proxy I developed specially; it is split into organisations and users, and supports multi-account, multi-user access. This release expands the API by embedding common subresources such as organisation and site, increasing usability whilst decreasing round-trips.
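As a rough illustration of the client workflow, the sketch below starts a crawl and polls its progress; the base URL, endpoints, and field names are assumptions for illustration, not the documented Isoxya API.

```python
import requests

API = "http://localhost:8000"                  # assumed base URL, for illustration
HEADERS = {"Authorization": "Bearer <token>"}  # credential issued via Tigrosa

# Start a crawl (endpoint and payload are illustrative assumptions).
crawl = requests.post(
    f"{API}/crawl",
    json={"site": "https://example.com"},
    headers=HEADERS,
).json()

# Poll progress; embedded subresources such as organisation and site
# arrive inline in the response, saving extra round-trips.
status = requests.get(f"{API}/crawl/{crawl['id']}", headers=HEADERS).json()
print(status)
```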
new resource management system
Isoxya allocates dedicated resources for all web spiders and plugins. Whilst this has worked well overall, a round of real-world testing on the live internet showed that the system struggled to allocate new resources whilst cleaning up unused ones. These operations have now been redesigned in a new resource management system, which is able to run these tasks in parallel.
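As a minimal sketch of the general pattern (not Isoxya’s internals), allocation and cleanup can run as independent tasks rather than blocking one another:

```python
import asyncio

async def allocate(queue: asyncio.Queue) -> None:
    # Provision dedicated resources for each spider or plugin as requested.
    while True:
        name = await queue.get()
        await asyncio.sleep(0.1)  # simulate provisioning work
        print(f"allocated resources for {name}")
        queue.task_done()

async def clean_up(interval: float = 0.2) -> None:
    # Periodically sweep unused resources, independently of allocation.
    while True:
        print("sweeping unused resources")
        await asyncio.sleep(interval)

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    for name in ("spider-1", "spider-2", "plugin-1"):
        queue.put_nowait(name)
    tasks = [asyncio.create_task(allocate(queue)),
             asyncio.create_task(clean_up())]
    await queue.join()  # allocation completes whilst cleanup runs in parallel
    for task in tasks:
        task.cancel()

asyncio.run(main())
```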
optimised external link validation
To support the SEO use case, Isoxya is able to validate external links found within a site. The way it does this is innovative: during a crawl, it uses the site lists feature to build up multiple sets of URLs to be checked, one per external site. Once the crawl has completed, new depth-1 crawls are launched for each of these, supplementing the data which has already been streamed live to external services. Since Isoxya is a distributed crawler, it’s able to run these secondary crawls in parallel, applying any site-specific rate-limits independently. Previously, this external link validation process needed manual intervention, but in this release it’s been fully automated.
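As an illustration of the grouping step (a sketch of the idea, not Isoxya’s site lists implementation), external URLs can be bucketed into one set per site, each of which would then seed its own depth-1 crawl:

```python
from collections import defaultdict
from urllib.parse import urlparse

# External links discovered during a crawl (illustrative data).
external_links = [
    "https://example.org/a",
    "https://example.org/b",
    "https://example.net/x",
]

# Build one set of URLs per external site, mirroring the site lists idea.
site_lists: dict[str, set[str]] = defaultdict(set)
for url in external_links:
    site_lists[urlparse(url).netloc].add(url)

# Each list would seed an independent depth-1 crawl, run in parallel with
# its own site-specific rate-limit.
for site, urls in sorted(site_lists.items()):
    print(f"depth-1 crawl for {site}: {sorted(urls)}")
```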
disk operations into memory
Over the years, Isoxya has grown incrementally, and much of the approach used in Isoxya 1.0 has been improved dramatically. The initial design hinged on caching at multiple layers, which required storing intermediate results on disk. Given what Isoxya has grown into, this feature seems less useful than initially hoped: up-to-date data is usually what’s required, and partial updates and historical archives are easily handled by external systems. In one of the largest performance optimisations in the history of Isoxya, most disk operations have now been moved into memory, reducing contention and decreasing overall latency, without sacrificing resilience.
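As a toy contrast (general Python, not Isoxya code), an intermediate result can be held in a memory buffer instead of being spooled to a temporary file on disk:

```python
import io
import json

def process(page: dict) -> io.BytesIO:
    # Previously, an intermediate result like this might have been written
    # to a temporary file; a memory buffer avoids disk contention entirely.
    buf = io.BytesIO()
    buf.write(json.dumps(page).encode("utf-8"))
    buf.seek(0)
    return buf

print(process({"url": "https://example.com/", "status": 200}).read().decode("utf-8"))
```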
industry partners
I’m now looking to find industry partners who would like to work with me to develop plugins and products on top of the Isoxya API. If this sounds like something you might be interested in, you can get in touch with me via the links in the footer. Don’t forget also to subscribe to my newsletter, to hear the latest about the various projects I’m working on, such as those listed on my blog.