This post was originally published on the website of Pavouk OÜ (Estonia). On 2020-06-12, I announced that Pavouk OÜ was closing. The posts I wrote have been moved here.
Since releasing the open-source Isoxya plugin: Elasticsearch 1.1 in September 2019, we’ve improved our design to better support multiple organisations and large amounts of crawling data. We’re pleased to announce version 1.2, which changes our index structure to make better use of Elasticsearch, upgrades to the latest Elasticsearch 7.6, and updates various libraries.
Time-based indices
Although we previously used time-based indices, these were namespaced by site snapshot. This was fine for large sites, such as those with millions of pages that we’ve been testing with. For small sites with just a few hundred or a few thousand pages, however, it used Elasticsearch inefficiently, leaving only a few documents in each index. We’ve re-namespaced the indices so that multiple sites share the same indices whilst still keeping a time-based component, which allows efficient deletion or storage class changes using Elasticsearch index lifecycle policies.
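To make the idea concrete, here is a minimal sketch of how a shared, time-bucketed index name might be built. The `isoxya` prefix, the monthly bucket, and the `index_name` helper are all assumptions for illustration; they are not the plugin's actual naming scheme.

```python
from datetime import datetime, timezone


def index_name(org_id: str, when: datetime, prefix: str = "isoxya") -> str:
    """Build a hypothetical organisation-scoped, month-bucketed index name.

    The site is deliberately left out of the name, so documents from many
    small sites land in the same index for a given organisation and month.
    """
    # Elasticsearch index names must be lowercase.
    return f"{prefix}-{org_id.lower()}-{when:%Y.%m}"


# Two crawls for the same organisation in the same month share one index,
# whichever site they came from.
first = index_name("org1", datetime(2020, 3, 5, tzinfo=timezone.utc))
second = index_name("org1", datetime(2020, 3, 28, tzinfo=timezone.utc))
```

Because the time component stays in the name, a lifecycle policy can still delete whole indices or move them to another storage class once they age out.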
Organisation-scoped indices
As part of re-namespacing the indices, we also realised that scoping by site snapshot wasn’t the most convenient for multi-tenant storage clusters. For private Elasticsearch clusters this wouldn’t be a problem, but agencies or other intermediate data providers might well want to give each organisation access to only certain data. Because Isoxya doesn’t place specific restrictions on the number of sites which can be crawled (unlike some other crawlers, which restrict by number of ‘projects’ or similar), including the site in the namespace meant that lots of rules were needed to authorise each index pattern. By moving to organisation-scoped indices, it’s now possible to use Kibana role-based access control to authorise all indices for an organisation with a single wildcard pattern, reducing complexity.
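As a rough illustration of the single-wildcard approach, the sketch below builds a role body shaped like an Elasticsearch role definition (an `indices` list with `names` and `privileges`) and checks which index names its pattern would cover. The `isoxya-<org>-<date>` naming, the `org_role` and `role_allows` helpers, and the `read` privilege choice are assumptions for the example; the local `fnmatch` check only approximates how Elasticsearch evaluates wildcards.

```python
from fnmatch import fnmatchcase


def org_role(org_id: str, prefix: str = "isoxya") -> dict:
    """Return a hypothetical read-only role body covering one organisation."""
    return {
        "indices": [
            {
                # One wildcard pattern covers every index for the organisation,
                # however many sites or time buckets it ends up with.
                "names": [f"{prefix}-{org_id}-*"],
                "privileges": ["read"],
            }
        ]
    }


def role_allows(role: dict, index: str) -> bool:
    """Approximate locally whether any pattern in the role matches the index."""
    return any(
        fnmatchcase(index, pattern)
        for entry in role["indices"]
        for pattern in entry["names"]
    )


role = org_role("org1")
own_index = role_allows(role, "isoxya-org1-2020.03")
other_index = role_allows(role, "isoxya-org2-2020.03")
```

With site-scoped names, the `names` list would need one entry per site; here it stays at a single pattern regardless of how many sites an organisation crawls.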