This post was originally published on the website of Pavouk OÜ (Estonia). On 2020-06-12, I announced that Pavouk OÜ was closing. The posts I wrote have been moved here.
We’re pleased to announce open-source Isoxya plugin: Elasticsearch 1.1, a plugin which streams data from the Isoxya crawler to the Elasticsearch database. This release adds various metadata from Isoxya data-processing plugins, extends the metadata available through Isoxya data-streaming plugins, and uses the Elasticsearch Bulk API to optimise data-streaming from pages generating large numbers of documents (e.g. Isoxya plugin: Spellchecker). Docker images are available, and the plugin is open-source (BSD-3 licence).
Isoxya Pickax metadata
We’ve extended the metadata provided from Isoxya data-processing plugins through the ‘droplet’ data structure, to add the following:
org_pick.href
: href of the organisation pickax

org_pick.tag
: tag of the organisation pickax; e.g. spellchecker

site.href
: href of the site

site.url
: URL of the site; already present as part of the url absolute URL, but provided for easier querying

site_snap.t_begin
: time the site snapshot began

t_retrieval
: replaces the previous t_retrieved, merely renamed
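Put together, a document streamed to Elasticsearch might carry metadata roughly along these lines. This is a minimal sketch only: the hrefs, timestamps, and nested layout below are assumptions for illustration, not output from a real deployment.

```python
import json

# Hypothetical sketch of the metadata fields on a streamed document; every
# value below is invented, and the real documents may lay fields out differently.
example_document = {
    "org_pick": {
        "href": "/org/acme/pick/0",        # href of the organisation pickax (hypothetical)
        "tag": "spellchecker",             # tag of the data-processing plugin
    },
    "site": {
        "href": "/site/0",                 # href of the site (hypothetical)
        "url": "http://example.com",       # site URL, duplicated for easier querying
    },
    "site_snap": {
        "t_begin": "2020-01-01T00:00:00Z", # time the site snapshot began
    },
    "t_retrieval": "2020-01-01T00:05:00Z", # renamed from the previous t_retrieved
    "url": "http://example.com/about",     # absolute page URL, of which site.url is a part
}

print(json.dumps(example_document, indent=2))
```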
Improved type detection
Most data-processing plugins result in one Elasticsearch document per page. However, it is possible, and indeed helpful, for some data structures to generate more than one document, particularly where the structure results in an array of objects. Attempting to load such data into Elasticsearch and querying it via Kibana shows warnings, as it can’t be indexed properly. For data coming from Isoxya plugin: Spellchecker, a document is generated for each spelling suggestion. Previously, this was detected by deeply inspecting the JSON supplied via the ‘droplet’ datatype; with the new org_pick.tag, however, this is no longer necessary, so documents are mapped more clearly and efficiently.
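As a rough illustration of the idea (a Python sketch, not the plugin’s own code), document mapping can now be routed on org_pick.tag rather than by deep inspection of the droplet JSON; the tag set, the data key, and the function below are hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical set of plugin tags whose 'droplets' expand into multiple
# Elasticsearch documents (one per array element), such as one document
# per spelling suggestion for the spellchecker plugin.
MULTI_DOCUMENT_TAGS = {"spellchecker"}


def droplet_to_documents(droplet: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Map one page 'droplet' to the Elasticsearch documents it should become.

    Routing on org_pick.tag avoids deeply inspecting the droplet JSON to
    guess whether its payload is an array of objects.
    """
    tag = droplet["org_pick"]["tag"]
    data = droplet.get("data")  # hypothetical key holding the plugin payload
    if tag in MULTI_DOCUMENT_TAGS and isinstance(data, list):
        # One document per array element, each carrying the shared metadata.
        meta = {k: v for k, v in droplet.items() if k != "data"}
        return [{**meta, "data": item} for item in data]
    # Default: one document per page.
    return [droplet]
```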
Elasticsearch Bulk API
Because some data-processing plugins generate multiple Elasticsearch documents per page, using the Elasticsearch Index API directly could result in a lot of network requests. For Isoxya plugin: Spellchecker, if there were 50 spelling suggestions on a page, this would result in 50 separate network requests being made to Elasticsearch as part of the data-streaming for that page ‘droplet’. Obviously, this is far from ideal. This ‘pipeline’ now uses the Elasticsearch Bulk API instead, which loads all the data from one page in a single network request, regardless of how many documents it generates. Not only is this far quicker, it is also gentler on the Elasticsearch database.
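The following sketch shows the same pattern using the official Elasticsearch Python client; the plugin itself doesn’t work exactly like this, and the index name and document shape are assumptions.

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")  # assumed local test instance


def stream_page(docs, index="isoxya-spellchecker"):
    """Index every document generated from one page 'droplet' via the Bulk API,
    instead of making one Index API request per document.

    `docs` is an iterable of (doc_id, source) pairs; the index name and
    document shape here are assumptions, not the plugin's configuration.
    """
    actions = (
        {"_index": index, "_id": doc_id, "_source": source}
        for doc_id, source in docs
    )
    # Sends the page's documents as bulk actions, batched into far fewer
    # HTTP requests than one-per-document indexing.
    return bulk(es, actions)
```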
As part of this work, the document ID mechanism was also upgraded, to generate a consistent ID per document, rather than leaving it to Elasticsearch to create automatically. This makes the ‘pipeline’ far more resilient to certain types of failures, since even interrupted network requests—which reprocess automatically after networking is restored—shouldn’t result in duplicate documents being created. The scope for this, like for everything in Isoxya data-streaming, is at the site snapshot level—i.e. data is supplied for a page once within the snapshot, and not again (noting that a subsequent snapshot might use the same underlying page data).
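One way to picture a consistent per-document ID (a sketch only; the post doesn’t describe the plugin’s actual scheme) is to derive it deterministically from the site snapshot, the page, and the document’s position within the page’s droplet, so that a retried request writes to the same IDs instead of creating duplicates.

```python
import hashlib


def document_id(site_snap_href: str, page_url: str, n: int) -> str:
    """Derive a deterministic Elasticsearch document ID from the site snapshot,
    the page, and the document's position within that page's 'droplet'.

    Re-streaming the same page within the same snapshot then overwrites the
    same IDs rather than creating duplicates. The inputs and hashing scheme
    are illustrative assumptions, not the plugin's actual ID format.
    """
    key = f"{site_snap_href}|{page_url}|{n}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()
```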