Isoxya plugin: Elasticsearch 2.0 open-source preview

2020-10-01 · Isoxya

Hot on the heels of this week’s open-source preview of Isoxya plugin: Crawler HTML 2.0 (a processor plugin for the Isoxya web crawler), I’m pleased to announce the open-source (BSD 3-Clause) preview of Isoxya plugin: Elasticsearch 2.0. This is the first preview of a streamer plugin using the JSON interfaces planned for Isoxya 2, and it is intended to be compatible with both the open-source ‘community’ and closed-source ‘enterprise’ editions.

You can also take a look at how the new Isoxya plugins page is developing, with examples of the JSON payloads and links to the source code.

what is it?

Isoxya plugin: Elasticsearch uses Isoxya 2 JSON interfaces to stream data into an Elasticsearch cluster, making it possible to query using all the normal features provided by Elasticsearch and Kibana. Using this, it’s possible to build entirely new applications on top of Isoxya—including search engines and SEO products!

However, since Isoxya supports both processor and streamer plugins through the Isoxya interfaces, it’s not actually necessary to use this plugin at all. You could write a streamer for a different datastore such as PostgreSQL, Apache Hadoop, or Amazon Redshift instead. You could even skip persisting the data entirely, and connect it to a web app using WebSockets or to some alerting system.

what does it support?

  • index auto-creation using organisation and date (e.g. isoxya.f9b4a163-36a8-4b25-8958-d58e52a1a5bd.2019-05-01)

  • insert using the Elasticsearch Bulk API (Content-Type: application/x-ndjson)

  • deterministic auto-generated document ids (e.g. 9c8100c7642a06acc892c9696e55789ec0dd67ad0dee06a5c378343b5e47a969.1)

  • one-to-many support for crawled pages which result in multiple documents, based on plugin tag (e.g. org_proc.tag: spellchecker)

  • document metadata for position within sequence (data_i, data_n)

  • variable request limit (REQ_LIM)
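
To make the list above a little more concrete, here is a minimal Python sketch of how a streamer might assemble a Bulk API request body from these pieces. It is not the plugin's actual code. The function names, the use of SHA-256 over a caller-supplied key, the 1-based data_i, and the idea that the id suffix is the document's position are all assumptions made for illustration; only the index name pattern, the NDJSON content type, and the data_i and data_n field names come from the list.

```python
import hashlib
import json
from datetime import date


def index_name(org_id: str, day: date) -> str:
    """Build an index name of the form isoxya.<organisation>.<date>."""
    return f"isoxya.{org_id}.{day.isoformat()}"


def document_id(key: str, data_i: int) -> str:
    """Return an id that is always the same for the same key and position.

    ASSUMPTION: the hash is SHA-256 of a caller-chosen key string, and the
    suffix is the document's position. The plugin's real inputs may differ.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest}.{data_i}"


def bulk_body(org_id: str, day: date, key: str, docs: list) -> str:
    """Serialise documents as NDJSON for the Elasticsearch Bulk API.

    Each document becomes two lines: an action line and a source line.
    One crawled page may yield several documents (one-to-many), so each
    source line records its position (data_i) and the total (data_n).
    ASSUMPTION: data_i starts at 1.
    """
    data_n = len(docs)
    lines = []
    for data_i, doc in enumerate(docs, start=1):
        action = {
            "index": {
                "_index": index_name(org_id, day),
                "_id": document_id(key, data_i),
            }
        }
        source = dict(doc, data_i=data_i, data_n=data_n)
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The Bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    body = bulk_body(
        "f9b4a163-36a8-4b25-8958-d58e52a1a5bd",
        date(2019, 5, 1),
        "https://example.com/|spellchecker",  # hypothetical key
        [{"word": "recieve"}, {"word": "seperate"}],
    )
    print(body, end="")
```

The resulting string would be sent to the cluster's _bulk endpoint with Content-Type: application/x-ndjson. Because the ids are deterministic, streaming the same page again overwrites the existing documents rather than creating duplicates.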

is it ready?

Not quite yet, no; this is a preview of what is planned for this plugin in Isoxya 2. However, it’s based on stable Isoxya 1.5, which has reached a consistent interface, and which has already been used for crawling many live sites on the internet. There might be a couple of small changes to the JSON interfaces coming as work on Isoxya 2 progresses—likely just additional keys in JSON objects or small renames—but overall the preview code released today should give a pretty good idea of what it’s going to look like.

There is some more plugin code to work through, and it will soon be time to start going through the Isoxya API itself, splitting out from the existing Isoxya 1.5 enterprise edition, updating, documenting, and partly open-sourcing it to create the Isoxya 2.0 community edition. Stay tuned!