Multisite Processing

Multisite processing starts where standard processing in one site ends. The following sections describe the supported multisite processing concept and the deliverable milestones for implementing it.

Concept

Overview

Logical

In short, the concept behind multisite operations is to collect the final process package data sets of all Access Layers in one GRID site, write them to compressed binary logs, and ship these log packages incrementally to all other GRID sites.

Diagram: Overview Diagram

Technical

The technical details of the concept are:

  1. Access Layer nodes receive <processPackageDataSet/> elements via a SOAP call to the service method storeProcessingResultsAndFinalizeJobs(...).

    See "Processing Elements" for more information on this.

  2. Without multisite operations these elements are persisted in the DB, so an Access Layer is already capable of issuing the relevant DB calls for these packages, including conflict handling and de-duplication.

    With multisite operations enabled, the content is additionally shared with all other running Access Layer instances via reliable multicast (over the same connections used by the distributed cache) and then logged into a ZIP package on every connected ACL instance.

  3. Every X minutes the current ZIP package is labelled final and made available for transfer in the static files section of every ACL instance, using the standard HTTP protocol.
  4. After the package has been transferred to the target locations, a small agent reads the ZIP packages and replays the stored <processPackageDataSet/> elements against a local Access Layer.

    As every <processPackageDataSet/> element carries a "processingSite" URI, the Access Layer will NOT log data sets that originate from another GRID site. Therefore no endless message bouncing can happen, provided that the processingSite URIs differ across GRID sites (a minimal sketch of this check follows the list).
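
The origin check from step 4 can be illustrated with a minimal Java sketch. Class and method names are illustrative assumptions, not part of the actual implementation; the only premise taken from the concept is that every site is configured with its own processingSite URI.

    import java.net.URI;

    // Illustrative origin check: a data set is only written to the local ZIP
    // log if it originated at the local GRID site, so replayed data sets from
    // other sites are never logged (and thus never bounce back) as long as
    // every site uses a distinct processingSite URI.
    public final class OriginFilter {

        private final URI localProcessingSite;

        public OriginFilter(URI localProcessingSite) {
            this.localProcessingSite = localProcessingSite;
        }

        /** Returns true if the data set should be appended to the local ZIP log. */
        public boolean shouldLog(URI processingSite) {
            return localProcessingSite.equals(processingSite);
        }
    }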

Multisite Scope

The scope of multisite operations is to transfer the processing results of the GRID, including freshly processed content and re-processed content. When transferring this information, the implementations must be intelligent enough to avoid conflicts and duplicates.

The actual file content behind the processing results is never subject to multisite replication. If there is a need to transfer this sort of content, it simply has to be done as a manual transfer (taking into account that licensing restrictions may apply to such an exercise).

The ZIP package

When collecting many <processPackageDataSet/> elements there needs to be a way to differentiate between them. Several options can be considered: using an XML wrapper document that contains the elements, storing the elements in a custom chunked stream, or using the file system to store every element in a separate file.

As one aim is to reduce resource requirements as much as possible, the single large document option is not useful. Creating a chunked stream can optimize resource usage but requires a custom format definition. Using single files seems to be the best option, but it has one drawback when combined with a single ZIP package: ZIP has file count limitations and does not support solid compression (the backing dictionary is not shared between files, leading to worse compression).

As many values inside the <processPackageDataSet/> elements are always the same, solid compression is essential to achieve small transfer package sizes. Because of this the following approach is used (a minimal packaging sketch follows the list):

  • Every <processPackageDataSet/> element is stored as a numbered file inside an uncompressed ZIP file called "section-[number].zip". Multiple section files are created when a file count threshold is reached (e.g. 100 000), which avoids ZIP file count limitations.
  • One or more "section-[number].zip" files are stored in a compressed "coredb-updates-[timestamp].zip". Because the "section-[number].zip" files are themselves uncompressed, they are effectively compressed with so-called solid compression, leading to much smaller ZIP packages.
  • "coredb-updates-[timestamp].zip" follows the packaging rules of "static" content that is also offered by the GRID. This means it can optionally be encrypted and multiple packages can define links to each other to build package chains.

Important Note

The resulting ZIP package is fully standards-compliant and can be hand-crafted for testing purposes. The concept does not rely on any sort of proprietary format.

Example Structure

coredb-updates-2010-05-21T13:45:00.zip
        META-INF
                manifest.mf
                        requires-packages: coredb-updates-2010-05-21T13:30:00.zip, ...

        section-00001.zip
                00000001.xml
                00000002.xml
                00000003.xml
                ...

        section-00002.zip
                00000001.xml
                00000002.xml
                00000003.xml
                ...

        ...
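
A replay agent could read the package chain information shown above roughly as follows. The attribute name and manifest location are taken from the example structure; the class name and return type are illustrative assumptions.

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Arrays;
    import java.util.List;
    import java.util.jar.Manifest;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    // Illustrative reader for the package chain: extracts the "requires-packages"
    // attribute from META-INF/manifest.mf so a replay agent can verify that all
    // predecessor packages have already been applied.
    public final class PackageChainReader {

        public static List<String> requiredPackages(ZipFile packageFile) throws IOException {
            ZipEntry entry = packageFile.getEntry("META-INF/manifest.mf");
            if (entry == null) {
                return List.of(); // no chain information present
            }
            try (InputStream in = packageFile.getInputStream(entry)) {
                Manifest manifest = new Manifest(in);
                String value = manifest.getMainAttributes().getValue("requires-packages");
                return value == null
                        ? List.of()
                        : Arrays.asList(value.split(",\\s*"));
            }
        }
    }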

Milestones

MM1: Basic Replication

Basic replication requires that a single Access Layer instance is capable of writing a new ZIP package every "X" minutes and dumping it into a defined output directory.

In addition, a command-line agent exists that can be used to replay the ZIP packages against an Access Layer instance (a sketch of its core loop follows below).

If multiple Access Layer instances exist and are creating packages, every instance will need to write the output into a separate folder and all packages need to be collected, transferred and replayed using the agent (the replay can be done against one Access Layer instance).
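
The core loop of such a replay agent could look like the following sketch. The Consumer callback stands in for the actual call against the local Access Layer (e.g. the SOAP service mentioned earlier) and is an assumption, as are the class and method names.

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.function.Consumer;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    // Illustrative core of a command-line replay agent: walks the outer package,
    // opens every inner section ZIP and hands each stored <processPackageDataSet/>
    // XML document to a caller-supplied replay action.
    public final class ReplayAgent {

        public static void replay(Path packageFile, Consumer<byte[]> replayAction) throws IOException {
            try (ZipInputStream outer = new ZipInputStream(Files.newInputStream(packageFile))) {
                ZipEntry outerEntry;
                while ((outerEntry = outer.getNextEntry()) != null) {
                    if (!outerEntry.getName().startsWith("section-")) {
                        continue; // skip META-INF and other entries
                    }
                    ZipInputStream section =
                            new ZipInputStream(new ByteArrayInputStream(outer.readAllBytes()));
                    ZipEntry innerEntry;
                    while ((innerEntry = section.getNextEntry()) != null) {
                        if (innerEntry.getName().endsWith(".xml")) {
                            replayAction.accept(section.readAllBytes());
                        }
                    }
                }
            }
        }
    }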

MM2: Failsafe Basic Replication

Extends basic replication by the option to link multiple Access Layer instances using the distributed cache connections, ensuring that incoming <processPackageDataSet/> elements are shared amongst all instances (a hypothetical sketch follows the list of advantages below).

This has two advantages:

  1. It's no longer required to collect packages from all Access Layer instances as all ZIP packages will be the same.
  2. If an Access Layer instance dies, the others still contain the content that is scheduled for replication.
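
A purely hypothetical sketch of this sharing hook: the Channel interface is an assumed placeholder for the reliable multicast transport riding on the distributed cache connections, not an existing API, and it is assumed the channel does not loop published messages back to the sender.

    import java.util.function.Consumer;

    // Hypothetical MM2 sharing hook: every incoming data set is logged locally
    // and published to all other Access Layer instances, so every instance
    // appends the same content to its ZIP log.
    public final class DataSetSharing {

        public interface Channel {
            void publish(byte[] dataSetXml);
            void subscribe(Consumer<byte[]> listener);
        }

        private final Channel channel;
        private final Consumer<byte[]> zipLogger;

        public DataSetSharing(Channel channel, Consumer<byte[]> zipLogger) {
            this.channel = channel;
            this.zipLogger = zipLogger;
            // Data sets shared by other instances go straight into the local ZIP log.
            channel.subscribe(zipLogger);
        }

        // Called for every <processPackageDataSet/> received via the SOAP service.
        public void accept(byte[] dataSetXml) {
            zipLogger.accept(dataSetXml);  // log locally
            channel.publish(dataSetXml);   // and share with all other instances
        }
    }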

MM3: Automated Replication

Automated replication extends the second milestone by running an additional agent that automatically keeps track of changes and transfers all changed packages from all other GRID sites it knows about.

Afterwards it automatically calls the replay agent. Implementing this concept allows GRID sites to be replicated without any manual interaction.
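
The transfer step of this agent could be sketched as follows, assuming the packages are plain files offered below the static files URL of the remote site; the class name, parameters and HTTP client choice are assumptions.

    import java.io.IOException;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Path;

    // Illustrative transfer step for MM3: fetch a changed package from a remote
    // GRID site over standard HTTP and store it locally, ready to be handed to
    // the replay agent.
    public final class TransferAgent {

        private final HttpClient client = HttpClient.newHttpClient();

        public Path download(URI packageUri, Path targetFile)
                throws IOException, InterruptedException {
            HttpRequest request = HttpRequest.newBuilder(packageUri).GET().build();
            HttpResponse<Path> response =
                    client.send(request, HttpResponse.BodyHandlers.ofFile(targetFile));
            if (response.statusCode() != 200) {
                throw new IOException("Unexpected HTTP status " + response.statusCode());
            }
            return response.body();
        }
    }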

MM4: Automated & Secured Replication

Secured replication adds signing and encryption of the packages to ensure that the information really comes from a valid GRID site.

This involves the Access Layers, which need to be capable of signing and encrypting the packages, and also the replay agent, which needs to be able to validate and decrypt the content.
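
A minimal signing and verification sketch using the standard Java security API is shown below. The algorithm choice and key handling are assumptions, and the encryption of the package content itself is not shown.

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.GeneralSecurityException;
    import java.security.PrivateKey;
    import java.security.PublicKey;
    import java.security.Signature;

    // Illustrative MM4 step: the Access Layer signs the finished package file,
    // and the replay agent verifies the signature before replaying its content.
    public final class PackageSigner {

        private static final String ALGORITHM = "SHA256withRSA"; // assumed algorithm

        public static byte[] sign(Path packageFile, PrivateKey key)
                throws IOException, GeneralSecurityException {
            Signature signature = Signature.getInstance(ALGORITHM);
            signature.initSign(key);
            feed(signature, packageFile);
            return signature.sign();
        }

        public static boolean verify(Path packageFile, byte[] signatureBytes, PublicKey key)
                throws IOException, GeneralSecurityException {
            Signature signature = Signature.getInstance(ALGORITHM);
            signature.initVerify(key);
            feed(signature, packageFile);
            return signature.verify(signatureBytes);
        }

        // Stream the package file through the signature engine.
        private static void feed(Signature signature, Path packageFile)
                throws IOException, GeneralSecurityException {
            try (InputStream in = Files.newInputStream(packageFile)) {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    signature.update(buffer, 0, read);
                }
            }
        }
    }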