Big data analytics for SCM and asset monitoring data is handled by DataX. SCM Analytics uses this framework and has multiple apps that generate hundreds of metrics across thousands of dimension combinations. These apps run at different frequencies, depending on requirements. The compute cluster and analytics storage are shared by all the apps. The analytics apps running on this platform are listed below.
- Master data analytics
- Big data exports
- Event summary analytics
DataX is a data platform that orchestrates Spark jobs; its primary choice of compute cluster is a Hadoop YARN cluster. It provides abstractions on top of Spark to simplify the development of Spark jobs. Spark jobs can be composed into a directed acyclic graph (DAG) using Job groups, Pipelines, Meshes and Apps. The platform is generic and supports any type of big data workload, including analytics, exports, and event summarisation. At its core it is an ETL tool, with prebuilt tools to extract data from sources, transform it using Spark SQL or Scala classes, and aggregate and load it into any data store. The primary components of the platform are the Application manager, the Spark cluster, and Storage.
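As a rough illustration of what arranging jobs into a directed acyclic graph implies for execution order, the sketch below uses Python's standard `graphlib` module. It is not DataX code: the job names and the `execution_order` helper are hypothetical, and the real platform expresses these dependencies through its own Job group, Pipeline, Mesh and App specifications.

```python
from graphlib import TopologicalSorter

# Hypothetical job names (not taken from DataX).
# Each job maps to the set of jobs that must finish before it starts.
dependencies = {
    "extract_events": set(),
    "extract_master_data": set(),
    "transform_events": {"extract_events", "extract_master_data"},
    "aggregate_metrics": {"transform_events"},
    "load_to_store": {"aggregate_metrics"},
}


def execution_order(deps):
    """Return a run order in which every job follows all of its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
```

The two extract jobs have no dependency on each other, so their relative order is not fixed. If the dependencies contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is why the job graph has to be acyclic.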
A Play application with RESTful and messaging interfaces to schedule and execute DataX apps. It is primarily responsible for running through a configured app's components and triggering their execution in the order defined in the specification. It tracks and manages the state of each run in its own database (MongoDB).
Meshes can be triggered in one of the following ways.
- Ad hoc/on demand :- A REST endpoint fires an App and accepts the job configuration as the POST body.
- Messaging :- Job requests are accepted via messaging (ActiveMQ).
- Scheduled :- Cron-style scheduling triggers the jobs at the scheduled times.
- Scheduled runs also allow dynamic configuration generation to customise the period that the next aggregation should cover. For example, if an aggregation job has to run hourly, it typically needs to pick up data that was created or changed in the last hour. The Mesh executor passes the period from the previous run to the scheduled time.
- This behaviour can also be overridden by the App itself.
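The period handed to a scheduled run can be sketched as below. This is an assumption-laden illustration rather than the Mesh executor's actual logic: the `aggregation_period` function, its parameters, and the one-hour fallback used when there is no previous run are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def aggregation_period(
    previous_run: Optional[datetime],
    scheduled_time: datetime,
    default_lookback: timedelta = timedelta(hours=1),  # assumed fallback for a first run
) -> Tuple[datetime, datetime]:
    """Return the (start, end) period that the next aggregation should cover.

    The period runs from the previous run to the scheduled time. When no
    previous run is recorded, it falls back to `default_lookback` before
    the scheduled time (an assumption made for this sketch).
    """
    start = previous_run if previous_run is not None else scheduled_time - default_lookback
    if start > scheduled_time:
        raise ValueError("previous run is later than the scheduled time")
    return start, scheduled_time


# Example: an hourly job scheduled for 11:00 UTC whose previous run was at 10:00 UTC.
previous = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
scheduled = datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc)
period = aggregation_period(previous, scheduled)
```

Deriving the start from the previous run rather than from a fixed interval means that a delayed or skipped run does not leave a gap: the next run simply covers a longer period.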
Analytics Storage (Cassandra)
Analytics generates time series data. Each aggregated data point is the value of a metric at a specific instant in time. Typical SQL databases are not efficient for storing such columnar data. Many NoSQL databases are optimised for storing and retrieving time series data and can also scale linearly. Cassandra is one such database: it is an AP system (favouring availability and partition tolerance) and is very efficient at time series storage. Internally it uses sorted string tables (SSTables), which keep data sorted on disk and so suit time-ordered reads. It is a masterless distributed database inspired by Amazon's Dynamo. It supports time series retrievals that usually take in the order of tens of milliseconds. It has been shown to scale to millions of requests per second for both write and read operations. It also provides out-of-the-box support for cross data centre replication between Cassandra clusters.
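To show why sorted storage suits time series reads, here is a toy in-memory sketch. It is not Cassandra code and does not use any Cassandra API; the `SortedSeries` class is hypothetical. It keeps the points of a single metric sorted by timestamp, so a time-range read reduces to two binary searches and one contiguous slice.

```python
import bisect


class SortedSeries:
    """Toy store for one metric, with points kept sorted by timestamp."""

    def __init__(self):
        self._timestamps = []
        self._values = []

    def write(self, ts, value):
        # Insert at the position that keeps the timestamps sorted.
        i = bisect.bisect_right(self._timestamps, ts)
        self._timestamps.insert(i, ts)
        self._values.insert(i, value)

    def read_range(self, start, end):
        """Return the (timestamp, value) pairs with start <= timestamp < end."""
        lo = bisect.bisect_left(self._timestamps, start)
        hi = bisect.bisect_left(self._timestamps, end)
        return list(zip(self._timestamps[lo:hi], self._values[lo:hi]))


series = SortedSeries()
for ts, value in [(30, "c"), (10, "a"), (20, "b")]:  # writes arrive out of order
    series.write(ts, value)

window = series.read_range(10, 30)
```

Because the data is already ordered, the read never scans points outside the requested window.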