

Context
Our client was a family entertainment operator in the US and Canada with annual revenue of around $80M. It delivered services to consumers across a large number of retail locations, and payments were accepted through various vendor-managed POS systems. As a result, all data generated by POS transactions remained in the vendors' systems, severely limiting the client's access: data could only be retrieved through proprietary report generators.
These generators offered limited visualization options, and users could only download a few days' worth of raw transaction data at a time due to its size. Furthermore, the management team wanted to run a number of advanced analyses, including price elasticity and asset utilization, that would require direct access to 3-4 years of raw transaction data and the ability to run SQL or Python algorithms against it.
Objectives
The client asked us to set up a data warehouse as the repository for all of their transaction data currently sitting in various vendor-managed systems. This data warehouse would become the primary source for all analytics and reporting going forward.
Project delivery
We decided to use Databricks to handle data replication for this client, primarily for its flexibility in handling different data sources (particularly live streaming, which one of the vendors proposed for its transactions) and its support for Python in complex data transformations.
Because multiple vendor systems were involved, we took a phased approach, one phase per vendor. The data ingested in each phase could immediately be used to generate useful analytics.
For the first vendor, whose system manages all non-card transactions, the vendor extracted CSV files weekly and uploaded them via an SFTP client into our client's Azure Blob Storage account. We then used Databricks to pick up the files and orchestrate a series of Python notebooks that execute the delta-load logic. A complicating factor was the way this vendor's system closes the data each month (similar to how accounting teams close the books), which required two separate pipelines: one for the current period and one for the prior period, with the transition between them happening 4-8 days after the end of each calendar month.
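The core of that logic can be sketched in plain Python. This is a simplified illustration, not the production notebooks: it assumes an 8-day close window (the source states only a 4-8 day range), a hypothetical `txn_id` key column, and an in-memory dict standing in for the warehouse table.

```python
from datetime import date, timedelta

# Assumed close lag: the vendor is treated as closing a month 8 days
# after month end (the actual transition was 4-8 days).
CLOSE_LAG_DAYS = 8

def periods_to_load(run_date: date) -> list[date]:
    """Return the month start dates a weekly run must reload.

    The current month is always reloaded. The prior month is reloaded
    too until the vendor closes it, i.e. for the first CLOSE_LAG_DAYS
    days of the new month. This mirrors the two-pipeline split: the
    prior-period pipeline runs only during the close window.
    """
    current = run_date.replace(day=1)
    prior = (current - timedelta(days=1)).replace(day=1)
    if run_date.day <= CLOSE_LAG_DAYS:
        return [prior, current]
    return [current]

def apply_delta(table: dict, rows: list[dict]) -> None:
    """Upsert incoming CSV rows into the target table by key.

    New transaction IDs are inserted; re-delivered IDs overwrite the
    earlier version, so reprocessing a period is idempotent.
    """
    for row in rows:
        table[row["txn_id"]] = row
```

A run early in the month (e.g. March 5) would reload both February and March; a run after the close window (e.g. March 20) touches only March. In the actual project the upsert was executed by Python notebooks against warehouse tables rather than a dict, but the merge-by-key idea is the same.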
Outcome
We turned the data warehouse on after the pipelines for the first vendor were deployed and tested. Then, at the client's request, we used the data warehouse to create an automated dashboard that has since saved the company's main analyst two days of work per month (see this case study).
We are currently in phase 2 of the project. Completing this phase will allow the management team to run some of the more advanced analyses they were looking to do.
Deployment of the data warehouse and phase 1 ETL pipelines for 8 data objects, with weekly refresh

