A Journey Towards Bare-Bones HR Analytics Infrastructure | AIHR Analytics


The role of data in the HR function has grown substantially in the last decade, and that growth looks set to continue. In an ideal world, an HR professional should be able to access data that are:

  1. Fresh (recently updated)
  2. Clean (consistent application of definitions)
  3. Explorable (able to be filtered and pivoted to the specific use case)

This sounds reasonably simple. But once you take into account legacy data entry processes, inconsistent naming conventions, siloed systems, and the strong need to protect employee privacy, building HR data infrastructure is not for the faint of heart. This piece walks through my journey in this space, which I hope will help others who decide to tread a similar path.

The Starting Line

Like many others, I was brought in to run a people analytics function for a company that wanted to be more data-driven with respect to HR. I was under the impression that all the data were in a nice relational database, and was excited to be in a place where I could tear into the data and start providing actionable insights.

I found myself somewhere quite different, as I’m sure many other HR analytics professionals can relate to. I joined an HR organization that lived primarily in manual Excel reports pulled from siloed systems of varying integrity. That was when I realized the starting line for my journey was a good deal further back than I had thought.

An Interim Solution

Fortunately, I was blessed with a lot of flexibility to build a solution. As I thought it through, I realized I needed an interim fix before I could launch into building a larger one. As someone with some background in Python, R, and Tableau, I started building something that mixed manual effort and automation.

The Interim Solution consists of the following layers:

  • Collection Layer: Depending on the data source, data were collected either through a manual Excel report or through an API run.
  • Staging Layer: The collected data were cleaned and restructured into a more usable format using set Python or R scripts.
  • Visualization Layer: The staged data were uploaded to Tableau, where set logic augmented and visualized the data in a specified way.
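
As an illustration of what a staging step like this might look like, here is a minimal Python sketch. It is not the actual script from this project: the column names, the `DEPARTMENT_ALIASES` table, the sample export, and the `stage_rows` function are all hypothetical, and a CSV string stands in for a manual Excel export. The idea it shows is simply applying one consistent set of naming rules to headers and values before anything reaches the visualization layer.

```python
import csv
import io

# Hypothetical mapping from inconsistent department labels to one canonical name.
DEPARTMENT_ALIASES = {
    "hr": "Human Resources",
    "human resources": "Human Resources",
    "eng": "Engineering",
    "engineering": "Engineering",
}


def stage_rows(raw_csv_text):
    """Clean raw export rows into a consistent, analysis-ready list of dicts."""
    reader = csv.DictReader(io.StringIO(raw_csv_text))
    staged = []
    for row in reader:
        # Normalize header names: trim whitespace, lower-case, use underscores.
        # Values are trimmed too; missing values become empty strings.
        clean = {
            key.strip().lower().replace(" ", "_"): (value or "").strip()
            for key, value in row.items()
        }
        # Map known department aliases to a canonical label; keep unknown labels as-is.
        original = clean.get("department", "")
        clean["department"] = DEPARTMENT_ALIASES.get(original.lower(), original)
        staged.append(clean)
    return staged


# Made-up sample export with messy headers and inconsistent department names.
RAW_EXPORT = """Employee ID, Department ,Hire Date
101, HR ,2019-03-01
102,eng,2020-07-15
103,Human Resources,2021-01-04
"""

staged = stage_rows(RAW_EXPORT)
for record in staged:
    print(record)
```

Keeping the alias table in one place means a new spelling of a department only has to be handled once, rather than in every downstream report.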

Early Attempts at a Longer-Term Solution

Clearly the interim solution specified above was better than trying to do everything manually in Excel, but it was far from perfect. While the manual effort in processing and visualizing the data was greatly reduced, the solution still required manual steps (namely, file downloads and uploads) to move the data from one layer to the next.

That friction motivated a consistent effort to build a longer-term solution. There were a couple of products out there that were “one-stop shops”; that is, they handled all of the data infrastructure, cleaning, and visualization for you.

One such product was One Model. Like other products, it could set up both the data infrastructure and the visualization, but it could also handle just the data infrastructure. This was attractive to me, as it would allow me to apply the various “cleaning” methods my organization’s data needed before visuals hit the end user.

A Door Closes

Given the flexibility with One Model, I ended up writing a proposal featuring it, along with another couple of products, for one of the big decision-makers at my organization. A few days before I was about to send it, though, the working world threw me a curveball: it was announced that my company would be acquired.

While this was exciting on a lot of levels, this change forced me to reevaluate my proposal. I had built something on paper that I believed would be an incredibly good solution for the needs of my organization. The problem was that, given the acquisition, those needs could very well change by the time it was implemented. In the end, I concluded that I needed to table building the data infrastructure until I understood the organization’s future needs better.

A Window Opens

A few months after the acquisition was complete, I decided I was ready to go back to the drawing board. While the acquisition had closed the door on my last solution, it also opened up new opportunities. I was able to chat with the people analytics team at the new parent company, and while their own solution was much more robust due to their size, they guided me towards some novel ways to approach the problem.

One product I started exploring was Azure Data Factory / Pipelines. The ability to schedule my API data pulls, use virtual machines to do the heavy lifting, and get all of my data into a relational database was attractive. Unfortunately, I found that these products required far more configuration than made sense for my simple use case.

I then happened upon a product called Databricks. It had a lot of functionality similar to the products mentioned above, but with a cleaner user interface and easier configuration. It also offered the option to host databases locally, which was exciting for me. Unfortunately, I learned that the way it hosted databases wasn’t compatible with the other tools I wanted to use.

A Tentative Solution

Despite the database issues, Databricks still had nice scheduling functionality, along with the ability to spin up virtual machines on-demand to do the work. As such, I decided to feature it in my tentative solution:

The Tentative Solution consists of the following layers:

  • Collection Layer: Data are collected from all data sources/tables via API using Databricks.
  • Staging Layer: Data are cleaned and restructured using Python and R, then dumped into a SQL server, all within the same Databricks script.
  • Visualization Layer: Tableau Online queries the SQL server on a daily basis, then applies set logic to augment, visualize, and control access to various streams of data.
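
To make the flow of these layers concrete, here is a rough Python sketch of a single script that collects, cleans, and loads data into a SQL table, followed by the kind of query a visualization tool might run. It rests on several assumptions that are not from this project: `collect_records` is a hypothetical stand-in that returns made-up rows instead of calling a real HR system API, an in-memory SQLite database stands in for the SQL server, and the table and column names are invented. Scheduling is left out, since that would be handled by the job scheduler rather than by the script itself.

```python
import sqlite3


def collect_records():
    """Collection layer stand-in: a real job would call the HR system's API here."""
    return [
        {"employee_id": "101", "department": " hr ", "status": "Active"},
        {"employee_id": "102", "department": "Engineering", "status": "active"},
        {"employee_id": "103", "department": "HR", "status": "Terminated"},
    ]


def clean_record(record):
    """Staging layer: trim whitespace and apply consistent labels and types."""
    department = record["department"].strip()
    return (
        int(record["employee_id"]),
        "HR" if department.lower() == "hr" else department,
        record["status"].strip().lower(),
    )


def load(conn, rows):
    """Replace the staged table in full so that reruns do not duplicate rows."""
    conn.execute("DROP TABLE IF EXISTS employees")
    conn.execute(
        "CREATE TABLE employees ("
        "employee_id INTEGER PRIMARY KEY, department TEXT, status TEXT)"
    )
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(conn):
    """Collect, clean, and load in one pass, as a single scheduled script would."""
    rows = [clean_record(record) for record in collect_records()]
    load(conn, rows)


conn = sqlite3.connect(":memory:")  # stand-in for the real SQL server
run_pipeline(conn)

# Visualization layer stand-in: the kind of query a dashboard might issue.
headcount = conn.execute(
    "SELECT department, COUNT(*) FROM employees "
    "WHERE status = 'active' GROUP BY department ORDER BY department"
).fetchall()
print(headcount)
```

The full-refresh load (drop and recreate) is the simplest choice for small HR tables; an incremental load would be a reasonable alternative if the source tables were large.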

Lessons Learned

While this is not perfect, this process works much better than the interim solution noted above. I have opted to call it my “tentative solution” not because I plan on changing it anytime soon, but because I have learned that HR data infrastructure is not a destination. You aren’t ever “done” because the needs of your organization will change, as will the tools.

Having experienced this, I think the best you can do is to keep an open mind while also knowing your own limitations. As HR analytics professionals, we are not independent of the systems and processes we create; we are a part of them. For HR data infrastructure to function smoothly, then, we have to be comfortable with the tools we are using and the processes we are running. If not, we run the risk of creating something that is not only unsustainable but may also lead our stakeholders to the wrong conclusions.

In my experience, the journey towards HR data infrastructure is long and winding. When things get difficult, it’s reassuring to remember that many of us are treading the same path.

Disclaimer: This piece highlights my own journey and thinking on HR data infrastructure, and does not necessarily reflect my employer’s philosophy. Similarly, any comments on events or products reflect my perspective alone.
