Scope Your Data Capture

*An upfront caveat for this post: do not read this article as a path to designing a full-blown system. The point of the whole discussion is the opposite: to avoid creating a massive system when all you need is a spreadsheet.

It is easy to criticize a data source for being off-topic and messy. It is another matter entirely when that data source is something you build and maintain yourself. How do you go about creating a way of capturing data that is simple yet on-topic and accurate?

Building a full-fledged data model for storage can be a long, complicated effort. After all, you need to go from capturing business data definitions all the way to physical implementation while keeping everything organized and relevant.

Scope definition is critical. Let us keep this simple and stick to a tight path. In this example, let us say the effort is to capture basic information for a company that manufactures glass panes. These data will eventually be used for sales order analytics built in a language such as R, Python, or SAS.

The scope is now defined. This one does not seem challenging, but can scope be broken? The answer is yes. You must be able to recognize when it happens so you can realign efforts with the goal.

The first step for data capture is to determine what data need to be captured. This requires you to go to the business and ask the question “what defines a sale?” For example, a business that creates glass panes for windows might write a definition such as:

“Our business manufactures panes of glass. These panes have standard colors, opacity, shapes, and sizes sold and shipped directly to our clients.”

That is a lot of information crammed into two sentences, and it can certainly be dissected into requirements.

You can already see that there are conceptual entities for glass pane standards that need to be captured, along with client information and ordering. Maybe you are already designing it all out in your head. Do not go too far down that path, or you might over-engineer it.

A simple requirements-analysis exercise is to list out the entities and their attributes. Consider the glass pane standards entity and list its attributes:

  • A glass pane has a value for opacity. An example is 10%.
  • A glass pane has a color. An example is “rose.”
  • A glass pane has a width. An example is…
  • A glass pane has a height.

And so on.
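
As a quick sketch, that attribute list maps directly onto a flat table. Here is a minimal R version (the column names, units, and example values are illustrative assumptions, not part of the business definition):

```r
library(tibble)

# Glass pane standards entity: one attribute per column, one row per
# standard configuration (names and values are assumed for illustration)
glass_pane_standards <- tribble(
  ~opacity, ~color,  ~width_in, ~height_in,
  0.10,     "rose",  24,        36,
  0.50,     "clear", 24,        48
)
```

Keeping it this flat is deliberate: each bullet above becomes a column, nothing more.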

You also list out the relationships:

  • A customer has a shipping address.
  • A customer has a glass pane order.

This goes on until you have covered all of your entities and relationships. You could create a simple conceptual diagram to share your thinking with the business sponsor, though for a case this small it may be more than you need. The result of your effort should be a data contract where everyone agrees on what needs to be captured. Once it is agreed upon, you can move on to design and follow through with implementation.
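
The relationship bullets can be sketched the same way: a customer “has” an order when both records share a key. A minimal R illustration, where client_id is an assumed key name and every column is a hypothetical stand-in:

```r
library(tibble)

# A customer record and the order that belongs to it,
# linked by a shared key (client_id is an assumed name)
customers <- tribble(
  ~client_id, ~client_name,   ~shipping_address,
  1,          "Acme Windows", "42 Main St, Springfield"
)

orders <- tribble(
  ~client_id, ~pane_color, ~quantity,
  1,          "rose",      25
)
```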

If you have ever used a structured RDBMS, you know there are levels of normalization as well as key analysis that need to take place. This can be quite a task. With the business mission and a few entities and attributes defined, we can start listing out tables like this:

  • Client Contact
  • Client Addresses
  • Client Orders
  • Glass Pane Standard Definition
  • Glass Colors
  • Glass Shapes

What else?

Wait!

Ask yourself whether this implementation is getting out of control by trying to follow a design pattern out of a textbook. The approach went from simple “Client” and “Glass Pane” entities to a series of tables with complex relationships. It now requires the analyst writing the analytics to go through a complicated data integration process every time the source is used. Remember what you are trying to accomplish.

The goal is for an R developer to generate reporting figures on sales (ex: how many glass panes of size X and color Y were sold to customer Z). Not only are highly normalized data structures difficult to maintain in a spreadsheet, but joining the sets requires coding effort and testing to make sure nothing is lost. More joins mean more complexity, and complexity often leads to lost time and maintenance frustration.
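
To make that join burden concrete, here is a hedged sketch of what the analyst would write against the normalized table list above. Every table and column name here is a hypothetical stand-in; the point is the number of joins, not the schema:

```r
library(dplyr)

# Hypothetical normalized tables (tiny in-memory stand-ins)
clients       <- tibble(client_id = 1, client_name = "Acme Windows")
client_orders <- tibble(order_id = 10, client_id = 1,
                        pane_id = 100, quantity = 25)
pane_defs     <- tibble(pane_id = 100, color_id = 5, shape_id = 7,
                        width_in = 24, height_in = 36)
glass_colors  <- tibble(color_id = 5, color = "rose")
glass_shapes  <- tibble(shape_id = 7, shape = "rectangle")

# Four joins before a single sales figure can be computed
client_orders |>
  left_join(clients,      by = "client_id") |>
  left_join(pane_defs,    by = "pane_id") |>
  left_join(glass_colors, by = "color_id") |>
  left_join(glass_shapes, by = "shape_id") |>
  group_by(client_name, color, width_in, height_in) |>
  summarise(panes_sold = sum(quantity), .groups = "drop")
```

Each additional join is another place where rows can be silently dropped or duplicated, which is exactly the testing burden described above.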

Going down this expanded path also means that the project is building out a larger system rather than a basic data source. Take a few breaths and remember the goal. Let us keep this simple so everyone can sleep comfortably at night (versus having another database system to monitor). How about staying on target with the basics:

  • Client Contact Information
  • Glass Pane Configurations
  • Client Orders

That is both reasonable and more manageable than trying to build out a full system, and it stays within the scope of basic data capture. A single workbook in your spreadsheet application of choice, with three sheets holding the specific attributes, will do. This also helps with the down-the-road dependencies, since the file can be easily read into the statistical package of choice (ex: readxl).
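
Here is a minimal sketch of that downstream read, assuming a hypothetical workbook file with the three sheets listed above (the file name, sheet names, and column names are all assumptions):

```r
library(readxl)
library(dplyr)

path <- "glass_pane_capture.xlsx"  # hypothetical workbook name

# One sheet per entity, matching the list above
clients <- read_excel(path, sheet = "Client Contact Information")
panes   <- read_excel(path, sheet = "Glass Pane Configurations")
orders  <- read_excel(path, sheet = "Client Orders")

# The reporting question: panes of size X and color Y sold to customer Z
orders |>
  left_join(clients, by = "client_id") |>
  left_join(panes,   by = "pane_id") |>
  count(client_name, color, width_in, height_in,
        wt = quantity, name = "panes_sold")
```

Two joins instead of four, and no database to administer.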

Thinking creatively, you might even get away with doing everything in that spreadsheet application (you would be surprised what tools are available now; a few web and YouTube searches on the topic will turn up examples). That is for another project, though. We do not want to risk going out of scope again.

I hope I am not being abrupt. I have seen too many systems that are over-engineered (*note: experience and failures can be good: “Do not fear mistakes. You will know failure. Continue to reach out.” -Benjamin Franklin). The reason is often an initial scope that is not properly defined and followed through on. Walking this effort from end to end shows that data capture does not necessarily mean a massive effort.

I hope this approach proves useful for you. Capturing data is the goal, not building a data capture system. Using proven analysis techniques along with a pragmatic implementation results in a quick turnaround and longer-term happiness.

Thank You and Have a Great Day!