Andy’s Neural Works (formerly Andrew’s Folly)

A wise folly can lead to great things!


The All Roads Project – Choosing the Database Engine

Published by

on

Introduction

Picking a database engine for an analytics project is a challenging task. The choice is often determined by what has already been established in the company ecosystem. For The All Roads Project, I have some flexibility since it’s a personal project. Budget is a factor, though. Fortunately, the world of Open Source Tools includes many, mature engines. These are used all throughout enterprise level systems. Given many choices, how to pick the best one? That is the purpose of the work in this article.

Starting point

Let’s start by getting a better grasp of the data. In the previous step, the texts for the Catechism of the Catholic Church (CCC) and Bible were extracted. I did make some notes from an initial view. The data can be summarized as a networked set of structured documents. This gives us a good indicator that relationships amongst data are critical. Still, multiple options need to be explored.

Think about the data organization as a critical initial step. We know the hierarchical structure starting with book information being the root, then pages, and finally paragraphs. There are also references dispersed amongst the text to consider too. There are even identifiers for different parts. Most certainly, more details must be thought out, but we can see a basic structure despite there being free-flowing text.

This does seem like relational databases are the leading option. I don’t want to close the door till that nail is really hammered down in my head. Again, multiple options need to be explored. Let’s consider some alternative options.

Types of databases

There is an ever growing list of database engines available. Here are just a few:

  • Hierarchical
  • Relational
  • Graph
  • Document 
  • Time-series

Basically, I need to choose between either a Relational Database Management System (RDBMS) or NoSQL1 store. This is a challenge as many engines have grown to include functionality on both sides. For example, some relational databases can store unstructured data in key-value pairs. I take this as a general good. Still, it is important to avoid rework so let’s jot down some basic requirements.

Requirements

R1. Document Storage

This is the biggest requirement. I will need to store free flow text and related references as well as identifiers. At this point, we’ve identifed an organization of records by book, page, and paragraph. There are also references. In essence, there is structure as well as the need for large text storage.

R2. Well Performing Joins

Since citations are a lookup to other books, you can quickly imagine the need for multiple database joins. A paragraph from one book can reference multiple books and multiple pages in another. This means there could be multiple index seeking efforts at the table level.

R3. Fast Searching

Searching through text is going to be critical. There are identifiers in these books. Indexing on these will help to avoid full table scan scenarios. The engine will need to have good key/indexing strategy options.

R4. Create lineage graph data

Here is an interesting item to consider. I want to create graphs of the relationships. There are graph databases that are a NoSQL variety of engine. These specialize in the ability to create nodes, edges, and the metrics related to social networking. This would certainly be a useful set of features for creating lineage.

R5. Data Consistency

NoSQL databases focus on availability vs ACID (Atomicity, Consistency, Isolation, Durability) compliance. There are different standards like BASE (Basically Available, Soft State, Eventual Consistency) that are followed. This project lends itself to ACID properties.

R6. Scale/Expansion/Alterations 

Most engines should handle alterations and additions easily. I will note that a NoSQL database provides the most flexibility. A benefit would be that NoSQL scales very well. I do not expect the file store to be unwieldy, but having the ability to grow is a good benefit.

R7. Hardware Requirements 

For my use, this needs to run on my MacBook Air with it’s M1 processor and 16GB of RAM. It’s actually powerful despite being a few years old. Still, this is my main machine so I don’t want to take it down.

There is also the possibility of using my Raspberry Pi 4. That might be a later experiment. I will keep that in mind. The Pi is good hardware but is not going to break any record benchmarks

R8. Cloud

I am considering cloud options if this gets a public release. Whatever engine I choose will need a host that I can afford. I could start off in the cloud too but want to contain the database locally first. I don’t want to be tied to external services during development. Again, there are usually costs to cloud computing. There is also the dependency on network consistency.

R9. Tool and Language Support

Since this is an analytics project, Python and R must be able to connect.

Too many options

That ends the core set of requirements. Now, let’s look at options I found. I do have personal preferences but those are for closed-source systems. There is certainly more flexibility for the large number of Open Source varieties. I did some basic research using the above requirements. From initial research, I went for the more talked about or popular packages. Here are the engines for consideration.

DB Option 1. MySQL2 – RDBMS

Consideration for MySQL is a given. This database is high performing and proven. In fact, one of its benefits is its performance in driving many of the websites on the Internet. There are plenty of resources to assist in achieving success with this database. A possible hurdle is that it’s functionality is limited compared to other options. For example, the community edition has limitations which require going to the commercial versions. This raises a flag of concern as I do not want to get into a situation where I reach a paywall.

DB Option 2. PostgreSQL3 – RDBMS/Object-Relational

This is another proven RDBMS. What sets this engine in its own light are the object-relational capabilities. This provides ability to have NoSQL options (albeit not its primary target audience). Thankfully, it is supported by various drivers and development languages.

DB Option 3. MongoDB4 – NoSQL

This is popular with the NoSQL crowd. It has been around for a while. My concern with this is not in its document storage capabilities but with the need for relationships to be built. It would not be a show stopper but would be something that might make development a bit more of a challenge.

DB Option 4. Neo4j5 – Graph

This is a dark horse candidate. This database is known for its ability to create network graphs. Having an optimized engine for this functionality would be a benefit. The issue with Neo4j would be that it is not meant for other aspects. I could see this as being used on its own, though. Maybe have an extract process that pushes to Neo4j for network graphs.

Decision

As stated in the beginning, choosing a database engine for a project is a challenging task. There are many options to consider. In this case, I decided to go with a RDBMS since there are relationships to be built. This leaves MySQL and PostgreSQL. Even though both could be used, I will go with PostgreSQL. There are features of PostgreSQL, such as more varied data types (JSON and JSONB, for example), that are going to prove useful. There is also the possible need to move up in editions for MySQL that concern me.

With the engine chosen, I will move onto the next step of defining the table structures. My goal will be to keep it simple. This will be the topic of the next article.

Thank you

References

[1] IBM. SQL vs. NoSQL Databases: What’s the Difference? June 12, 2022. Retrieved from: https://www.ibm.com/blog/sql-vs-nosql/

[2] Oracle. MySQL. Retrieved from: https://www.mysql.com/

[3] The PostgreSQL Global Development Group. PostgreSQL: The World’s Most Advanced Open Source Relational Database. Retrieved from: https://www.postgresql.org/

[4] MongoDB, Inc. MongoDB: The Developer Data Platform. Retrieved from: https://www.mongodb.com/

[5] Neo4j, Inc. Neo4j Graph Database & Analytics | Graph Database Management System. Retrieved from: https://neo4j.com/

One response to “The All Roads Project – Choosing the Database Engine”

  1. […] With the database engine chosen, we can now start building some objects! It’s time to create the database structures. This provides a direction for the project’s data flow. Let’s keep this as simple as it can be. I do expect changes as the project goes on. We’ll follow these steps: […]

    Like