Staging Data from Semi-Structured Text Files to Structured Tables – The All Roads Project


Preface

In this article, I will step through the process of staging data. This entails reviewing, cleansing, and otherwise preparing the source data to be loaded into tables. It is one of the most underestimated challenges in data engineering.

Staging data properly requires thinking beyond a simple dump from source to target, though trying to do too much upfront can cause harm as well. My hope is that the process I lay out in this article will be useful to you in your own data projects. Always keep in mind that the goal is to ingest data that generates usable results.

The scope of this article is the Catechism of the Catholic Church (CCC) [1] text previously extracted. The CCC is the root source against which other sources will be examined, which makes getting it right essential for the project.

Introduction

Up until now, the work for The All Roads Project has been fairly high level. The final table structures were defined in the previous article. That step did get more detailed, but staging the data will bring us even closer to the atomic parts. We will have to review and deeply understand the source files. There are many files to go through, but I think we can find some sort of pattern.

Analysis

First, we will define the rules, then we will build out the solution. I think this will be fun; it's a bit like sleuthing out the original publisher's design decisions.

Finding a Pattern

Going past data requirements, I want to see if there is a common layout to the source files. To do this, I always start with a visual inspection. Fortunately, most are similar to the following:
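(Since the page itself cannot be reproduced here, the sketch below is an illustrative approximation of the layout. The actual headings, paragraph counts, and reference counts vary from page to page.)

    <page header: site title and navigation>
    IntraText - Text
    <chapter and section headings>
    Previous - Next
    <blank lines>
    1234 Paragraph text, possibly ending with citation numbers.12
    <blank lines>
    12 Cited work; another cited work.
    <blank lines>
    Previous - Next
    <page footer: CCC title line and copyright notice>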

Don’t be overwhelmed wondering why this is not a CSV, tab-delimited, or other delimited file. Think beyond the commonality found in those formats. If you look at it a bit more loosely, you will see that there seem to be five main parts:

  • Page Header
  • Chapter Headers
  • Paragraphs
  • References
  • Page Footer

There, that wasn’t difficult to do. Now, how do we identify these sections?

Parsing

Within each file, notice that there are common pieces of text. Look at the words “IntraText - Text.” Before that marker is the page header; right after it are the chapter headers. That’s two sections and we haven’t broken a sweat.

Next, there is “Previous - Next” across the source files. This marker consistently occurs before and after the block of paragraphs and references. That’s good, but how do we break out the paragraphs from the references?

Look closer and you will see a run of blank lines between them. Reviewing the files, this turns out to be 3-4 blank lines in succession. The rules are starting to get a bit more challenging, but nothing we can’t handle!
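Before writing the full parser, a quick awk one-liner can confirm how workable these markers are. This is only a sanity-check sketch; the page file name is a made-up example, and the real parsing lives in the script in the appendix. If the layout holds, each page should typically report 3 marker lines (one “IntraText - Text” plus two “Previous - Next”):

    # count section-marker lines in one page
    awk '/IntraText - Text|Previous - Next/ { n++ } END { print FILENAME, n + 0 }' ../../data/CCC/page_0101.txt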

For paragraphs, take note of the leading numeric identifier. That will need to be captured and separated out. Any other numbers within the text indicate a reference citation; those will need to be separated out as well. *Note: not all paragraphs have an identifier. These will be defaulted to -1.

For references, the pattern is consistent: an identifier followed by the cited material. Sometimes there are multiple citations in one reference, concatenated with semicolons. I would caution against splitting on the semicolon at this stage; it is not a guaranteed delimiter, so that analysis might need to be done further down the line.
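To make the rules concrete, here are two made-up lines in the style of the source pages (the wording and numbers are invented for illustration). In the paragraph line, the leading “1234” is the paragraph identifier and the trailing “12” is an in-text citation number; in the reference line, “12” is the reference identifier and the semicolon joins two citations:

    1234 Some paragraph text that ends with a citation.12
    12 First cited work 3, 5; second cited work 1-2.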

Finally, the page footer comes at the end. That’s it for the section parsing. 

With the components defined, we could just build out the solution now, but one question is worth asking first: is there anything that can be removed as unnecessary?

Cleansing

For cleansing, it’s possible to get stuck in a never-ending set of rules. This is going to be simpler here because the CCC is essentially a reference book. We are using it as a book of record and we are not going to change the text, so we will not be looking for items such as misspellings.

We do have a few items for cleansing. The page header and page footer are not needed. We already understand this is the CCC and that there is a copyright (which will be honored in any publications), so neither is needed for analysis.

The blank lines will be removed too. That will be done after their use for parsing is complete. I’d rather save the space even though storage is relatively cheap (except for Apple Products).

Finally, non-printable characters will be removed. This is not easy and might require manual editing as needed. Yes, it is laborious, but it is essential to get done.
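One way to at least locate the lines that need attention is to scan for anything outside the printable character class. This is only a sketch; the character classes and locale handling may need adjusting for these particular files:

    # list the file and line number of any line containing non-printable characters
    LC_ALL=C awk '/[^[:print:][:blank:]]/ { print FILENAME ": " FNR }' ../../data/CCC/*.txt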

Traceability

If we accept that there are no technical hurdles for parsing and cleansing, it should be a simple matter of putting the data into a table. Still, I’ll hold up a caution flag and say the word developers understand all too well: “debugging.” What if something goes wrong and we cannot trace it back? We will need some method of getting back to the original source.

For each row, I will require:

  • The page name this row comes from.
  • The line number this row is at in the page.
  • The sequential line number within the section (a separate counter each for headers, paragraphs, and references).

These should give us the necessary traceability to get back to the right area in the source. If there is a problem in processing, we can follow multiple debugging techniques.
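For example, if a staged row claims to come from page line 42 of a particular source page, that line can be pulled straight from the original file and compared against the row. The file name below is a made-up stand-in for whatever page the row’s File_Name points to:

    # print line 42 of the original page file for comparison against the staged row
    awk 'NR == 42 { print; exit }' ../../data/CCC/page_0101.txt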

Cleansed File Structure

The resulting files will have the following field layout:

Headers (file name will be: “CCC.” followed by extract number, “.h.txt”)

  • File_Name
  • Page_Line_Number
  • Header_Line_Number
  • Header_Text

Paragraphs (file name will be: “CCC.” followed by extract number, “.p.txt”)

  • File_Name
  • Page_Line_Number
  • Paragraph_Line_Number
  • Paragraph_Id
  • Paragraph_Text
  • References

References (file name will be: “CCC.” followed by extract number, “.r.txt”)

  • File_Name
  • Page_Line_Number
  • Reference_Line_Number
  • Reference_Id
  • Reference_Text
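As a quick illustration of the finished form, here is one made-up record for each file type. The fields are joined with a pipe delimiter (more on that choice in the build and validation sections); every value shown here is invented and the spacing is approximate:

    CCC.0101|12|3|Chapter heading text
    CCC.0101|42|5|1234|Some paragraph text that ends with a citation.12|12
    CCC.0101|97|1|12|First cited work 3, 5; second cited work 1-2.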

I will rely on these rules and file layouts as the design documentation needed to build the parsing and cleansing process. A formal source-to-target mapping (STTM) at this point could be a bit confusing; that effort will be done with the database ingestion process. We now have enough material and direction to start building.

Building

Given that the rules are customized and specific, I decided not to go with a pre-built tool. Such tools can be very useful in enterprise environments, but in this circumstance a general-purpose programming language with text parsing, such as Python, is more flexible. After much thought, I went even simpler and decided upon our friend awk.

It is built into many OSes, including macOS, and performs extremely well. If you understand regular expression syntax, this is the tool to use for text parsing and organization. All that is needed is a text editor and an open command prompt. Pretty IDEs are nice too, but not always a given.

The resulting awk script is in the appendix. This did not take long at all; the analysis is the hardest part. Coding it reminded me of how the original efforts at Bell Labs [2] are still useful today.

As a final step, I gathered together the created files into 3 master files:

  • CCC.all.h.txt for all of the extracted headers.
  • CCC.all.p.txt for all of the extracted paragraphs.
  • CCC.all.r.txt for all of the extracted references.

This makes the import step easier, with one file for each table. Otherwise, a separate script would need to be created to process the multitude of files.
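The gathering step itself is simple concatenation. A minimal sketch, assuming the per-page outputs are still in the ./out/ directory the awk script writes to and the master files are created in the working directory:

    # combine the per-page extracts into one master file per record type
    cat ./out/CCC.*.h.txt > CCC.all.h.txt
    cat ./out/CCC.*.p.txt > CCC.all.p.txt
    cat ./out/CCC.*.r.txt > CCC.all.r.txt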

Validation

Before staging into the database, some validation effort needs to be run. This is going to be tricky. The reason is that there has been some cleansing of blanks and redundant text, along with additions of delimiters (pipe is chosen), line numbers, and identifiers. This all means that counts will not be an exact match. Going through and visually spot-checking many files, it certainly does look like the process worked. However, let’s see if we can come up with some sort of guesstimate.

We might be able to estimate a value using the previously determined number of lines as a base. Here are some of the metrics previously captured:

  • # of files: 374
  • # of lines: 26153
  • # of words: 275089

Using the number of lines as a primary metric, let’s determine what is removed during the cleansing process.

Blank lines: 

  • 6 before header
  • 4 after header/before paragraph
  • 4 after paragraph/before references
  • 2 in-between references (the number of references per file is variable). Running some quick statistics on the files, the average turns out to be 12.5 references per page. This seems very high, so I will guesstimate 10 references per page as a good measure. That gives roughly 20 blank lines.
  • 5 after references

Total: ~39 lines per file

Removed text:

  • 2 lines containing “Previous - Next”
  • 1 line containing “IntraText - Text”
  • 1 line containing “Help”
  • 1 line for CCC Title
  • 1 line for copyright notice

Total: ~6 lines per file

With 374 files and ~45 lines removed per file, that is 16830 lines. That brings the estimated line count down to ~9323 lines.
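A quick way to produce the actual counts is a line count over the three master files (assuming they sit in the working directory):

    # count lines in each master file, plus a grand total
    wc -l CCC.all.h.txt CCC.all.p.txt CCC.all.r.txt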

Running a line count on the different files, we have:

  • CCC.all.h.txt: 1705 lines
  • CCC.all.p.txt: 4507 lines
  • CCC.all.r.txt: 3684 lines

Total: 9896 lines

That’s a difference of 573 lines, probably due to the variability in the number of references mentioned earlier. I would say that is good for a guesstimate. Still, more manual spot-checking was done to look for duplicates, truncated lines, etc. With that complete, I feel confident that the process is working as expected.

*Note: the URLs file from the data collection stage is 374 lines.

Defining the Staging Area

This might seem like cheating, but I decided to design the staging tables to match the processed source material.  There are 4 tables:

  • CCC_Pages_Stage_T (*Note: this is just the URL source file from the previous article on pulling the data)
  • CCC_Headers_Stage_T
  • CCC_Paragraphs_Stage_T
  • CCC_References_Stage_T

Using the ERD Tool in pgAdmin, I put together the table definitions.

There are no foreign keys between these tables. I also decided against setting primary keys. This is purely for simplicity’s sake; one could certainly be defined for each table. The DDL is in the appendix and should remain stable for the rest of the project.

Staging the Data

A source-to-target mapping is still needed for clarity. It is important for blueprinting purposes even though the design is 1:1 between source file and target table. Here are the resulting maps:

Source File                        | Source Field Position | Transformation | Target Database | Target Table           | Target Field
extracted_urls.vatican.ccc.all.txt | 1                     | None           | AllRoads        | CCC_Pages_Stage_T      | File_Name
extracted_urls.vatican.ccc.all.txt | 2                     | None           | AllRoads        | CCC_Pages_Stage_T      | Title
extracted_urls.vatican.ccc.all.txt | 3                     | None           | AllRoads        | CCC_Pages_Stage_T      | URL
STTM for Staging Pages

Source File                        | Source Field Position | Transformation | Target Database | Target Table           | Target Field
CCC.all.h.txt                      | 1                     | None           | AllRoads        | CCC_Header_Stage_T     | File_Name
CCC.all.h.txt                      | 2                     | None           | AllRoads        | CCC_Header_Stage_T     | Page_Line_Number
CCC.all.h.txt                      | 3                     | None           | AllRoads        | CCC_Header_Stage_T     | Header_Line_Number
CCC.all.h.txt                      | 4                     | None           | AllRoads        | CCC_Header_Stage_T     | Header_Text
STTM for Staging Headers

Source File                        | Source Field Position | Transformation | Target Database | Target Table           | Target Field
CCC.all.p.txt                      | 1                     | None           | AllRoads        | CCC_Paragraphs_Stage_T | File_Name
CCC.all.p.txt                      | 2                     | None           | AllRoads        | CCC_Paragraphs_Stage_T | Page_Line_Number
CCC.all.p.txt                      | 3                     | None           | AllRoads        | CCC_Paragraphs_Stage_T | Paragraph_Line_Number
CCC.all.p.txt                      | 4                     | None           | AllRoads        | CCC_Paragraphs_Stage_T | Paragraph_Id
CCC.all.p.txt                      | 5                     | None           | AllRoads        | CCC_Paragraphs_Stage_T | Paragraph_Text
CCC.all.p.txt                      | 6                     | None           | AllRoads        | CCC_Paragraphs_Stage_T | References
STTM for Staging Paragraphs

Source File                        | Source Field Position | Transformation | Target Database | Target Table           | Target Field
CCC.all.r.txt                      | 1                     | None           | AllRoads        | CCC_References_Stage_T | File_Name
CCC.all.r.txt                      | 2                     | None           | AllRoads        | CCC_References_Stage_T | Page_Line_Number
CCC.all.r.txt                      | 3                     | None           | AllRoads        | CCC_References_Stage_T | Reference_Line_Number
CCC.all.r.txt                      | 4                     | None           | AllRoads        | CCC_References_Stage_T | Reference_Id
CCC.all.r.txt                      | 5                     | None           | AllRoads        | CCC_References_Stage_T | Reference_Text
STTM for Staging References

As you can see, an STTM can be flexible and easy to do. Next up, let’s stage these files! This is where having a great database such as PostgreSQL [3] is essential. Loading these data is a matter of using pgAdmin’s import utility.
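For anyone who prefers the command line over pgAdmin’s import dialog, the same load can be sketched with psql’s \copy. This is only an illustration: the database name is an assumption, and the format options may need adjusting if the cleansed text contains characters that collide with them.

    # load the paragraphs master file into its staging table (pipe-delimited, no header row)
    psql -d AllRoads -c "\copy \"AllRoads\".\"CCC_Paragraphs_Stage_T\" FROM 'CCC.all.p.txt' WITH (FORMAT text, DELIMITER '|')"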

Let’s take a look at the results to make sure.

Validation Again

This validation step should be exact; the record counts should match the master files 1:1. The results are:

  • CCC_Pages_Stage_T: 374 records
  • CCC_Headers_Stage_T: 1705 records
  • CCC_Paragraphs_Stage_T: 4507 records
  • CCC_References_Stage_T: 3684 records
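These counts can be pulled with simple COUNT(*) queries against the staging tables. A sketch for one of them, again assuming the database name:

    # record count for the paragraphs staging table
    psql -d AllRoads -c 'SELECT COUNT(*) FROM "AllRoads"."CCC_Paragraphs_Stage_T";'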

That lines up as expected. We could do a similar exercise with a word count. I am going to trust my visual spot check of the resulting tables. All seems right in its place. This completes the staging efforts for the CCC.

Next Steps

As previously suggested, staging data is an important step in the ingestion process. It is not always a direct copy, but it always requires proper analysis. The end results show that there are now usable data. I can now move forward with loading records into the final structure. I will also take care of the abbreviation list as a straight load. After that is complete, I will be able to add on other texts and get to some research!

Thank you

Appendix

DDL Script

BEGIN;


CREATE TABLE IF NOT EXISTS "AllRoads"."CCC_Header_Stage_T"
(
    "File_Name" character varying(1000) NOT NULL,
    "Page_Line_Number" bigint NOT NULL,
    "Header_Line_Number" bigint NOT NULL,
    "Header_Text" text NOT NULL
);

CREATE TABLE IF NOT EXISTS "AllRoads"."CCC_Paragraphs_Stage_T"
(
    "File_Name" character varying(1000) NOT NULL,
    "Page_Line_Number" bigint NOT NULL,
    "Paragraph_Line_Number" bigint NOT NULL,
    "Paragraph_Id" bigint NOT NULL,
    "Paragraph_Text" text NOT NULL,
    "References" text
);

CREATE TABLE IF NOT EXISTS "AllRoads"."CCC_References_Stage_T"
(
    "File_Name" character varying(1000) NOT NULL,
    "Page_Line_Number" bigint NOT NULL,
    "Reference_Line_Number" bigint NOT NULL,
    "Reference_Id" text NOT NULL,
    "Reference_Text" text NOT NULL
);

CREATE TABLE IF NOT EXISTS "AllRoads"."CCC_Pages_Stage_T"
(
    "File_Name" character varying(1000) NOT NULL,
    "Title" character varying(1000) NOT NULL,
    "URL" character varying(2048) NOT NULL
);
END;

awk Script (named: ccc_extract_info.awk)

# Date: 02-APR-2024
# Author: Andy Della Vecchia
# Purpose: Parse out paragraph references from source files

BEGIN {

  # these are used to track where the read is in the file
  section_number = 0
  blank_line_count = 0
  subsection_number = 0
  page_line_number = 0
  h_line_number = 0
  p_line_number = 0
  r_line_number = 0
  
  # get the file name being processed: split the input path on "." and
  # use the piece just before the extension as the extract number
  split(ARGV[1], file_pieces, ".")
  fn = "CCC." file_pieces[length(file_pieces)-1]

  # build out the individual file names (h=header; p=paragraph; r=references)
  out_path = "./out/"
  header_file = out_path fn ".h.txt"
  paragraph_file = out_path fn ".p.txt"
  reference_file = out_path fn ".r.txt"
}

{

  # track line number for debugging purposes
  page_line_number++

  # a section marker line: either "Previous - Next" or "IntraText - Text"
  if ($0 ~ /Previous - Next/ || $0 ~ /IntraText - Text/) {

    # track the section
    section_number++

  } else {

    # section 1 is the header section
    if (section_number == 1 && NF != 0) {

      # 1 - File Name
      # 2 - Page Line # (sequential for each page)
      # 3 - Header Line # (sequential for each header line)
      # 4 - Header Text

      h_line_number++
      gsub(/[^a-zA-Z0-9 ]/, "")
      print fn "|" page_line_number "|" h_line_number "|" $0 > header_file

    # section 2 is the main section with the paragraphs and references
    } else if (section_number == 2) {

      # if not a blank line or a very short piece of text
      if (NF != 0 && length($0) > 2) {

        # subsection 1 is the paragraphs
        if (subsection_number == 1) {

          # split the line
          # 1 - File Name
          # 2 - Page Line # (sequential for each page)
          # 3 - Paragraph Line # (sequential for each paragraph)
          # 4 - Paragraph Id (-1 if one doesn't exist)
          # 5 - Text (including reference #s)
          # 6 - References - comma delimited

          p_line_number++
          ref_nums = ""

          # if the first field has an id, separate it out; otherwise, use -1
          if ($1 ~ /[0-9]+/ && !($1 ~ /[A-Za-z+]/)) {
            gsub(/[.]/, "", $1)
            $1 = fn "|" page_line_number "|" p_line_number "|" $1 "|"
            p_id_exists = 0

          } else {
            $1 = fn "|" page_line_number "|" p_line_number "|" "-1" "|" $1
            p_id_exists = 1
          }

          # find any numeric values and pull them out as reference ids
          for (i = 2 + p_id_exists; i <= NF; i++) {

            if ($i ~ /[0-9]+/) {

              bkp_fld = $i
              gsub(/[^0-9+]/, "", $i)

              if (ref_nums != "") {
                ref_nums = ref_nums "," $i
              } else {
                ref_nums = $i
              }

              # make sure the original field is restored
              $i = bkp_fld
            }
          }

          # add on the reference numbers and print out to file
          $0 = $0 "|" ref_nums
          print $0 > paragraph_file

        # subsection 2 is where the references are stored
        } else if (subsection_number == 2) {

          # 1 - File Name
          # 2 - Page Line #
          # 3 - Reference Line #
          # 4 - Reference Id from text
          # 5 - Reference Text

          r_line_number++
          $1 = fn "|" page_line_number "|" r_line_number "|" $1 "|"
          print $0 > reference_file

        }

        # a non-blank line resets the blank line counter
        blank_line_count = 0

      # blank (or trivially short) line condition
      } else {

        blank_line_count++

        # 3 or more blank lines in a row mark the start of the next subsection
        if (blank_line_count >= 3) {
          subsection_number++
          blank_line_count = 0
        }
      }

    }

  }

}

Shell Script to Run awk Process (named: run_ccc_extract.sh)

for in_file in ../../data/CCC/*.txt; do
    echo "$in_file"
    awk -f ccc_extract_info.awk "$in_file"
done

References

[1] Vatican. Catechism of the Catholic Church. Retrieved from: https://www.vatican.va/archive/ENG0015/_INDEX.HTM

[2] Free Software Foundation. History of awk and gawk. Retrieved from: https://www.gnu.org/software/gawk/manual/html_node/History.html

[3] The PostgreSQL Global Development Group. PostgreSQL: The World’s Most Advanced Open Source Relational Database. Retrieved from: https://www.postgresql.org/