The All Roads Project – Loading the CCC Data from Staging Area

Introduction

In the previous article related to this project, we were able to take the parsed text data from the Catechism of the Catholic Church (CCC)¹ and ingest into a set of staging tables. Now that this is complete, it is on to loading the data into the landing set of tables. I expect this to have some surprises.

I hope at this point, the reality of analytics work is becoming clear. Any project requiring data in an organized fashion demands time. As you become more familiar with these processes, you will become quicker at it yet the work still must be done. Shortcuts are not recommended.

Approach

For this project, I decided not to use tools. This is old school and has rough edges. At the same time, it is also practical. Knowing that these loads are one and done in their approach makes it direct. Sometimes setting up a tool is a project onto itself. Using built-in functionality can be a better choice.

If you read other articles on ETL processes, you know there are some transformations that will occur. These are usually done on during the load step. The important difference is that the addition of Ids and relationships are critical. Referential integrity building is not always straightforward.

Fortunately, the previous steps have helped prepare the entirety of data. We will have to consider how the individual records will meet the data architecture. This is where a good set of Source To Target Mappings (STTM) helps drastically. This will keep the thoughts from getting muddied as development happens.

Special Cases

Before going further, I will talk about the special cases. Texts_T is the first one. This is a very straightforward single record but important. This is the root of the hierarchy. All pages tie into this single text record. Since it is so simplistic, a quick single row insert will be used.

Next are the chapters. The layout for the CCC data was a bit more challenging than expected. For now, it is in a pause state. Even though it is an optional component, I will revisit this organization effort at another date.

The other special case is the abbreviations load. The source file is loaded directly using PostgreSQL’s pgAdmin. There is nothing fancy going on here so why complicate it.

Additions

When designing the load process, I had to go back and add-in an additional field. This is important to note as no design remains concrete. In this case, the source File_Name (data type set to varchar 1000) is added to Pages_T. This is needed to get the Page_Id for the other loads.

I also considered adding in Paragraph_Id to References_T. This seemed like a good way to tie the 2 together at first. After thinking about the order of the loads, I decided to avoid the possibility of circular references downstream.

Design

The design is primarily held in the following STTMs. Wherever necessary, I will add more detail. The resulting Data Manipulation Language (DML) is in the appendix section.

Reject Records

In a project where the process is executed on a schedule, we would need to be concerned with capturing the records that do not load according to the rules. They would be stored in a separate set of tables, typically called the reject tables, for further analysis. The load process would need to take that into consideration at each and every execution.

In this project, the loads are single execution. Any updates to the text will be done in a separate process whenever the next version comes out.

Order of loads

There needs to be an organized process for these loads due to Id dependency. The Page_Id is required by paragraphs and references so Pages must come before them. Also, Text_Id is required so that must come before all of them. The order of load will be:

Texts_T
Pages_T
Paragraphs_T
References_T

Load to Texts_T

As stated earlier, this table is a single record. This is to be entered manually.

Source Table	Source Field	Data Type	Transformation	Target Table	Target Field	Date Type
			Set to “1”	Texts_T	Text_Id	integer
			Set to “Catechism of the Catholic Church”	Texts_T	Title	character varying (1000)
			Set to “Catechism of the Catholic Church”	Texts_T	Description	text
			Set to “https://www.vatican.va/archive/ENG0015/_INDEX.HTM“	Texts_T	URL	character varying (2048)
			Set to NULL	Texts_T	Group	character varying (50)
			Set to “Religious Reference”	Texts_T	Type	character varying (50)
			Set to CURRENT_DATE	Texts_T	Audit_Create_Date	timestamp
			Set to “Andy”	Texts_T	Audit_Create_User	character varying (100)
			N/A	Texts_T	Audit_Update_Date	timestamp
			N/A	Texts_T	Audit_Update_User	character varying (100)
*Note: Single record insert

STTM for Loading Texts_T

Load to Pages_T

The pages load comes directly from the staging area without any joins. The item to remember is to use the Text_Id of 1 for CCC. For Page_Id, remember that the serial generation feature of PostgreSQL has been used. This is going to be the same for the remaining tables.

Source Table	Source Field	Data Type	Transformation	Target Table	Target Field	Date Type
			System Generated ID	Pages_T	Page_Id	integer
			Set to ROW_NUMBER() by sorting on the File_Name.	Pages_T	Page_Sequence_Id	integer
CCC_Pages_Stage_T	TItle	character varying (1000)		Pages_T	Title	character varying (1000)
			Set to NULL	Pages_T	Description	text
CCC_Pages_Stage_T	URL	character varying (2048)		Pages_T	URL	character varying (2048)
CCC_Pages_Stage_T	File_Name	character varying (1000)		Pages_T	File_Name	character varying (1000)
			Set to 1 for “CCC” Text_Id	Pages_T	Text_Id	integer
			Set to NULL	Pages_T	Chapter_Id	integer
			Set to CURRENT_DATE	Pages_T	Audit_Create_Date	timestamp
			Set to “Andy”	Pages_T	Audit_Create_User	character varying (100)
			N/A	Pages_T	Audit_Update_Date	timestamp
			N/A	Pages_T	Audit_Update_User	character varying (100)

STTM for Loading Pages_T

Load to Paragraphs_T

The paragraphs load is more than a straight load. It requires a join back to Pages_T. This is necessary to get the Page_Id for the page the paragraph is on. Text_Id can be manually set. For clarity, sorting is down to organize the order stored in the table.

There is also the conversion of the reference Ids from a delimited string to JSON format. This is where PostgreSQL’s functions come into play. It is simple matter of calling the function after parsing out the delimited string.

Source Table	Source Field	Data Type	Transformation	Target Table	Target Field	Date Type
			System Generated ID	Paragraphs_T	Paragraph_Id	integer
CCC_Paragraphs_Stage_T	Paragraph_Id	bigint	When Paragraph_Id = ‘-1’ then set to NULL	Paragraphs_T	Paragraph_Sequence_Id	character varying (1000)
CCC_Paragraphs_Stage_T	Paragraph_Text	text		Paragraphs_T	Paragraph_Text	text
CCC_Paragraphs_Stage_T	URL	character varying (2048)		Paragraphs_T	URL	character varying (2048)
CCC_Paragraphs_Stage_T	References	text	Convert comma delimited string to json using to_json	Paragraphs_T	Reference_Sequence_Ids	json
			Set to 1 for “CCC” Text_Id	Paragraphs_T	Text_Id	integer
CCC_Paragraphs_Stage_T	File_Name	character varying (1000)	Lookup Page_Id from Pages_T by joining to the number portion of the File_Name	Paragraphs_T	Page_Id	integer
			Set to CURRENT_DATE	Paragraphs_T	Audit_Create_Date	timestamp
			Set to “Andy”	Paragraphs_T	Audit_Create_User	character varying (100)
			N/A	Paragraphs_T	Audit_Update_Date	timestamp
			N/A	Paragraphs_T	Audit_Update_User	character varying (100)
*NOTE: Sort by Paragraph_Line_Number in an ascending order

STTM for Loading Paragraphs_T

Load to References_T

The references load is similar to the paragraphs load with Page_Id and Text_Id. A sort is also completed to assist in organization.

For future thought, there should be a better use of Reference_Id in the data architecture. This might assist in join speed.

Source Table	Source Field	Data Type	Transformation	Target Table	Target Field	Date Type
			System Generated ID	References_T	Reference_Id	integer
CCC_References_Stage_T	Reference_Id	text		References_T	Reference_Sequence_Id	character varying (1000)
CCC_References_Stage_T	Reference_Text	text		References_T	Reference_Text	text
			Set to NULL	References_T	URL	character varying (2048)
CCC_References_Stage_T	File_Name	character varying (1000)	Lookup Page_Id from Pages_T by joining to the number portion of the File_Name	References_T	Page_Id	integer
			Set to 1 for “CCC” Text_Id	References_T	Text_Id	integer
			Set to CURRENT_DATE	References_T	Audit_Create_Date	timestamp
			Set to “Andy”	References_T	Audit_Create_User	character varying (100)
			N/A	References_T	Audit_Update_Date	timestamp
			N/A	References_T	Audit_Update_User	character varying (100)
*Note: Sort by Page_Id and Reference_Line_Number in an ascending order

STTM for Loading References_T

Testing Results

Always remember to do some sort of data validation! In this phase, we will rely on record counts as well as checking join conditions.

Record Counts

Since there are no aggregations being done, there should be a 1:1 between source and target tables.

Source	Count	Target	Count
CCC_Pages_Stage_T	374	Pages_T	374
CCC_Paragraphs_Stage_T	4507	Paragraphs_T	4507
CCC_References_Stage_T	3684	References_T	3684

Record Count Check

The numbers line up 1:1 so we can move on to the next test.

Orphaned Joins

An orphan is a record that has no parent record. In some tables, this is not needed. In this project, there need to be joins back to the Pages_T and Text_T tables. This requires testing the joins.

Test for Text_Id is simply making sure the value of 1 is in all records.

SELECT count(*) FROM "AllRoads"."Pages_T" where "Text_Id" <> 1

0 records

SELECT * FROM "AllRoads"."Paragraphs_T" where "Text_Id" <> 1

0 records

SELECT count(*) FROM "AllRoads"."References_T" where "Text_Id" <> 1

0 records

All Text_Ids are set to 1. Again, simple since we have only 1 text.

Testing for Page_Id is more complicated. It needs to make sure that the Id exists and that is as expected. To do this, we should do some selects checking for missing joins. Starting with Paragraphs, check for any missing joins to Pages_T.

SELECT count(*) 
  FROM "AllRoads"."Paragraphs_T" pt
     left outer join
  "AllRoads"."Pages_T" pgt
  on (pt."Page_Id" = pgt."Page_Id")
 where pt."Page_Id" is null

0 records

That’s good. There are no missing joins. Let’s double check the authenticity by making sure there are no fallout records when joining.

SELECT count(*) 
  FROM "AllRoads"."Paragraphs_T" pt
     inner join
  "AllRoads"."Pages_T" pgt
  on (pt."Page_Id" = pgt.”Page_Id")

4507 records

That’s also good. Let’s repeat for the References_T

SELECT count(*) 
  FROM "AllRoads"."References_T" rt
     left outer join
  "AllRoads"."Pages_T" pt
  on (rt."Page_Id" = pt."Page_Id")
 where pt."Page_Id" is null

0 records

SELECT count(*) 
  FROM "AllRoads"."References_T" rt
     inner join
  "AllRoads"."Pages_T" pt
  on (rt."Page_Id" = pt."Page_Id")

3684 records

These may seem like basic tests but they are essential, as they prove the joins are working.

Spot check

As one final precaution, let’s do some joins and spot check.

SELECT tt."Title", pt."Title", rt."Reference_Text"
  FROM "AllRoads"."References_T" rt
     inner join
  "AllRoads"."Pages_T" pt
  on (rt."Page_Id" = pt."Page_Id")
  inner join
  "AllRoads"."Texts_T" tt
  on (rt."Text_Id" = tt.”Text_Id")

Here are the results (5 rows only shown).

Catechism of the Catholic Church	PROLOGUE	Jn 17 3
Catechism of the Catholic Church	PROLOGUE	1 Tim 2:3-4.
Catechism of the Catholic Church	PROLOGUE	Acts 4:12
Catechism of the Catholic Church	I. The life of man – to know and love God	Mt 28:19-20
Catechism of the Catholic Church	I. The life of man – to know and love God	Mk 16:20”

Query Results

This is as expected. Now, let’s put it all together with Pages, Paragraphs, and References.

select p."Page_Title"
       ,p."Paragraph_Sequence_Id"
  ,rt."Reference_Text"
from (
select pgt."Title" as "Page_Title"
  ,pgt."Page_Id"
  ,pt."Paragraph_Sequence_Id"
,json_array_elements_text(pt."Reference_Sequence_Ids")::int::text as "Reference_Sequence_Id"
  from "AllRoads"."Paragraphs_T" pt
  inner join
  "AllRoads"."Pages_T" pgt
  on (pt."Page_Id" = pgt."Page_Id")
) p
inner join
"AllRoads"."References_T" rt
on (p."Reference_Sequence_Id" = rt."Reference_Sequence_Id"
    and p."Page_Id" = rt."Page_Id")
order by p."Page_Id", p."Paragraph_Sequence_Id", rt.”Reference_Sequence_Id"

Here are the results (5 rows only shown).

I. The life of man – to know and love God	2	Mt 28:19-20
I. The life of man – to know and love God	2	Mk 16:20
I. The life of man – to know and love God	3	Cf. Acts 2:42
II. Handing on the Faith: Catechesis	10	Extraordinary Synod of Bishops 1985,. Final Report II B a, 4.
II. Handing on the Faith: Catechesis	10	John Paul II, Discourse at the Closing of the Extraordinary Synod of Bishops 7 December 1985: AAS 78, (1986).”

Query Results

That’s saying a lot in the results. That query shows that the architecture works and the results match what is in the CCC. I’m feeling confident at this point that the project is heading in the right direction.

Next Steps

Now that the CCC has been loaded into the database, it’s on to the Bible. I can see the challenge with this text being related to the many books contained within it. There are numerous files to handle. I tried to prepare them into a useable format but let’s see where it leads.

Thank you

Appendix

DML

Load Pages_T from CCC_Pages_Stage_T

INSERT INTO "AllRoads"."Pages_T"
(
"Page_Sequence_Id"
,"Title"
,"Description"
,"URL"
,"File_Name"
,"Text_Id"
,"Chapter_Id"
,"Audit_Create_Date"
,"Audit_Create_User"
)
SELECT ROW_NUMBER() OVER(ORDER BY CAST("File_Name" as INTEGER)) AS Page_Sequence_Id
       ,"Title"
  ,NULL as Description
  ,"URL"
  ,"File_Name"
  ,1 as Text_Id
  ,NULL as Chapter_Id
  ,CURRENT_DATE as Audit_Create_Date
  ,'Andy' as Audit_Create_User
  FROM "AllRoads"."CCC_Pages_Stage_T"

Load Paragraphs_T from CCC_Paragraphs_Stage_T

INSERT INTO "AllRoads"."Paragraphs_T"
(
"Paragraph_Sequence_Id"
,"Paragraph_Text"
,"URL"
,"Reference_Sequence_Ids"
,"Text_Id"
,"Page_Id"
,"Audit_Create_Date"
,"Audit_Create_User"
)
SELECT case 
when cpst."Paragraph_Id" = '-1' then null
else cpst."Paragraph_Id"
end as Paragraph_Sequence_Id
  ,"Paragraph_Text"
  ,"URL"
  ,to_json(string_to_array("References",',')) as Reference_Sequence_Ids
  ,1 as Text_Id
  ,pt."Page_Id"
  ,CURRENT_DATE AS Audit_Create_Date
  ,'Andy' as Audit_Create_User
  FROM "AllRoads"."CCC_Paragraphs_Stage_T" as cpst
       INNER JOIN
  "AllRoads"."Pages_T" as pt on (cpst."File_Name" = 'CCC.' || pt."File_Name")
ORDER BY pt."Page_Id", cpst.”Paragraph_Line_Number"

Load References_T from CCC_References_Stage_T

INSERT INTO "AllRoads"."References_T"
(
"Reference_Sequence_Id"
,"Reference_Text"
,"Page_Id"
,"Text_Id"
,"Audit_Create_Date"
,"Audit_Create_User"
)
SELECT "Reference_Id" as Reference_Sequence_Id
       ,COALESCE("Reference_Text", 'No Reference Text') as Reference_Text
  ,pt."Page_Id"
  ,1 as Text_Id
  ,CURRENT_DATE AS Audit_Create_Date
  ,'Andy' as Audit_Create_User
  FROM "AllRoads"."CCC_References_Stage_T" as crst
       INNER JOIN
  "AllRoads"."Pages_T" as pt on (crst."File_Name" = 'CCC.' || pt."File_Name")
ORDER BY pt."Page_Id"
      ,crst.”Reference_Line_Number"

References

[1] Vatican. Catechism of the Catholic Church. Retrieved from: https://www.vatican.va/archive/ENG0015/_INDEX.HTM

The All Roads Project – Loading the CCC Data from Staging Area

Share this: