Andy’s Neural Works

A wise folly can lead to great things!


The All Roads Project – Loading the CCC Data from Staging Area

Published by

on

Introduction

In the previous article related to this project, we were able to take the parsed text data from the Catechism of the Catholic Church (CCC)1 and ingest into a set of staging tables. Now that this is complete, it is on to loading the data into the landing set of tables. I expect this to have some surprises.

I hope at this point, the reality of analytics work is becoming clear. Any project requiring data in an organized fashion demands time. As you become more familiar with these processes, you will become quicker at it yet the work still must be done. Shortcuts are not recommended.

Approach

For this project, I decided not to use tools. This is old school and has rough edges. At the same time, it is also practical. Knowing that these loads are one and done in their approach makes it direct. Sometimes setting up a tool is a project onto itself. Using built-in functionality can be a better choice.

If you read other articles on ETL processes, you know there are some transformations that will occur. These are usually done on during the load step. The important difference is that the addition of Ids and relationships are critical. Referential integrity building is not always straightforward.

Fortunately, the previous steps have helped prepare the entirety of data. We will have to consider how the individual records will meet the data architecture. This is where a good set of Source To Target Mappings (STTM) helps drastically. This will keep the thoughts from getting muddied as development happens.

Special Cases

Before going further, I will talk about the special cases. Texts_T is the first one. This is a very straightforward single record but important. This is the root of the hierarchy. All pages tie into this single text record. Since it is so simplistic, a quick single row insert will be used.

Next are the chapters. The layout for the CCC data was a bit more challenging than expected. For now, it is in a pause state. Even though it is an optional component, I will revisit this organization effort at another date.

The other special case is the abbreviations load. The source file is loaded directly using PostgreSQL’s pgAdmin. There is nothing fancy going on here so why complicate it.

Additions

When designing the load process, I had to go back and add-in an additional field. This is important to note as no design remains concrete. In this case, the source File_Name (data type set to varchar 1000) is added to Pages_T. This is needed to get the Page_Id for the other loads.

I also considered adding in Paragraph_Id to References_T. This seemed like a good way to tie the 2 together at first. After thinking about the order of the loads, I decided to avoid the possibility of circular references downstream.

Design

The design is primarily held in the following STTMs. Wherever necessary, I will add more detail. The resulting Data Manipulation Language (DML) is in the appendix section.

Reject Records

In a project where the process is executed on a schedule, we would need to be concerned with capturing the records that do not load according to the rules. They would be stored in a separate set of tables, typically called the reject tables, for further analysis. The load process would need to take that into consideration at each and every execution.

In this project, the loads are single execution. Any updates to the text will be done in a separate process whenever the next version comes out.

Order of loads

There needs to be an organized process for these loads due to Id dependency. The Page_Id is required by paragraphs and references so Pages must come before them. Also, Text_Id is required so that must come before all of them. The order of load will be:

  1. Texts_T
  2. Pages_T
  3. Paragraphs_T
  4. References_T

Load to Texts_T

As stated earlier, this table is a single record. This is to be entered manually.

Source TableSource FieldData TypeTransformationTarget TableTarget FieldDate Type



Set to “1”Texts_TText_Idinteger



Set to “Catechism of the Catholic Church”Texts_TTitlecharacter varying (1000)



Set to “Catechism of the Catholic Church”Texts_TDescriptiontext



Set to “https://www.vatican.va/archive/ENG0015/_INDEX.HTMTexts_TURLcharacter varying (2048)



Set to NULLTexts_TGroupcharacter varying (50)



Set to “Religious Reference”Texts_TTypecharacter varying (50)



Set to CURRENT_DATETexts_TAudit_Create_Datetimestamp



Set to “Andy”Texts_TAudit_Create_Usercharacter varying (100)



N/ATexts_TAudit_Update_Datetimestamp



N/ATexts_TAudit_Update_Usercharacter varying (100)
*Note: Single record insert





STTM for Loading Texts_T

Load to Pages_T

The pages load comes directly from the staging area without any joins. The item to remember is to use the Text_Id of 1 for CCC. For Page_Id, remember that the serial generation feature of PostgreSQL has been used. This is going to be the same for the remaining tables.

Source TableSource FieldData TypeTransformationTarget TableTarget FieldDate Type



System Generated IDPages_TPage_Idinteger



Set to ROW_NUMBER() by sorting on the File_Name.Pages_TPage_Sequence_Idinteger
CCC_Pages_Stage_TTItlecharacter varying (1000)
Pages_TTitlecharacter varying (1000)



Set to NULLPages_TDescriptiontext
CCC_Pages_Stage_TURLcharacter varying (2048)
Pages_TURLcharacter varying (2048)
CCC_Pages_Stage_TFile_Namecharacter varying (1000)
Pages_TFile_Namecharacter varying (1000)



Set to 1 for “CCC” Text_IdPages_TText_Idinteger



Set to NULLPages_TChapter_Idinteger



Set to CURRENT_DATEPages_TAudit_Create_Datetimestamp



Set to “Andy”Pages_TAudit_Create_Usercharacter varying (100)



N/APages_TAudit_Update_Datetimestamp



N/APages_TAudit_Update_Usercharacter varying (100)
STTM for Loading Pages_T

Load to Paragraphs_T

The paragraphs load is more than a straight load. It requires a join back to Pages_T. This is necessary to get the Page_Id for the page the paragraph is on. Text_Id can be manually set. For clarity, sorting is down to organize the order stored in the table.

There is also the conversion of the reference Ids from a delimited string to JSON format. This is where PostgreSQL’s functions come into play. It is simple matter of calling the function after parsing out the delimited string.

Source TableSource FieldData TypeTransformationTarget TableTarget FieldDate Type



System Generated IDParagraphs_TParagraph_Idinteger
CCC_Paragraphs_Stage_TParagraph_IdbigintWhen Paragraph_Id = ‘-1’ then set to NULLParagraphs_TParagraph_Sequence_Idcharacter varying (1000)
CCC_Paragraphs_Stage_TParagraph_Texttext
Paragraphs_TParagraph_Texttext
CCC_Paragraphs_Stage_TURLcharacter varying (2048)
Paragraphs_TURLcharacter varying (2048)
CCC_Paragraphs_Stage_TReferencestextConvert comma delimited string to json using to_jsonParagraphs_TReference_Sequence_Idsjson



Set to 1 for “CCC” Text_IdParagraphs_TText_Idinteger
CCC_Paragraphs_Stage_TFile_Namecharacter varying (1000)Lookup Page_Id from Pages_T by joining to the number portion of the File_NameParagraphs_TPage_Idinteger



Set to CURRENT_DATEParagraphs_TAudit_Create_Datetimestamp



Set to “Andy”Paragraphs_TAudit_Create_Usercharacter varying (100)



N/AParagraphs_TAudit_Update_Datetimestamp



N/AParagraphs_TAudit_Update_Usercharacter varying (100)
*NOTE: Sort by Paragraph_Line_Number in an ascending order





STTM for Loading Paragraphs_T

Load to References_T

The references load is similar to the paragraphs load with Page_Id and Text_Id. A sort is also completed to assist in organization.

For future thought, there should be a better use of Reference_Id in the data architecture. This might assist in join speed.

Source TableSource FieldData TypeTransformationTarget TableTarget FieldDate Type



System Generated IDReferences_TReference_Idinteger
CCC_References_Stage_TReference_Idtext
References_TReference_Sequence_Idcharacter varying (1000)
CCC_References_Stage_TReference_Texttext
References_TReference_Texttext



Set to NULLReferences_TURLcharacter varying (2048)
CCC_References_Stage_TFile_Namecharacter varying (1000)Lookup Page_Id from Pages_T by joining to the number portion of the File_NameReferences_TPage_Idinteger



Set to 1 for “CCC” Text_IdReferences_TText_Idinteger



Set to CURRENT_DATEReferences_TAudit_Create_Datetimestamp



Set to “Andy”References_TAudit_Create_Usercharacter varying (100)



N/AReferences_TAudit_Update_Datetimestamp



N/AReferences_TAudit_Update_Usercharacter varying (100)
*Note: Sort by Page_Id and Reference_Line_Number in an ascending order





STTM for Loading References_T

Testing Results

Always remember to do some sort of data validation! In this phase, we will rely on record counts as well as checking join conditions.

Record Counts

Since there are no aggregations being done, there should be a 1:1 between source and target tables.

SourceCountTargetCount
CCC_Pages_Stage_T374Pages_T374
CCC_Paragraphs_Stage_T4507Paragraphs_T4507
CCC_References_Stage_T3684References_T3684
Record Count Check

The numbers line up 1:1 so we can move on to the next test.

Orphaned Joins

An orphan is a record that has no parent record. In some tables, this is not needed. In this project, there need to be joins back to the Pages_T and Text_T tables. This requires testing the joins.

Test for Text_Id is simply making sure the value of 1 is in all records.

SELECT count(*) FROM "AllRoads"."Pages_T" where "Text_Id" <> 1

0 records

SELECT * FROM "AllRoads"."Paragraphs_T" where "Text_Id" <> 1

0 records

SELECT count(*) FROM "AllRoads"."References_T" where "Text_Id" <> 1

0 records

All Text_Ids are set to 1. Again, simple since we have only 1 text.

Testing for Page_Id is more complicated. It needs to make sure that the Id exists and that is as expected. To do this, we should do some selects checking for missing joins. Starting with Paragraphs, check for any missing joins to Pages_T.

SELECT count(*) 
  FROM "AllRoads"."Paragraphs_T" pt
     left outer join
  "AllRoads"."Pages_T" pgt
  on (pt."Page_Id" = pgt."Page_Id")
 where pt."Page_Id" is null

0 records

That’s good. There are no missing joins. Let’s double check the authenticity by making sure there are no fallout records when joining.

SELECT count(*) 
  FROM "AllRoads"."Paragraphs_T" pt
     inner join
  "AllRoads"."Pages_T" pgt
  on (pt."Page_Id" = pgt.”Page_Id")

4507 records

That’s also good. Let’s repeat for the References_T

SELECT count(*) 
  FROM "AllRoads"."References_T" rt
     left outer join
  "AllRoads"."Pages_T" pt
  on (rt."Page_Id" = pt."Page_Id")
 where pt."Page_Id" is null

0 records

SELECT count(*) 
  FROM "AllRoads"."References_T" rt
     inner join
  "AllRoads"."Pages_T" pt
  on (rt."Page_Id" = pt."Page_Id")

3684 records

These may seem like basic tests but they are essential, as they prove the joins are working.

Spot check

As one final precaution, let’s do some joins and spot check.

SELECT tt."Title", pt."Title", rt."Reference_Text"
  FROM "AllRoads"."References_T" rt
     inner join
  "AllRoads"."Pages_T" pt
  on (rt."Page_Id" = pt."Page_Id")
  inner join
  "AllRoads"."Texts_T" tt
  on (rt."Text_Id" = tt.”Text_Id")

Here are the results (5 rows only shown).

Catechism of the Catholic ChurchPROLOGUEJn 17 3
Catechism of the Catholic ChurchPROLOGUE1 Tim 2:3-4.
Catechism of the Catholic ChurchPROLOGUEActs 4:12
Catechism of the Catholic ChurchI. The life of man – to know and love GodMt 28:19-20
Catechism of the Catholic ChurchI. The life of man – to know and love GodMk 16:20”
Query Results

This is as expected. Now, let’s put it all together with Pages, Paragraphs, and References.

select p."Page_Title"
       ,p."Paragraph_Sequence_Id"
  ,rt."Reference_Text"
from (
select pgt."Title" as "Page_Title"
  ,pgt."Page_Id"
  ,pt."Paragraph_Sequence_Id"
,json_array_elements_text(pt."Reference_Sequence_Ids")::int::text as "Reference_Sequence_Id"
  from "AllRoads"."Paragraphs_T" pt
  inner join
  "AllRoads"."Pages_T" pgt
  on (pt."Page_Id" = pgt."Page_Id")
) p
inner join
"AllRoads"."References_T" rt
on (p."Reference_Sequence_Id" = rt."Reference_Sequence_Id"
    and p."Page_Id" = rt."Page_Id")
order by p."Page_Id", p."Paragraph_Sequence_Id", rt.”Reference_Sequence_Id"

Here are the results (5 rows only shown).

I. The life of man – to know and love God2Mt 28:19-20
I. The life of man – to know and love God2Mk 16:20
I. The life of man – to know and love God3Cf. Acts 2:42
II. Handing on the Faith: Catechesis10Extraordinary Synod of Bishops 1985,. Final Report II B a, 4.
II. Handing on the Faith: Catechesis10John Paul II, Discourse at the Closing of the Extraordinary Synod of Bishops 7 December 1985: AAS 78, (1986).”
Query Results

That’s saying a lot in the results. That query shows that the architecture works and the results match what is in the CCC. I’m feeling confident at this point that the project is heading in the right direction.

Next Steps

Now that the CCC has been loaded into the database, it’s on to the Bible. I can see the challenge with this text being related to the many books contained within it. There are numerous files to handle. I tried to prepare them into a useable format but let’s see where it leads.

Thank you

Appendix

DML

Load Pages_T from CCC_Pages_Stage_T

INSERT INTO "AllRoads"."Pages_T"
(
"Page_Sequence_Id"
,"Title"
,"Description"
,"URL"
,"File_Name"
,"Text_Id"
,"Chapter_Id"
,"Audit_Create_Date"
,"Audit_Create_User"
)
SELECT ROW_NUMBER() OVER(ORDER BY CAST("File_Name" as INTEGER)) AS Page_Sequence_Id
       ,"Title"
  ,NULL as Description
  ,"URL"
  ,"File_Name"
  ,1 as Text_Id
  ,NULL as Chapter_Id
  ,CURRENT_DATE as Audit_Create_Date
  ,'Andy' as Audit_Create_User
  FROM "AllRoads"."CCC_Pages_Stage_T" 

Load Paragraphs_T from CCC_Paragraphs_Stage_T

INSERT INTO "AllRoads"."Paragraphs_T"
(
"Paragraph_Sequence_Id"
,"Paragraph_Text"
,"URL"
,"Reference_Sequence_Ids"
,"Text_Id"
,"Page_Id"
,"Audit_Create_Date"
,"Audit_Create_User"
)
SELECT case 
when cpst."Paragraph_Id" = '-1' then null
else cpst."Paragraph_Id"
end as Paragraph_Sequence_Id
  ,"Paragraph_Text"
  ,"URL"
  ,to_json(string_to_array("References",',')) as Reference_Sequence_Ids
  ,1 as Text_Id
  ,pt."Page_Id"
  ,CURRENT_DATE AS Audit_Create_Date
  ,'Andy' as Audit_Create_User
  FROM "AllRoads"."CCC_Paragraphs_Stage_T" as cpst
       INNER JOIN
  "AllRoads"."Pages_T" as pt on (cpst."File_Name" = 'CCC.' || pt."File_Name")
ORDER BY pt."Page_Id", cpst.”Paragraph_Line_Number"

Load References_T from CCC_References_Stage_T

INSERT INTO "AllRoads"."References_T"
(
"Reference_Sequence_Id"
,"Reference_Text"
,"Page_Id"
,"Text_Id"
,"Audit_Create_Date"
,"Audit_Create_User"
)
SELECT "Reference_Id" as Reference_Sequence_Id
       ,COALESCE("Reference_Text", 'No Reference Text') as Reference_Text
  ,pt."Page_Id"
  ,1 as Text_Id
  ,CURRENT_DATE AS Audit_Create_Date
  ,'Andy' as Audit_Create_User
  FROM "AllRoads"."CCC_References_Stage_T" as crst
       INNER JOIN
  "AllRoads"."Pages_T" as pt on (crst."File_Name" = 'CCC.' || pt."File_Name")
ORDER BY pt."Page_Id"
      ,crst.”Reference_Line_Number"

References

[1] Vatican. Catechism of the Catholic Church. Retrieved from:  https://www.vatican.va/archive/ENG0015/_INDEX.HTM