Introduction
In the previous article related to this project, we were able to take the parsed text data from the Catechism of the Catholic Church (CCC)1 and ingest into a set of staging tables. Now that this is complete, it is on to loading the data into the landing set of tables. I expect this to have some surprises.
I hope at this point, the reality of analytics work is becoming clear. Any project requiring data in an organized fashion demands time. As you become more familiar with these processes, you will become quicker at it yet the work still must be done. Shortcuts are not recommended.
Approach
For this project, I decided not to use tools. This is old school and has rough edges. At the same time, it is also practical. Knowing that these loads are one and done in their approach makes it direct. Sometimes setting up a tool is a project onto itself. Using built-in functionality can be a better choice.
If you read other articles on ETL processes, you know there are some transformations that will occur. These are usually done on during the load step. The important difference is that the addition of Ids and relationships are critical. Referential integrity building is not always straightforward.
Fortunately, the previous steps have helped prepare the entirety of data. We will have to consider how the individual records will meet the data architecture. This is where a good set of Source To Target Mappings (STTM) helps drastically. This will keep the thoughts from getting muddied as development happens.
Special Cases
Before going further, I will talk about the special cases. Texts_T is the first one. This is a very straightforward single record but important. This is the root of the hierarchy. All pages tie into this single text record. Since it is so simplistic, a quick single row insert will be used.
Next are the chapters. The layout for the CCC data was a bit more challenging than expected. For now, it is in a pause state. Even though it is an optional component, I will revisit this organization effort at another date.
The other special case is the abbreviations load. The source file is loaded directly using PostgreSQL’s pgAdmin. There is nothing fancy going on here so why complicate it.
Additions
When designing the load process, I had to go back and add-in an additional field. This is important to note as no design remains concrete. In this case, the source File_Name (data type set to varchar 1000) is added to Pages_T. This is needed to get the Page_Id for the other loads.
I also considered adding in Paragraph_Id to References_T. This seemed like a good way to tie the 2 together at first. After thinking about the order of the loads, I decided to avoid the possibility of circular references downstream.
Design
The design is primarily held in the following STTMs. Wherever necessary, I will add more detail. The resulting Data Manipulation Language (DML) is in the appendix section.
Reject Records
In a project where the process is executed on a schedule, we would need to be concerned with capturing the records that do not load according to the rules. They would be stored in a separate set of tables, typically called the reject tables, for further analysis. The load process would need to take that into consideration at each and every execution.
In this project, the loads are single execution. Any updates to the text will be done in a separate process whenever the next version comes out.
Order of loads
There needs to be an organized process for these loads due to Id dependency. The Page_Id is required by paragraphs and references so Pages must come before them. Also, Text_Id is required so that must come before all of them. The order of load will be:
- Texts_T
- Pages_T
- Paragraphs_T
- References_T
Load to Texts_T
As stated earlier, this table is a single record. This is to be entered manually.
| Source Table | Source Field | Data Type | Transformation | Target Table | Target Field | Date Type |
| Set to “1” | Texts_T | Text_Id | integer | |||
| Set to “Catechism of the Catholic Church” | Texts_T | Title | character varying (1000) | |||
| Set to “Catechism of the Catholic Church” | Texts_T | Description | text | |||
| Set to “https://www.vatican.va/archive/ENG0015/_INDEX.HTM“ | Texts_T | URL | character varying (2048) | |||
| Set to NULL | Texts_T | Group | character varying (50) | |||
| Set to “Religious Reference” | Texts_T | Type | character varying (50) | |||
| Set to CURRENT_DATE | Texts_T | Audit_Create_Date | timestamp | |||
| Set to “Andy” | Texts_T | Audit_Create_User | character varying (100) | |||
| N/A | Texts_T | Audit_Update_Date | timestamp | |||
| N/A | Texts_T | Audit_Update_User | character varying (100) | |||
| *Note: Single record insert |
Load to Pages_T
The pages load comes directly from the staging area without any joins. The item to remember is to use the Text_Id of 1 for CCC. For Page_Id, remember that the serial generation feature of PostgreSQL has been used. This is going to be the same for the remaining tables.
| Source Table | Source Field | Data Type | Transformation | Target Table | Target Field | Date Type |
| System Generated ID | Pages_T | Page_Id | integer | |||
| Set to ROW_NUMBER() by sorting on the File_Name. | Pages_T | Page_Sequence_Id | integer | |||
| CCC_Pages_Stage_T | TItle | character varying (1000) | Pages_T | Title | character varying (1000) | |
| Set to NULL | Pages_T | Description | text | |||
| CCC_Pages_Stage_T | URL | character varying (2048) | Pages_T | URL | character varying (2048) | |
| CCC_Pages_Stage_T | File_Name | character varying (1000) | Pages_T | File_Name | character varying (1000) | |
| Set to 1 for “CCC” Text_Id | Pages_T | Text_Id | integer | |||
| Set to NULL | Pages_T | Chapter_Id | integer | |||
| Set to CURRENT_DATE | Pages_T | Audit_Create_Date | timestamp | |||
| Set to “Andy” | Pages_T | Audit_Create_User | character varying (100) | |||
| N/A | Pages_T | Audit_Update_Date | timestamp | |||
| N/A | Pages_T | Audit_Update_User | character varying (100) |
Load to Paragraphs_T
The paragraphs load is more than a straight load. It requires a join back to Pages_T. This is necessary to get the Page_Id for the page the paragraph is on. Text_Id can be manually set. For clarity, sorting is down to organize the order stored in the table.
There is also the conversion of the reference Ids from a delimited string to JSON format. This is where PostgreSQL’s functions come into play. It is simple matter of calling the function after parsing out the delimited string.
| Source Table | Source Field | Data Type | Transformation | Target Table | Target Field | Date Type |
| System Generated ID | Paragraphs_T | Paragraph_Id | integer | |||
| CCC_Paragraphs_Stage_T | Paragraph_Id | bigint | When Paragraph_Id = ‘-1’ then set to NULL | Paragraphs_T | Paragraph_Sequence_Id | character varying (1000) |
| CCC_Paragraphs_Stage_T | Paragraph_Text | text | Paragraphs_T | Paragraph_Text | text | |
| CCC_Paragraphs_Stage_T | URL | character varying (2048) | Paragraphs_T | URL | character varying (2048) | |
| CCC_Paragraphs_Stage_T | References | text | Convert comma delimited string to json using to_json | Paragraphs_T | Reference_Sequence_Ids | json |
| Set to 1 for “CCC” Text_Id | Paragraphs_T | Text_Id | integer | |||
| CCC_Paragraphs_Stage_T | File_Name | character varying (1000) | Lookup Page_Id from Pages_T by joining to the number portion of the File_Name | Paragraphs_T | Page_Id | integer |
| Set to CURRENT_DATE | Paragraphs_T | Audit_Create_Date | timestamp | |||
| Set to “Andy” | Paragraphs_T | Audit_Create_User | character varying (100) | |||
| N/A | Paragraphs_T | Audit_Update_Date | timestamp | |||
| N/A | Paragraphs_T | Audit_Update_User | character varying (100) | |||
| *NOTE: Sort by Paragraph_Line_Number in an ascending order |
Load to References_T
The references load is similar to the paragraphs load with Page_Id and Text_Id. A sort is also completed to assist in organization.
For future thought, there should be a better use of Reference_Id in the data architecture. This might assist in join speed.
| Source Table | Source Field | Data Type | Transformation | Target Table | Target Field | Date Type |
| System Generated ID | References_T | Reference_Id | integer | |||
| CCC_References_Stage_T | Reference_Id | text | References_T | Reference_Sequence_Id | character varying (1000) | |
| CCC_References_Stage_T | Reference_Text | text | References_T | Reference_Text | text | |
| Set to NULL | References_T | URL | character varying (2048) | |||
| CCC_References_Stage_T | File_Name | character varying (1000) | Lookup Page_Id from Pages_T by joining to the number portion of the File_Name | References_T | Page_Id | integer |
| Set to 1 for “CCC” Text_Id | References_T | Text_Id | integer | |||
| Set to CURRENT_DATE | References_T | Audit_Create_Date | timestamp | |||
| Set to “Andy” | References_T | Audit_Create_User | character varying (100) | |||
| N/A | References_T | Audit_Update_Date | timestamp | |||
| N/A | References_T | Audit_Update_User | character varying (100) | |||
| *Note: Sort by Page_Id and Reference_Line_Number in an ascending order |
Testing Results
Always remember to do some sort of data validation! In this phase, we will rely on record counts as well as checking join conditions.
Record Counts
Since there are no aggregations being done, there should be a 1:1 between source and target tables.
| Source | Count | Target | Count |
| CCC_Pages_Stage_T | 374 | Pages_T | 374 |
| CCC_Paragraphs_Stage_T | 4507 | Paragraphs_T | 4507 |
| CCC_References_Stage_T | 3684 | References_T | 3684 |
The numbers line up 1:1 so we can move on to the next test.
Orphaned Joins
An orphan is a record that has no parent record. In some tables, this is not needed. In this project, there need to be joins back to the Pages_T and Text_T tables. This requires testing the joins.
Test for Text_Id is simply making sure the value of 1 is in all records.
SELECT count(*) FROM "AllRoads"."Pages_T" where "Text_Id" <> 1
0 records
SELECT * FROM "AllRoads"."Paragraphs_T" where "Text_Id" <> 1
0 records
SELECT count(*) FROM "AllRoads"."References_T" where "Text_Id" <> 1
0 records
All Text_Ids are set to 1. Again, simple since we have only 1 text.
Testing for Page_Id is more complicated. It needs to make sure that the Id exists and that is as expected. To do this, we should do some selects checking for missing joins. Starting with Paragraphs, check for any missing joins to Pages_T.
SELECT count(*)
FROM "AllRoads"."Paragraphs_T" pt
left outer join
"AllRoads"."Pages_T" pgt
on (pt."Page_Id" = pgt."Page_Id")
where pt."Page_Id" is null
0 records
That’s good. There are no missing joins. Let’s double check the authenticity by making sure there are no fallout records when joining.
SELECT count(*)
FROM "AllRoads"."Paragraphs_T" pt
inner join
"AllRoads"."Pages_T" pgt
on (pt."Page_Id" = pgt.”Page_Id")
4507 records
That’s also good. Let’s repeat for the References_T
SELECT count(*)
FROM "AllRoads"."References_T" rt
left outer join
"AllRoads"."Pages_T" pt
on (rt."Page_Id" = pt."Page_Id")
where pt."Page_Id" is null
0 records
SELECT count(*)
FROM "AllRoads"."References_T" rt
inner join
"AllRoads"."Pages_T" pt
on (rt."Page_Id" = pt."Page_Id")
3684 records
These may seem like basic tests but they are essential, as they prove the joins are working.
Spot check
As one final precaution, let’s do some joins and spot check.
SELECT tt."Title", pt."Title", rt."Reference_Text"
FROM "AllRoads"."References_T" rt
inner join
"AllRoads"."Pages_T" pt
on (rt."Page_Id" = pt."Page_Id")
inner join
"AllRoads"."Texts_T" tt
on (rt."Text_Id" = tt.”Text_Id")
Here are the results (5 rows only shown).
| Catechism of the Catholic Church | PROLOGUE | Jn 17 3 |
| Catechism of the Catholic Church | PROLOGUE | 1 Tim 2:3-4. |
| Catechism of the Catholic Church | PROLOGUE | Acts 4:12 |
| Catechism of the Catholic Church | I. The life of man – to know and love God | Mt 28:19-20 |
| Catechism of the Catholic Church | I. The life of man – to know and love God | Mk 16:20” |
This is as expected. Now, let’s put it all together with Pages, Paragraphs, and References.
select p."Page_Title"
,p."Paragraph_Sequence_Id"
,rt."Reference_Text"
from (
select pgt."Title" as "Page_Title"
,pgt."Page_Id"
,pt."Paragraph_Sequence_Id"
,json_array_elements_text(pt."Reference_Sequence_Ids")::int::text as "Reference_Sequence_Id"
from "AllRoads"."Paragraphs_T" pt
inner join
"AllRoads"."Pages_T" pgt
on (pt."Page_Id" = pgt."Page_Id")
) p
inner join
"AllRoads"."References_T" rt
on (p."Reference_Sequence_Id" = rt."Reference_Sequence_Id"
and p."Page_Id" = rt."Page_Id")
order by p."Page_Id", p."Paragraph_Sequence_Id", rt.”Reference_Sequence_Id"
Here are the results (5 rows only shown).
| I. The life of man – to know and love God | 2 | Mt 28:19-20 |
| I. The life of man – to know and love God | 2 | Mk 16:20 |
| I. The life of man – to know and love God | 3 | Cf. Acts 2:42 |
| II. Handing on the Faith: Catechesis | 10 | Extraordinary Synod of Bishops 1985,. Final Report II B a, 4. |
| II. Handing on the Faith: Catechesis | 10 | John Paul II, Discourse at the Closing of the Extraordinary Synod of Bishops 7 December 1985: AAS 78, (1986).” |
That’s saying a lot in the results. That query shows that the architecture works and the results match what is in the CCC. I’m feeling confident at this point that the project is heading in the right direction.
Next Steps
Now that the CCC has been loaded into the database, it’s on to the Bible. I can see the challenge with this text being related to the many books contained within it. There are numerous files to handle. I tried to prepare them into a useable format but let’s see where it leads.
Thank you
Appendix
DML
Load Pages_T from CCC_Pages_Stage_T
INSERT INTO "AllRoads"."Pages_T"
(
"Page_Sequence_Id"
,"Title"
,"Description"
,"URL"
,"File_Name"
,"Text_Id"
,"Chapter_Id"
,"Audit_Create_Date"
,"Audit_Create_User"
)
SELECT ROW_NUMBER() OVER(ORDER BY CAST("File_Name" as INTEGER)) AS Page_Sequence_Id
,"Title"
,NULL as Description
,"URL"
,"File_Name"
,1 as Text_Id
,NULL as Chapter_Id
,CURRENT_DATE as Audit_Create_Date
,'Andy' as Audit_Create_User
FROM "AllRoads"."CCC_Pages_Stage_T"
Load Paragraphs_T from CCC_Paragraphs_Stage_T
INSERT INTO "AllRoads"."Paragraphs_T"
(
"Paragraph_Sequence_Id"
,"Paragraph_Text"
,"URL"
,"Reference_Sequence_Ids"
,"Text_Id"
,"Page_Id"
,"Audit_Create_Date"
,"Audit_Create_User"
)
SELECT case
when cpst."Paragraph_Id" = '-1' then null
else cpst."Paragraph_Id"
end as Paragraph_Sequence_Id
,"Paragraph_Text"
,"URL"
,to_json(string_to_array("References",',')) as Reference_Sequence_Ids
,1 as Text_Id
,pt."Page_Id"
,CURRENT_DATE AS Audit_Create_Date
,'Andy' as Audit_Create_User
FROM "AllRoads"."CCC_Paragraphs_Stage_T" as cpst
INNER JOIN
"AllRoads"."Pages_T" as pt on (cpst."File_Name" = 'CCC.' || pt."File_Name")
ORDER BY pt."Page_Id", cpst.”Paragraph_Line_Number"
Load References_T from CCC_References_Stage_T
INSERT INTO "AllRoads"."References_T"
(
"Reference_Sequence_Id"
,"Reference_Text"
,"Page_Id"
,"Text_Id"
,"Audit_Create_Date"
,"Audit_Create_User"
)
SELECT "Reference_Id" as Reference_Sequence_Id
,COALESCE("Reference_Text", 'No Reference Text') as Reference_Text
,pt."Page_Id"
,1 as Text_Id
,CURRENT_DATE AS Audit_Create_Date
,'Andy' as Audit_Create_User
FROM "AllRoads"."CCC_References_Stage_T" as crst
INNER JOIN
"AllRoads"."Pages_T" as pt on (crst."File_Name" = 'CCC.' || pt."File_Name")
ORDER BY pt."Page_Id"
,crst.”Reference_Line_Number"
References
[1] Vatican. Catechism of the Catholic Church. Retrieved from: https://www.vatican.va/archive/ENG0015/_INDEX.HTM
