Using Merge Tools

Prerequisites

  • merge_utilities.top_level_matches: The top_level_matches table in merge_utilities must be populated with data for the provider_merge DAG to work. The table has two columns: masterid, which holds the ID of the top-level record in the master hierarchy tree, and non_masterid, which holds the ID of the top-level record in the non-master hierarchy tree.

  • merge_utilities.duplicate_keys: The duplicate_keys table in merge_utilities must be populated with data for the provider_merge DAG to populate the matched_rows table. The table has two columns: migrationid, the migration ID of the record, and duplicate_key, the value of the field that caused the duplicate check to fail (e.g., TIN, NPI, name, address).

  • Source Explorer: Source Explorer must be configured with the source name ProviderMerge. Here is an example:

"ProviderMerge": {
    "source_label": "Provider Merge",
    "source_schema": "landing",
    "objects": {
        "account": {
            "id_col": "id",
            "display_name": "Account",
            "fields": {
                "id": "ID",
                "name": "Name",
                "recordtypeid": "Record Type",
                "shippingstate": "State",
                "parentid": "Parent ID"
            },
            "facet_fields": ["State", "Record Type"],
            "lookups": [
                {
                    "field": "parentid",
                    "object": "account",
                    "ref_field": "id"
                }
            ]
        },
        "contact": {
            "id_col": "id",
            "display_name": "Contact",
            "fields": {
                "firstname": "First Name",
                "recordtypeid": "Record Type",
                "accountid": "Account ID",
                "id": "ID",
                "v12n__npi_number__c": "NPI"
            },
            "facet_fields": ["Record Type"],
            "lookups": [
                {
                    "field": "accountid",
                    "object": "account",
                    "ref_field": "id"
                }
            ]
        }
    }
}
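
As a hypothetical illustration (all IDs and key values below are made up, not taken from real data), the two prerequisite tables might be populated like this:

```sql
-- Hypothetical example: pair the top-level record of a master hierarchy tree
-- with the top-level record of the non-master tree it matched against.
INSERT INTO merge_utilities.top_level_matches (masterid, non_masterid)
VALUES ('001xx000003DGbQAAW', '001xx000003DGbRAAW');

-- Hypothetical example: record which field value caused the duplicate check
-- to fail for a given record (here, a shared NPI).
INSERT INTO merge_utilities.duplicate_keys (migrationid, duplicate_key)
VALUES ('a0Bxx0000012345', '1234567890');
```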

Optional

  • remap_instructions: By default, the provider_merge DAG adds the non_masterids from top_level_matches to utilities.to_smart_delete, and in the main pipeline all records in the non-master tree are marked for smart delete. If there are non-master records you want to remap into the master hierarchy, or records in the master hierarchy you want to delete, you will need to add rows to the remap_instructions table in the merge_utilities schema so those records can be moved into the other tree. The table has four columns: objectid, object_type, column_name, and new_value.

  • Custom confidence function: If a merge requires certain data to come from one matched record and the rest to come from the other record, you will need to define a custom confidence function in the internal database. This function must be source- and object-type-specific, and it must be added to the config file's confidence_functions list (see the Custom Confidence Function Merge Configuration section in the config file documentation for more details). Functions are used automatically whenever a record of the defined type and source is in the matched_rows table. Each function must take two arguments, recordid and object_type, and must return a table with the columns rowid, object_type, column_name, and multiplier, where each row is a different column name from the table landing.<object_type>. Below is an example of a custom confidence function script:

create or replace function raw.source_b_account_merge(
    p_recordid uuid,
    p_object_type text
) returns table (
    rowid uuid,
    object_type text,
    column_name text,
    multiplier int
) language plpgsql as $$
DECLARE
    col RECORD;
BEGIN
    -- Retrieve the 'phone' or billing-address related columns if they exist
    -- in the table landing.<p_object_type>
    FOR col IN
        SELECT c.column_name
        FROM information_schema.columns c
        WHERE c.table_schema = 'landing'
          AND c.table_name = p_object_type
          AND c.column_name IN ('phone', 'billingcity', 'billingstate', 'billingstreet', 'billingpostalcode')
    LOOP
        -- Return multipliers for the 'phone' and billing columns
        RETURN QUERY EXECUTE format(
            'SELECT %L::UUID, %L::TEXT, %L::TEXT,
                CASE
                    WHEN %L = ''phone'' THEN 3
                    WHEN %L IN (''billingcity'', ''billingstate'', ''billingstreet'', ''billingpostalcode'') THEN 1
                    ELSE NULL
                END::INT
             FROM landing.%I
             WHERE id = %L',
            p_recordid,
            p_object_type,
            col.column_name,
            col.column_name,
            col.column_name,
            p_object_type,
            p_recordid::TEXT
        );
    END LOOP;
END;
$$;
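
As a hypothetical illustration of the remap_instructions table described above (all IDs below are made up), a row that repoints a non-master contact at an account in the master tree might look like this:

```sql
-- Hypothetical example: remap a non-master contact into the master hierarchy
-- by changing its accountid to an account ID in the master tree.
INSERT INTO merge_utilities.remap_instructions
    (objectid, object_type, column_name, new_value)
VALUES
    ('003xx000004TmiQAAS', 'contact', 'accountid', '001xx000003DGbQAAW');
```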
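
A custom confidence function such as the one above can also be smoke-tested manually before it is added to the confidence_functions list; assuming a record with the given UUID exists in landing.account (the UUID below is a placeholder):

```sql
-- Manual smoke test of the function defined above; the UUID is illustrative.
SELECT rowid, object_type, column_name, multiplier
FROM raw.source_b_account_merge(
    '11111111-2222-3333-4444-555555555555'::uuid,
    'account'
);
```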

Run Demo Merge

To run the demo project, complete the setup steps and follow the instructions in Running the Demo Project.

After data has been loaded into the landing schema, run the 04_data_quality_and_mastering DAG to copy data into staging and run duplicate record checks.

Using the duplicate record check results, the migration team should populate the two tables in merge_utilities: top_level_matches and duplicate_keys.

Once these tables are populated, you can run the provider_merge DAG. This DAG has tasks that read the two merge_utilities tables and populate the table merge_utilities.matched_rows. Afterward, the dbt project merge_scoring will run, creating a series of intermediate tables in the merge_scoring schema. These tables can be used to debug issues with merge_instructions.

The intermediate tables, in the order they are created, are:

  • merge_scoring.record_pairs_w_source
  • merge_scoring.confidence_multiplier
  • merge_scoring.confidence_multiplier_expanded
  • merge_scoring.confidence_calculated
  • merge_utilities.merge_instructions

At the end of provider_merge, the table merge_utilities.merge_instructions will be populated. At the very end of provider_merge, a test checks whether merge_instructions would change an isolated field that is part of a grouped field. If so, that task fails, causing the provider_merge DAG to fail, and when 04_data_quality_and_mastering is subsequently run, the task merge_instructions_pre_check will fail, causing all further tasks to fail. If the test passes, you can now run the 04_data_quality_and_mastering DAG.

Additionally, at the end of provider_merge, the non_masterids in top_level_matches are added to utilities.to_smart_delete. This ensures that when 04_data_quality_and_mastering is run, the non-master tree records will be marked for smart delete.
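
The smart-delete handoff described above is roughly equivalent to the following sketch (the actual task implementation, and the column name in utilities.to_smart_delete, are assumptions here):

```sql
-- Sketch of the final provider_merge step: queue every non-master top-level
-- record for smart delete. The target column name (id) is an assumption.
INSERT INTO utilities.to_smart_delete (id)
SELECT non_masterid
FROM merge_utilities.top_level_matches;
```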