Prescient Analytics

Handling "Valid" SQL Join Duplication

06/12/2019 12:45 PM Comment(s) By kpauligk

A hump that most people starting out in SQL have to get over is how joins work - that is, thinking relational.

A scenario that came up recently involved matching payments in two different systems, the purpose of which was to trace it back to a source vendor. 


An issue I experienced was the case of “valid” duplicate joining – that is there were two transactions in one table and two in the other and both had the same join column values. Therefore the result with relational joining is that four records are returned since they all join to each other, rather than the desired two.


One way to remove these duplicates is to use good old row_number() to generate a pseudo-key to filter them out.


This is somewhat arbitrary depending on the data, since the only control you have is in the order by clause of the row_number() function. In my use case this was acceptable since other key conditions were met in the join clause.

Below is an example demonstrating this approach:

create temporary table test_legit_matches_1 (

    idx1 int,

    col1 text,

    join_col text

);


create temporary table test_legit_matches_2 (

    idx2 int,

    col2 text,

    join_col text

);


insert into test_legit_matches_1 values (1, 'val1', 'join me');

insert into test_legit_matches_1 values (2, 'val2', 'join me');

insert into test_legit_matches_1 values (3, 'val3', 'nice join');

insert into test_legit_matches_2 values (1, 'val4', 'join me');

insert into test_legit_matches_2 values (2, 'val5', 'join me');

insert into test_legit_matches_2 values (3, 'val6', 'nice join');


-- observed duplication

select

    *

from test_legit_matches_1 as t1


left outer join test_legit_matches_2 as t2

on t1.join_col = t2.join_col

;


-- duplication removed, though join is potentially arbitrary (based on order by clauses)

with _joined as (

    select

        *

        , row_number() over (partition by idx1 order by idx2) as idx1_rn

        , row_number() over (partition by idx2 order by idx1) as idx2_rn


    from test_legit_matches_1 as t1


    left outer join test_legit_matches_2 as t2

    on t1.join_col = t2.join_col


)


select *

from _joined

where idx1_rn = idx2_rn

;

Result 1:


Result 2:


kpauligk

Share -