Prescient Analytics

Grouping Temporal Sequences in SQL

08/10/2018 10:45 PM By kpauligk

I was recently working in a PostgreSQL (9.6) environment and hit one of those problems that comes around infrequently enough that it isn't a worn pattern in my brain. Contributing to this is the fact that I dwelt for a long time in SQL Server 2008 land, which lacked a lot of functionality (the lead/lag window functions, for example) that has been in the wild for quite a while.


Problem: Temporal transaction data that needs to be grouped into contiguous sequences (runs of consecutive dates). In this scenario, there is also an additional column by which the data must be grouped first.

The set-up code to demonstrate the approach: 

drop table if exists seq_demo
;
create table seq_demo (
dt date
-- GROUPING is a reserved word in PostgreSQL 9.5+, so the column name must be quoted
, "grouping" character varying
);
insert into seq_demo values ('2018-01-01', 'grp1');
insert into seq_demo values ('2018-01-02', 'grp1');
insert into seq_demo values ('2018-01-04', 'grp1');
insert into seq_demo values ('2018-01-05', 'grp1');
insert into seq_demo values ('2018-01-06', 'grp1');
insert into seq_demo values ('2018-01-03', 'grp2');
insert into seq_demo values ('2018-01-07', 'grp2');
insert into seq_demo values ('2018-01-08', 'grp2');
insert into seq_demo values ('2018-01-09', 'grp2');
insert into seq_demo values ('2018-01-10', 'grp2')
;

The above test data creates four sequences across the two groupings: { 'grp1', 'grp2' }

And we would like to transform this into the following output:

1  grp1  2018-01-01  2018-01-02
2  grp1  2018-01-04  2018-01-06
3  grp2  2018-01-03  2018-01-03
4  grp2  2018-01-07  2018-01-10

This is one approach: it uses the lag function, followed by a cumulative (windowed) sum to create the groups.

with lagged as (
select
dt
, "grouping"
, lag(dt) over (partition by "grouping" order by dt) as lagged_dt
-- rn preserves row order for the next step, which is sometimes desired
, row_number() over (partition by "grouping" order by dt) as rn
from seq_demo
)
, grouped as (
select
dt
, "grouping"
, lagged_dt
-- subtracting two dates in PostgreSQL yields an integer number of days
, (dt - lagged_dt) as days_diff
-- flag the first row of each new sequence: no prior row, or a gap of more than one day
, case when lagged_dt is null or (dt - lagged_dt) > 1 then 1 else 0 end as changed_dt
, sum(case when lagged_dt is null or (dt - lagged_dt) > 1 then 1 else 0 end)
over (partition by "grouping" order by rn rows between unbounded preceding and current row) as dt_group
-- note: on PostgreSQL the explicit frame clause is not required here, since a window
-- with an order by defaults to a frame ending at the current row
from lagged
)
select
"grouping"
, dt_group
, min(dt) as start_dt
, max(dt) as end_dt
from grouped
group by
"grouping"
, dt_group
order by
"grouping"
, dt_group
;

My results (dt_group restarts at 1 within each grouping):

grouping  dt_group  start_dt    end_dt
grp1      1         2018-01-01  2018-01-02
grp1      2         2018-01-04  2018-01-06
grp2      1         2018-01-03  2018-01-03
grp2      2         2018-01-07  2018-01-10

The specific use case this solved was to identify isolated transactions that had a group count of 1. 
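For that use case, a having clause over the same grouping does the trick — a minimal sketch, assuming the grouped CTE from the query above:

select
"grouping"
, dt_group
, min(dt) as isolated_dt
from grouped
group by
"grouping"
, dt_group
-- keep only sequences consisting of a single transaction
having count(*) = 1
;

With the demo data this returns one row: the lone 2018-01-03 transaction in grp2.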


This pattern also works well to condense repeated dimensional data that appears in a transactional form.
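For example — a sketch with hypothetical table and column names — suppose a dim_history table logs (dt, customer_id, status) on every transaction, repeating the status even when it hasn't changed. Swapping the date-gap test for a value-change test collapses the repeats into effective-date ranges:

with lagged as (
select
dt
, customer_id
, status
, lag(status) over (partition by customer_id order by dt) as prev_status
, row_number() over (partition by customer_id order by dt) as rn
from dim_history
)
, grouped as (
select *
-- is distinct from treats the first row (null prev_status) as a change
, sum(case when status is distinct from prev_status then 1 else 0 end)
over (partition by customer_id order by rn) as status_group
from lagged
)
select
customer_id
, status
, min(dt) as effective_from
, max(dt) as effective_to
from grouped
group by
customer_id
, status
, status_group
order by
customer_id
, min(dt)
;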
