Google Search

Monday, September 30, 2013

Sql query to fill gaps (partitioned outer join)

One of the common scenario in any application would be existing of sparse data and need to fill the missing data.
Below example would show how to findout gaps and fill the data from previous day values.

Lets consider two tables, Stocks (Dimension) and Stock_PRC (Fact Table)

create table stocks(
id number,
ticker varchar2(10)
,company_name varchar2(100)
,descr varchar2(200)
);

create table stocks_prc(
prc_id number
,stocks_id number
,price number
,business_dt date
);

stocks_id has a forign key reference to id column of stocks table.
Let fill some data.

create sequence stockssq;



insert into stocks values (stockssq.nextval,'INFY','Infosys India Ltd', 'regdInfosys India Ltd.Regd');
insert into stocks values (stockssq.nextval,'WIPR','Wipro services Ltd', 'Wipro India Services Ltd.Regd');
insert into stocks values (stockssq.nextval,'TCS','Tata Consultancy', 'Tata Consultancy India ltd');
commit;

create sequence stocks_prcsq
So now Stockid 2 will be infosys,3 TCS and 4 Wipro


insert into stocks_prc (PRC_ID,STOCKS_ID,PRICE,BUSINESS_DT)
select stocks_prcsq.nextval,2,0.34*level,to_date('1-Jan-2013') + level from
dual connect by level < 10;

insert into stocks_prc (PRC_ID,STOCKS_ID,PRICE,BUSINESS_DT)
select stocks_prcsq.nextval,3,0.11*level,to_date('1-Jan-2013') + level from
dual connect by level < 10;

insert into stocks_prc (PRC_ID,STOCKS_ID,PRICE,BUSINESS_DT)
select stocks_prcsq.nextval,4,0.313*level,to_date('1-Jan-2013') + level from
dual connect by level < 10;

commit;
Now this data is dense, which means we have data for all days and no missing data.


Lets delete some days of data to make it sparse for demonstration purpose:
delete from stocks_prc where stocks_id=3 and business_dt ='7-Jan-2013';

Now for Infosys stock we dont have data for 5 Jan and for TCS , 7th Jan data is missing.
Lets try to get the missing data as a first step.

SELECT PRC_ID,STOCKS_ID,PRICE,ALL_DATE
FROM business_dt_tab ,
  (SELECT * FROM stocks_prc where stocks_id=2
  )
WHERE all_date = business_dt(+)
order by ALL_DATE;

PRC_ID
STOCKS_ID
PRICE
ALL_DATE
2
2
0.34
02-01-13 12:00
3
2
0.68
03-01-13 12:00
4
2
1.02
04-01-13 12:00
NULL
NULL
NULL
05-01-13 12:00
6
2
1.7
06-01-13 12:00
7
2
2.04
07-01-13 12:00
8
2
2.38
08-01-13 12:00
9
2
2.72
09-01-13 12:00


But we dont have Stocks_ID info, without which data is not meaningful.
Hence we need to add the partitioned outer join.
Such a join extends the conventional outer join syntax by applying the outer join to each logical partition defined in a query.
Oracle logically partitions the rows in your query based on the expression you specify in the PARTITION BY clause. 
The result of a partitioned outer join is a UNION of the outer joins of each of the partitions in the logically partitioned table 
with the table on the other side of the join.


SELECT STOCKS_ID,PRICE,ALL_DATE
FROM stocks_prc PARTITION BY(stocks_id) right outer join 
  business_dt_tab 
on (all_date = business_dt)
order by ALL_DATE;

From this, we will get the Stocks_id also, thats the speciality of Parition by clause here.

STOCKS_ID
PRICE
ALL_DATE
2
0.34
02-01-13 12:00
2
0.68
03-01-13 12:00
2
1.02
04-01-13 12:00
2
 NULL
05-01-13 12:00
2
1.7
06-01-13 12:00
2
2.04
07-01-13 12:00

Now lets try to fill previous days data using the LAG analytic function.

SELECT STOCKS_ID,
nvl(PRICE,lag(price) over (partition by stocks_id order by all_date)) price
,ALL_DATE
FROM stocks_prc PARTITION BY(stocks_id) right outer join 
  business_dt_tab 
on (all_date = business_dt);

STOCKS_ID
PRICE
ALL_DATE
2
0.34
02-01-13 12:00
2
0.68
03-01-13 12:00
2
1.02
04-01-13 12:00
2
1.02
05-01-13 12:00
2
1.7
06-01-13 12:00
2
2.04
07-01-13 12:00
2
2.38
08-01-13 12:00

References:



3 comments:

  1. what happens when there is more that one null row. your data was null on 5/1. But what if it was null on 6/1 as well. I don't think your approach will work. because lag(null) will return null on the 2nd row.

    ReplyDelete
  2. Better approach is to use "LAST_VALUE" function rather than LAG.

    ReplyDelete