Monday, April 28, 2014

A methodical approach to estimation and analysis

Our company is considering using some of SQL Server 2008 R2's features to reduce loading times from the product database to the data warehouse. Specifically, we are looking at Change Tracking (CT) or Change Data Capture (CDC). In a perfect world it would be easy to say CDC is the clear choice. However, we face some hurdles.
  • We have Enterprise Edition in production but our "pre-prod" environment is Standard Edition. Since CDC requires Enterprise Edition, implementing it means additional licensing costs to upgrade the pre-prods (that's a whole 'nother story)
  • We have limited storage
  • The storage budget has already been allocated for this year
My task: Estimate the amount of additional storage required for CDC.


There aren't many websites, blogs, or statistics out there (at least that I've found) on how much additional storage CDC requires, so I needed to perform that analysis and come up with the estimates myself.


Since CDC creates a change table for each tracked source table, I realized I could get the approximate size of a row for each of those change tables. Then, all I would need to know was the approximate number of rows changed each day.
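
To make the math concrete before diving in, here's the back-of-the-napkin version with made-up numbers (the real row sizes and change counts come from the steps below):

100 bytes per CT row x 50,000 changed rows per day ≈ 5 MB per day
5 MB per day x 365 days ≈ 1.8 GB per year, for one table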


Step 1) I enabled CDC for the source tables required for the data warehouse

if (select is_cdc_enabled from sys.databases where name = 'mySaaSDB') = 0
BEGIN
    PRINT 'Enabling CDC'
    EXEC sys.sp_cdc_enable_db
END
ELSE
    PRINT 'CDC already enabled'
GO

EXEC sys.sp_cdc_enable_table
      @source_schema = N'dbo'
    , @source_name = N'tblTask'
    , @role_name = NULL -- N'MyRole'
    , @capture_instance = N'tblTask_CDC' -- May be omitted
    , @filegroup_name = N'PRIMARY'
    , @supports_net_changes = 0 -- 0 for all changes, 1 for all changes and net changes
    -- , @captured_column_list = NULL -- NULL or omit for all columns. Not the best practice
    , @captured_column_list = N'EmployerId,
      TaskId, TaskTypeId, TaskStatusId, CriticalScore, TaskPriorityId, StartDate, DueDate, StartedDate, CompletedDate,
      CancelledDate, EnteredDate, EnteredBy, ChangedDate, ChangedBy, LeaveId, EmployeeId'
GO
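
To sanity-check the setup before moving on, the catalog can confirm what's actually being captured (these are standard CDC catalog objects; mySaaSDB and tblTask are just my objects from above):

-- Confirm CDC is on at the database and table level
select name, is_cdc_enabled from sys.databases where name = 'mySaaSDB'
select name, is_tracked_by_cdc from sys.tables where name = 'tblTask'

-- List the capture instance details (captured columns, filegroup, etc.)
EXEC sys.sp_cdc_help_change_data_capture @source_schema = N'dbo', @source_name = N'tblTask'
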
Then, I used the system catalog views to get the columns, data types, and sizes for both the base table and the CT table.
select T.name, C.name, ty.name, C.max_length
from sys.tables T
join sys.columns C
    on C.object_id = T.object_id
join sys.types ty
    on ty.user_type_id = C.user_type_id -- user_type_id avoids duplicate rows where alias types share a system_type_id
where T.name = 'tblTask'


select T.name, C.name, ty.name, C.max_length
from sys.tables T
join sys.columns C
    on C.object_id = T.object_id
join sys.types ty
    on ty.user_type_id = C.user_type_id
where T.name = 'tblTask_CDC_CT'

Since most of the columns captured for CDC are fixed-length, precise data types, it's easy to estimate how big each row in the CT table will be and compare that to the base table.
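
A rough cut of that estimate can come straight from sys.columns by summing max_length per table. This is a sketch: it prices every column at its maximum width (so variable-length columns are overstated), and it automatically includes the five metadata columns CDC adds to every change table (__$start_lsn, __$end_lsn, __$seqval, __$operation, __$update_mask) since those show up in sys.columns too.

-- Approximate bytes per row for the base table vs. its CT table
-- Note: max_length is -1 for (max) data types; those need separate handling
select T.name, count(*) as column_count, sum(C.max_length) as approx_row_bytes
from sys.tables T
join sys.columns C
    on C.object_id = T.object_id
where T.name in ('tblTask', 'tblTask_CDC_CT')
group by T.name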


Now, all I need to do is get the number of rows changed each day. For our application, most changes come from INSERT operations. Typically, just new rows are added (although for some tables, when a new row is added for a task, the original row is updated by nulling out an ineffectedDate column).


I had tried using the ChangedDate and EnteredDate columns to get an average number of changes per day, but I just wasn't getting any warm fuzzies from that.


While doing my analysis, I had used the sp_spaceused procedure to get the number of rows and the data size for each table and, combined with the column sizes above, calculate the average size of each row.


Then it dawned on me: sp_spaceused tells me exactly how many rows are in the table.


If I run that each day, I can compare each day's row count to the previous day's and get a delta.
rows changed * CDC CT table row size = additional CDC storage for that table per day
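
Here's a sketch of what that daily capture and delta could look like (DailyRowCounts is a hypothetical logging table; sp_spaceused's single-object form returns the columns shown):

-- Hypothetical table to log one row count per table per day
create table dbo.DailyRowCounts (
    SnapshotDate date not null,
    TableName sysname not null,
    RowCnt bigint not null
)
GO

-- sp_spaceused 'table' returns: name, rows, reserved, data, index_size, unused
create table #space (
    name sysname, rows varchar(20), reserved varchar(20),
    data varchar(20), index_size varchar(20), unused varchar(20)
)
insert into #space exec sp_spaceused 'dbo.tblTask'

insert into dbo.DailyRowCounts (SnapshotDate, TableName, RowCnt)
select cast(getdate() as date), name, cast(rows as bigint)
from #space

drop table #space

-- Yesterday-to-today delta per table
select today.TableName, today.RowCnt - yesterday.RowCnt as rows_changed
from dbo.DailyRowCounts today
join dbo.DailyRowCounts yesterday
    on yesterday.TableName = today.TableName
   and yesterday.SnapshotDate = dateadd(day, -1, today.SnapshotDate)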


Summing that for all tables gives me a reasonable lower-bound estimate of how much additional storage will be required to enable Change Data Capture. I say lower bound because a row-count delta only shows net new rows; it still doesn't tell me how many rows were updated (updates generate change rows too), but it's a start.
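
Putting it together, the total is just the per-table deltas multiplied by each table's CT row size (again a sketch against the hypothetical DailyRowCounts; the flat 100 bytes stands in for the per-table sizes from the sys.columns estimate):

-- Estimated additional CDC storage per day, summed across tracked tables
select sum((today.RowCnt - yesterday.RowCnt) * 100) as estimated_daily_cdc_bytes -- 100 = illustrative CT row size
from dbo.DailyRowCounts today
join dbo.DailyRowCounts yesterday
    on yesterday.TableName = today.TableName
   and yesterday.SnapshotDate = dateadd(day, -1, today.SnapshotDate)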

