Monday, April 28, 2014

A methodical approach to estimation and analysis

Our company is considering using some of SQL Server 2008 R2's features to reduce loading times from the product database to the data warehouse. Specifically, we are looking at Change Tracking (CT) or Change Data Capture (CDC). In a perfect world it would be easy to say CDC is the obvious option. However, we face some hurdles.
  • We have Enterprise Edition in production but our "pre-prod" environment is Standard Edition. Implementing something like CDC means an increase in licensing cost to upgrade the pre-prod environments (that's a whole 'nother story)
  • We have limited storage
  • The storage budget has already been allocated for this year
My task: Estimate the amount of additional storage required for CDC.


There aren't many websites, blogs or statistics (at least that I've found) on how much additional storage CDC requires, so I needed to perform that analysis and those estimates myself.


Since CDC creates a change table for each tracked table, I realized I could get the approximate size of a row for each of those tables. Then, all I would need to know is the approximate number of rows changed each day.


Step 1) I enabled CDC for the database and for the source tables required for the data warehouse.

IF (SELECT is_cdc_enabled FROM sys.databases WHERE name = 'mySaaSDB') = 0
BEGIN
    PRINT 'Enabling CDC'
    EXEC sys.sp_cdc_enable_db
END
ELSE
    PRINT 'CDC already enabled'
GO

EXEC sys.sp_cdc_enable_table
      @source_schema = N'dbo'
    , @source_name = N'tblTask'
    , @role_name = NULL -- or N'MyRole' to gate access to the change data
    , @capture_instance = N'tblTask_CDC' -- may be omitted; defaults to schema_table
    , @filegroup_name = N'PRIMARY'
    , @supports_net_changes = 0 -- 0 for all changes; 1 for all changes plus net changes
    -- , @captured_column_list = NULL -- NULL or omit for all columns; not the best practice
    , @captured_column_list = N'EmployerId, TaskId, TaskTypeId, TaskStatusId, CriticalScore, TaskPriorityId, StartDate, DueDate, StartedDate, CompletedDate, CancelledDate, EnteredDate, EnteredBy, ChangedDate, ChangedBy, LeaveId, EmployeeId'
GO
Then, I used the system catalog views to get the columns, data types and sizes of both the base table and the CT table.
SELECT T.name AS table_name, C.name AS column_name, ty.name AS type_name, C.max_length
FROM sys.tables T
JOIN sys.columns C
    ON C.object_id = T.object_id
JOIN sys.types ty
    ON ty.user_type_id = C.user_type_id -- join on user_type_id to avoid duplicate rows for types sharing a system_type_id
WHERE T.name = 'tblTask'


SELECT T.name AS table_name, C.name AS column_name, ty.name AS type_name, C.max_length
FROM sys.tables T
JOIN sys.columns C
    ON C.object_id = T.object_id
JOIN sys.types ty
    ON ty.user_type_id = C.user_type_id
WHERE T.name = 'tblTask_CDC_CT'

Since most of the columns captured for CDC use fixed-length, precise data types, it's easy to estimate how big each row in the CT table will be and compare that to the base table.
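To put a number on that, a rough per-row estimate can be summed straight from the catalog views. This is a minimal sketch under the usual CDC assumptions (change tables live in the cdc schema and end in _CT); variable-length columns are counted at max_length, so treat the result as an upper bound.

SELECT T.name AS change_table,
       SUM(C.max_length) AS approx_row_bytes -- upper bound: variable-length columns counted at max_length; (max) types report -1
FROM sys.tables T
JOIN sys.columns C
    ON C.object_id = T.object_id
WHERE SCHEMA_NAME(T.schema_id) = 'cdc'
  AND T.name LIKE '%[_]CT' -- CDC change tables are named <capture_instance>_CT
GROUP BY T.name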


Now, all I need to do is get the number of rows changed each day. For our application, most changes come from INSERT operations. Typically, just new rows are added (although for some tables, when a new row is added for a task, the original row is updated by nulling out an ineffectedDate column).


I had tried using the ChangedDate and EnteredDate columns to get an average number of changes per day, but I just wasn't getting any warm fuzzies from that.


While doing my analysis, I had used the sp_spaceused procedure to get the number of rows and the data size for each table, and from that calculated the average size of each row.


Then it dawned on me: sp_spaceused tells me exactly how many rows are in the table.


If I run that each day I can then compare each day's row count and get a delta.
Rows changed * CDC CT row size = additional storage per table


Summing that across all tables gives me a lower-bound estimate of how much additional storage will be required to enable Change Data Capture. I say lower bound because I still don't know how many rows actually changed (as opposed to were added), but it's a start.
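Here is a minimal sketch of that daily capture, assuming a scratch table created for the purpose (dbo.DailyRowCounts is a hypothetical name) and reading row counts from sys.partitions, which reports the same counts sp_spaceused does:

CREATE TABLE dbo.DailyRowCounts
(
    capture_date date NOT NULL DEFAULT (CONVERT(date, GETDATE())),
    table_name   sysname NOT NULL,
    row_count    bigint NOT NULL
);
GO

-- Capture today's row counts (schedule daily, e.g. via a SQL Agent job)
INSERT INTO dbo.DailyRowCounts (table_name, row_count)
SELECT T.name, SUM(P.rows)
FROM sys.tables T
JOIN sys.partitions P
    ON P.object_id = T.object_id
   AND P.index_id IN (0, 1) -- heap or clustered index only, to avoid double counting
GROUP BY T.name;
GO

-- Delta between today's capture and yesterday's, per table
SELECT c.table_name,
       c.row_count - p.row_count AS rows_added
FROM dbo.DailyRowCounts c
JOIN dbo.DailyRowCounts p
    ON p.table_name = c.table_name
   AND p.capture_date = DATEADD(day, -1, c.capture_date);

Scheduled daily, the delta query supplies the rows-changed figure to plug into the formula above.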


Wednesday, April 16, 2014

The importance of publishing

I presented at the Boulder SQL Server User Group recently on "Taking your Skills to the Next Level". In my slides I talk about the importance of reading, staying up to date and not stagnating. Particularly in today's IT workforce, it's easy to be overwhelmed with day-to-day operations, firestorms and "sev-1" outages and get sucked into a "just get it done and we'll refactor it to be more efficient later" mindset. While that may be productive in the short term, there are longer-term stakes involved.


Many times that attitude results in a large amount of technical debt that rarely gets "fixed later" except when something breaks. Additionally, it inhibits growth.


So it is important to know what options are available in order to find an appropriate solution. And part of knowing what those options are involves reading. But reading is not enough. It's great to know that "there's a new feature we can leverage", but has anyone ever implemented it? Has it ever been tested or validated that, yes, it will work in the company's environment?


Which brings me to the second part of the presentation: writing. Create a proof of concept (POC), write an in-house training for a lunch-and-learn series, and blog.


The problem with blogging is that, particularly in the SQL Server world, there are many brilliant MVPs who have already beaten us to the punch. Does that mean that we lesser SQL professionals don't have anything to say or dispense? We do. So instead of filling up the Interwebs and Blogosphere with our voice, I recommend starting or using an internal company blog site.


With an internal blog we, the soldiers on the front line, can espouse our views and knowledge to those closest to us while at the same time expanding our own knowledge.


So write, my fellow database administrators, developers and architects. Find your voice, refine your audience and, ultimately, expand your own knowledge and grow.

Friday, January 31, 2014

Self documentation and extended properties - Building a Data Dictionary

In many of the environments I've worked in, the amount of documentation around databases, schemas and objects has been, how shall I say it, somewhat limited. Most of the "self documentation" that I've seen has been within stored procedures and functions, explaining the purpose and reason for the code.
An example (from the AdventureWorks database) follows.

CREATE PROCEDURE [dbo].[uspGetManagerEmployees]
    @BusinessEntityID [int]
AS
BEGIN
    SET NOCOUNT ON;

    -- Use recursive query to list out all Employees required for a particular Manager
    WITH [EMP_cte]([BusinessEntityID], [OrganizationNode], [FirstName], [LastName], [RecursionLevel]) -- CTE name and columns
    AS (
        SELECT e.[BusinessEntityID], e.[OrganizationNode], p.[FirstName], p.[LastName], 0 -- Get the initial list of Employees for Manager n
        FROM [HumanResources].[Employee] e
        INNER JOIN [Person].[Person] p
            ... (code omitted for brevity)
        UNION ALL
        SELECT e.[BusinessEntityID], e.[OrganizationNode], p.[FirstName], p.[LastName], [RecursionLevel] + 1 -- Join recursive member to anchor
        ... (code omitted for brevity)
    )
    -- Join back to Employee to return the manager name
    ... (code omitted for brevity)


What about documentation for other schema objects? Are there data dictionaries? Many places I've worked, or consulted, may have some external documentation, including ER diagrams and data modeling, but that's it. I've found very little internal documentation on database objects. Ask a question like "what are all the transactional tables?" and you'll get an answer like "Go ask Bill. He designed that." There's a lot of information and documentation that exists only in someone's brain.


A recent issue at work brought this to light. We were migrating a hub-and-spoke model database to a multi-tenant model database. It's not as simple as a "backup-copy-restore" process as we needed to preserve the data in both databases. Adding to the complexity, both databases use identity columns for their primary keys, which means that when migrating to the multi-tenant DB those ID values will change. The due diligence was performed, the data was migrated and, boom, it all blew up. Emails sent to the migrated client used the wrong client template.


Eyebrows were raised and questions asked: "How did this happen?" "How can we prevent this from happening again?"


Well, there is a way, but it requires SOME knowledge of the schema. It's possible to query the metadata from the system catalog views to get the tables that have foreign keys, the tables that reference those foreign key tables, the columns referenced and the primary keys. Then, create a checksum on the non-date columns (dates tend to change frequently and will change the checksum) and compare the rows for deltas. But wait...
Problem #1 - Because the identity values changed, every row came back as a delta. That's thousands of rows to check manually.
Problem #2 - What were we actually trying to compare? That is the real issue.
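For reference, here is a sketch of the kind of foreign-key metadata query I mean. It uses only the standard catalog views and assumes nothing about our product schema:

-- For each foreign key, list the referencing table/column and the
-- referenced (potential lookup) table/column
SELECT fk.name AS fk_name,
       OBJECT_NAME(fkc.parent_object_id) AS referencing_table,
       COL_NAME(fkc.parent_object_id, fkc.parent_column_id) AS referencing_column,
       OBJECT_NAME(fkc.referenced_object_id) AS referenced_table,
       COL_NAME(fkc.referenced_object_id, fkc.referenced_column_id) AS referenced_column
FROM sys.foreign_keys fk
JOIN sys.foreign_key_columns fkc
    ON fkc.constraint_object_id = fk.object_id
ORDER BY referenced_table, referencing_table;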


For example... the Person table has a foreign key to the Address table, as each employee has an address, and the Address table has foreign keys to the State and Country tables.
In these cases, the Address table is more of a transactional table while the State and Country tables are lookup tables.
In our "Oops" example, it was the Letters and LettersTemplate tables. The task then is to "find the lookup tables, compare the values for the client (or employer) and make sure that the value in the multi-tenant database is the same as the value in the original database".


So we have a way to find child tables referenced by parent tables. But what are the values we need to compare? In some tables the column containing the lookup value is called SystemCode. In others, it's Type, or Description, or Status. There is nothing consistent and there was no documentation other than "check with ".


Enter the world of Extended Properties! This is my "go-to" tool for self-documenting a database.
Properties can be added at
  • The database or schema level (@level0type)
  • The object level, such as a table, view or procedure (@level1type)
  • The column level (@level2type)
Properties at each level have a name and a value as well.


I sat down and went through each of the tables referenced by a foreign key and made a list based on my "knowledge" of the application, table names and column names. Then I sat down with the application architect and filtered out tables that were transactional in nature (like Address).




In our case, for the @name we used a value ending in _Property to differentiate our properties from the default Microsoft-supplied properties (such as MS_Description).
Then, for the @value, it depended on what we were looking for:
  • LookupTable for lookup tables
  • LookupKey to identify the column(s) used as the foreign/lookup key for that table
  • LookupValue to identify the column(s) that contain the actual lookup values
This is what I wound up with:
EXEC sp_addextendedproperty
    @name = N'CoName_Property', @value = 'LookupTable',
    @level0type = N'Schema', @level0name = N'dbo',           -- dbo, Sales, HumanResources, etc.
    @level1type = N'Table',  @level1name = N'tblBenefitType' -- name of the table that is a lookup table
GO

EXEC sp_addextendedproperty
    @name = N'CoName_Property', @value = 'LookupKey',
    @level0type = N'Schema', @level0name = N'dbo',            -- dbo, Sales, HumanResources, etc.
    @level1type = N'Table',  @level1name = N'tblBenefitType', -- name of the table that is a lookup table
    @level2type = N'Column', @level2name = N'BenefitTypeId';  -- name of the column that contains the key value
GO

EXEC sp_addextendedproperty -- will need to be executed for each column that comprises a lookup value
    @name = N'CoName_Property', @value = 'LookupValue',
    @level0type = N'Schema', @level0name = N'dbo',            -- dbo, Sales, HumanResources, etc.
    @level1type = N'Table',  @level1name = N'tblBenefitType', -- name of the table that is a lookup table
    @level2type = N'Column', @level2name = N'BenefitType';    -- name of the column that contains the lookup value
GO


Now it's just a simple matter of querying the sys.extended_properties system view to find the lookup tables, their keys and values, and using those values in a dynamic query that generates a checksum to compare across databases.
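As a sketch of the first half of that, the following query, assuming the CoName_Property convention above, returns each lookup table with its key and value columns, ready to feed the dynamic checksum comparison:

SELECT OBJECT_NAME(tp.major_id) AS lookup_table,
       COL_NAME(kp.major_id, kp.minor_id) AS lookup_key_column,
       COL_NAME(vp.major_id, vp.minor_id) AS lookup_value_column
FROM sys.extended_properties tp        -- table-level 'LookupTable' property
JOIN sys.extended_properties kp        -- column-level 'LookupKey' property
    ON kp.major_id = tp.major_id
   AND kp.minor_id > 0
   AND kp.name = 'CoName_Property'
   AND CONVERT(nvarchar(128), kp.value) = 'LookupKey'
JOIN sys.extended_properties vp        -- column-level 'LookupValue' property
    ON vp.major_id = tp.major_id
   AND vp.minor_id > 0
   AND vp.name = 'CoName_Property'
   AND CONVERT(nvarchar(128), vp.value) = 'LookupValue'
WHERE tp.class = 1                     -- object/column class
  AND tp.minor_id = 0                  -- property on the table itself
  AND tp.name = 'CoName_Property'
  AND CONVERT(nvarchar(128), tp.value) = 'LookupTable';

A table with several LookupValue columns returns one row per key/value combination, which is exactly what a dynamic CHECKSUM builder needs to iterate over.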

Wednesday, January 29, 2014

Repost: Why I work here!

My boss posted this today on his blog: "Why aren't you here yet?" It's about our company, The Reed Group, and why you should work here.


I figured out his first 4 points long ago. It's the other 4 that helped bring me to Reed Group and to realize that yes, this IS the place to be!


One thing he forgot to mention: We innovate.


Reed Group is not just a software company. We're a technology company. What does that mean?


  • Don't just bang out code - I've been a web or application developer, a DBA and now a database architect, and I've written thousands of lines of ASP, .NET, Java and T-SQL code. In all those years, as I've written that code, I've asked myself "Is this the best way to do this?" It may solve the problem, but does it scale? That's the talent we hire: forward thinkers. Those who know not just how to solve a problem, but who know their tools intimately, know what else is out there and look at different ways and methods to solve it.
  • Technology Integration - Yes, I'm a SQL Server database architect. I know it inside and out, from the storage subsystem all the way through memory and the engine. I know what's in the current release, the previous release and what's coming in the next releases. I can make informed decisions and recommendations so that it not only meets Reed Group's current needs but carries us forward into the next 3 to 5 years. Does that mean it's always the best tool? Maybe. Maybe not. I'm not so tied to the platform that I can't find a better way to make us successful in the future. And, as architect, I ensure that all aspects of our products work seamlessly, from our core application to the data warehouse, phone/IVR system, etc. It's not just about making the best leave management software; it's also about making the best tools available to help companies manage their leaves. That means Reed Group isn't just banging out the same Software as a Service (SaaS) or web product; it's also looking at how it can work with tablets and mobile devices and at integrating leave management with normative data to provide accurate predictive models for those leaves.

Friday, December 23, 2011

The Cowboy or rogue DBA/Developer

“I don’t always test my code, but when I do, I use Production” (a parody of the Most Interesting Man in the World Dos Equis commercial).


A lot has been written about version control and the process of promoting code through environments (see Coding Horror and "When Database Source Control Goes Bad"). In many shops the DBA will find a development/test environment, then (hopefully) a QA and/or stage environment and, finally, a production environment. Along with these different environments there is (in good IT shops) a process that defines the promotion of code from one to the next. That is:

• Make changes to source code in dev/test
• Compile the code and unit test
• Promote the code to QA
• The QA group performs integration and regression testing
• QA either returns the code to dev or promotes it to UA or a stage environment
• The UA group validates functional and non-functional requirements and either returns the code to development or a Release Manager schedules a code release or go-live date to perform the final promotion

For non-scripted code (C++, C#, Java, etc.), where the source code must be compiled into object code or an executable, promotion is straightforward and limits the damage a developer can do from one environment to the next. That is, in their development or test area, their C++ code is compiled and then unit tested. The SOURCE code is not promoted; the compiled bits are. Even if the source is moved to each environment, the ability to compile is limited, and so is the ability to effect a more permanent change. This prevents code from being modified outside of the release cycle and environments from getting out of sync.



For scripting environments (and I consider SQL Server such a case) it is much more difficult. Even with a process in place as listed above, due to the nature of T-SQL, it is easy for someone with sufficient privileges to make a change directly to the database. For example, consider a complex procedure that creates and loads temporary tables for some type of row-by-row processing (another article down the road) and one of the temp tables is missing an index. In the development environment it may have been missed and not been a performance issue due to the relatively low number of records in the table. However, when promoted to a QA or UA environment, and yes, even production, where datasets are often much larger, a performance issue is encountered. Someone recognizes that the Orders table needs a nonclustered index on the OrderDate column because the query is performing a table scan and reading every row of the 500-million-row table.
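The "fix" itself is a one-liner, which is exactly what makes it so tempting to apply out of band. A hypothetical version, using the illustrative Orders/OrderDate names from the example above:

-- Hypothetical quick fix applied directly in QA (or production) instead of
-- going back through dev and source control
CREATE NONCLUSTERED INDEX IX_Orders_OrderDate
    ON dbo.Orders (OrderDate);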



At this point a defect should be raised and the code sent back for remediation. This is where the rogue or cowboy DBA/developer comes into play. Rather than follow the prescribed process, the index is added, the test is “passed” and the code is promoted.



Now there’s a problem. Even if the environments are source controlled, the underlying databases are now out of sync. The index is not in source control and it’s not in the development environment. And in the “worst case” scenario, the release manager gives the green light for a release date and the code gets promoted to production without the necessary index. In the meantime, a patch or additional feature release gets promoted to QA. Remember, the change was made outside of development and source control, so the required index is not included. Therefore the code promotion overwrites the “fix” that was put in place. And then guess what? That’s usually the time production, to put it mildly, “goes to shit”. The web cart is not responding, pages are timing out and some exec is screaming “Heads will roll! How did this mess get promoted? Who approved this?”



Since each environment now appears to be in sync again, it’s hard to find the breakdown. Even worse, there’s still the possibility of the cowboy DBA doing one of the following (which usually involves making direct changes to production):

• Remembering that they added the index to QA, they quietly go into production and add it there in order to cover their tracks (yes, I’ve seen it done)

• Some other cowboy DBA recognizes that there’s a missing index and decides to perform the quick fix themselves



In both of these cases, we wind up with unsynchronized environments again. The “fix” is in production but not in version control, or in any lower environment. Therefore when the next promotion occurs, it is likely that whatever change was made in production will be overwritten again.



How can you combat the cowboy DBA? The simplest solution involves locking down each environment. That is, only the QA DBAs can make changes in QA. Only production or operations DBAs can make or promote changes in production. This has a few limitations. First, it requires additional resources. The reality is that usually one or two DBAs are responsible for all the environments from dev through production, so the likelihood of the cowboy DBA is higher.

Since most shops are small and companies can’t (or won’t) shell out for additional resources, other solutions are required. Database Audits and Audit Specifications can be a big help. While they can’t prevent changes altogether (after all, SOMEONE has to make changes), they do provide tools to see WHO is making changes. They can be used to log who is modifying live data and who performed a DDL operation (ALTER TABLE, ALTER/CREATE PROCEDURE, etc.). When these events don’t correspond to release dates, there is data to “gently reprimand” the offending cowboy for violating the prescribed release methods.
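For example, here is a minimal sketch of an audit that records schema changes; the audit name, database name and file path are placeholders, not from a real system:

USE master;
GO
CREATE SERVER AUDIT Audit_SchemaChanges
    TO FILE (FILEPATH = N'C:\SQLAudit\');   -- placeholder path
GO
ALTER SERVER AUDIT Audit_SchemaChanges WITH (STATE = ON);
GO

USE MyAppDB;   -- placeholder database name
GO
CREATE DATABASE AUDIT SPECIFICATION AuditSpec_SchemaChanges
    FOR SERVER AUDIT Audit_SchemaChanges
    ADD (SCHEMA_OBJECT_CHANGE_GROUP)        -- logs CREATE / ALTER / DROP on schema objects
    WITH (STATE = ON);
GO

Reading the audit file then gives the who/what/when needed to match DDL events against release dates.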



Monday, April 18, 2011

SQL Saturday 74

Update: Too much travel. Don't know where I am or what week it is. SQL Sat 74 is not THIS coming weekend but the FOLLOWING, April 30th.

Had a blast at SQL Saturday 63 in Dallas on the 2nd. Thanks to everyone who gave both positive and negative feedback. Taking it to Jacksonville this weekend and looking to take it up a notch! See you there!

What is this cache and how did it get so big?

Friday, March 4, 2011

SQL Server Denali tidbits

So yesterday I was using some DMVs to check the space allocated in tempdb for user objects, internal objects and the version store, the sys.dm_db_file_space_usage view in particular. With it you can see unallocated page extents and how many pages are used by user objects, internal objects and the version store. On a whim I ran the same query on my Denali instance and got two extra columns: total_page_count and allocated_extent_page_count. Ahhh, this is even more useful! Note that the SQL Server 2008 version ONLY returns page usage for tempdb, regardless of the database it is executed from. That is fine when I want to see which user objects (temp tables, etc.) and internal objects (spools, hash joins) are consuming tempdb.
However, with SQL Server Denali, sys.dm_db_file_space_usage works on ANY database.
Use AdventureWorksDenali
GO

select * from sys.dm_db_file_space_usage

Use tempdb
GO

select * from sys.dm_db_file_space_usage
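A quick sketch of the kind of free-space check the new columns make possible in any database (column names as they appear on my Denali instance; pages are 8 KB):

SELECT SUM(total_page_count) * 8 / 1024 AS total_mb,
       SUM(allocated_extent_page_count) * 8 / 1024 AS allocated_mb,
       SUM(unallocated_extent_page_count) * 8 / 1024 AS unallocated_mb
FROM sys.dm_db_file_space_usage;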

So make the most of this view and keep an eye on what's being allocated in your databases, particularly tempdb.