Interview With SQL Server MVP Louis Davidson: Author Of Pro SQL Server 2005 Database Design and Optimization
Louis, the first time I came across your name was in the Microsoft public SQL Server programming group. Can you explain to us why you participate in newsgroups and forums?
I have been involved with newsgroups since back in college when I was an avid reader of rec.music.beatles. And when I found the SQL Server groups (back then, before Microsoft’s groups it was comp.databases.ms-sqlserver), I started reading them, and finally giving some feedback. I then spent time in the newsgroups trying to answer people’s questions to help them, and help me extend my knowledge past school and technical books; because the real situations that people get into are more interesting than anything you can get in a textbook.
I quit hanging around for quite a while because the public newsgroups (just got too doggone noisy. I mean come on, there is only so often that one can have certain, “personal” products sold to them.
But about three years ago, I went to a session with Kalen Delaney (who was then, and is now one of the people who I look up to greatly,) who was talking about things to do to promote your books. I started then trying to answer three questions a day (or more) on the Microsoft newsgroups (instead of the public ones) and now on the MSDN Forums (http://forums.microsoft.com/MSDN/ShowForum.aspx?ForumID=85&SiteID=1).
I don’t know that it helped my book sales in any overwhelmingly large way, but it certainly has helped me in my career. I find that so many things I have learned while helping others end up being useful at my day job. Plus, my philosophy in regards to SQL Server is simple: the more people who like SQL Server the more likely that the product will be around when I retire in 30-40 years.
I have noticed that you have explained the 12 rules of Codd, do you think that it is important to know these rules as a database developer?
Absolutely, if you don’t know why you do something, then it is just too easy to think that it isn’t worth it. And you can’t really bend the rules until you know them. Codd’s rules are the foundation that the RDBMS was built upon, which the begat normalization. Once people get that SQL Server is essentially a tool largely created to manipulate and serve up normalized data (a concept that was built upon the original 12 rules) it actually becomes essential to normalize, not just something that some square geezer is telling you that to do.
Why did you decide to write technical books?
1. So I would have a reference to refer to when I need to recall something.
2. Because I hate having free time.
Seriously, I had really done quite badly in my senior level database design class in college, but good enough (I think I made a C.) So when I was thrust into the database programmer role, I wanted to learn how to do it right (something my father had not so subtlety driven into my brain years before.) What made things even better was that I had several mentors who believed in doing things right during my first six or seven years of programming, so I learned essentially only good practices, but I still wanted more. I couldn’t find a book on database design that was not written as a text book, so I suggested a book on database design to the editors at WROX, whom I was working for as a tech editor on a Visual Basic/SQL Server book. I expected to tech edit this design book, or possibly write a chapter or two, but before I knew it they had contracted me to write the entire book (Chuck Hawkins wrote a chapter for me, but I wrote the rest) and paid me what seemed like a ton of money (of course, if I average it out to an hourly wage, I averaged far less than minimum wage when all was said and done.)
The process of writing the book was an amazing journey, one because I had never written a paper greater than like 15 pages (my degree is in CS, so all of our papers were really quite small in comparison) and I had to learn a good deal of stuff along the way because I was not really an “expert” at the time (and the more I know, the less I feel like an expert now!) The thing about writing is that you have to be more correct than in almost any medium. When you speak on a subject, if you say something, the people in the audience go, “whu?” and if you talk fast enough and right enough, they will likely forgive you (as long as you don’t do it again.) But if you mess up in writing in a book, it just hangs there over your head forever.
When it was over I had something I was proud of and I said what every technical book writer I have talked with says: “NEVER AGAIN! Too much work, not enough gain, I could make more money cleaning out stalls at the local pet store!!!” I have now said more or less the same thing three times now, and I hope to say it again around the time Katmai ships.
Which chapter was the hardest to write and can you explain why?
Well, the ones that were the hardest were actually the easiest, since I got the Kevin Kline and Kurt Windisch to write them. These guys were excellent co-authors, and did great work. Kurt’s section on CLR programming is great, with some awesome example code that I often use when I need to build a CLR function. Kevin wrote a chapter about cross-platform development that rounded out the book.
Of the chapters I wrote, the most troublesome was probably the security chapter. Security is a very complex topic and there are many, many ways to enforce security. Even worse, there are too many ways to get around it once you have it set up. It is the one chapter that I am not completely happy with, and the one that I plan to do some writing on again in the next few months, most likely after the holidays, during which I hope to do very very little that isn’t concerned with eating, sleeping, or going to Disney World.
What is the audience for this book?
Well, with this being my third book I was kind of hoping that my audience had grown to Dogbert
-like proportions, but I have to be reasonable. I would simply be satisfied if I could get every SQL Server programmer to read it (and buy it!) The overall goal was to make a practical guide going through the boring bits (theory, modeling), the stuff you hear about but rarely really understand (normalization) and to include the more practical topics like constraints, triggers, security, etc that even people who have years of experience would like.
I think that most people will get something from the book, from the newbie to the pro. Of course the payoff will be far less for someone who already knows the theory, but there is something in there for everyone that will at least be worth your 60 bucks (which is the full retail price. Don’t pay full retail price!)
I also try to keep it as light as possible, considering the very dry topic. There are several little humorous bits thrown in, and for fun I added a hidden message for a contest to run later (actually going on right now in the end of 2006.) All of the clues are leading to a message that I “encrypted” into a bit of code in the book (and yes, that is bonus clue for reading through this really long interview!) for a bit of fun.
Ultimately, I want the book to be something that I would purchase in each version at the very least for the practical aspects of the book. You might think that since I wrote the book I wouldn’t refer to it, but I often use the templates for triggers, procedures, error handling, etc from the book whenever I need to write one without using some tool.
Which SQL Server books are on your bookshelf?
Well, to tell the truth, I have quite a few on my desk and no time to read them. I have thumbed through Itzik’s book, “Inside Microsoft SQL Server 2005: T-SQL Programming”, and it is pretty darn good. Dejan Sarka, who was one of my technical editors worked on that book too. Beauchemin and Sullivan’s “A Developer’s Guide to SQL Server 2005” seems pretty good, and I have used Tachev’s book, “Applied Microsoft Analysis Services 2005” several times to try to figure out how different Analysis Services was in the 2005 version.
And I would be remiss to not mention Kalen’s previous “Inside SQL Server 2005” book. What a great reference that book was. I can’t wait to get my hands on her new books (not that I would have the time to read them either!)
What are/could be some consequences if your data model is not in Third Normal Form or even First Normal Form?
The earth could stop spinning on its axis, at least that is what I try to tell people. The more reasonable answer is that you will have nagging, annoying issues with your data that will require more and more management just to keep straight.
Normalization pushes data consistency issues to the foreground and forces you to deal with them. For example, take first normal form. One of the rules of first normal form is that every row is unique. A constant question on newsgroups/forums is: “I have duplicate rows, how do I get rid of them?” Well, go through this following messy set of steps, and then PROTECT YOURSELF USING KEYS!
Another concern of first normal form is scalar attributes. I like to define this as keeping your data broken down to the lowest form you will manipulate using SQL. So if you have to use substring for any common operation, they you likely have an issue. For example take an address, if you use addresses just as the user types them in, then just have a text column, put the address in it. But if you have a mailing system and you have to cleanse the data, and then break the address down into all of its constituent parts instead of going through the process every time you need an address. SQL is great for taking bits and pieces and constructing what you need, but lousy at breaking things down.
For third normal form, the problems are more ominous. If the data in the table isn’t really describing the thing that the table is modeling, you will end up with redundant data. Redundant data doesn’t sound so bad, until you have need to update the data. Consider the case where you have a table modeling a person, and in this table you have the person’s manager, the manager’s name and address. From this sentence, it almost sounds plausible and correct. It certainly is something you might want to see on a screen. But is the manager just the manager for the one person? If not you might have ten copies of the managers information. Now you have to make sure and deal with all ten copies, which is a real pain, to say the least.
How did the massive changes between SQL Server 2000 and SQL Server 2005 affect the research for your book?
The beauty for me was that the truly massive changes made little difference. Since the book I have written is about database design, things haven’t changed so incredibly much from that angle in any version of SQL Server. SQL evolves and gets better and there are optimization changes, but the basics remain the same. Fortunately as machines get more powerful we need to tweak performance less and less and we can just focus on data integrity and usability through normalization.
The biggest differences that I am interested in from an authoring standpoint have been in the ability to do more with T-SQL (I wrote the T-SQL for Developer’s chapter in the Pro SQL Server 2005 book for Apress), and now using the CLR functions. I had Kurt Windisch (a friend who was a board member for PASS at the time) write a very large CLR section, and I am so glad. Now I have a nice reference and excellent code samples for CLR objects. It is one of the great things about a book like this in that it is not overview book where you simply give the new features and come up with some reason you might want to use the features. If the feature isn’t tremendously useful, we can say that it isn’t useful. I think we got it pretty much right by saying that functions are really the most useful part of the new CLR objects, as well as to a lesser extent aggregations.
I admit that I don’t cover XML, something that I lament occasionally. I had waffled on including it for months because I just don’t feel that it is that important to OLTP systems, but on the other hand it could certainly be useful to implement a set of sparse properties (a topic I do touch on) and had arranged for another coauthor set up to write me a chapter about XML, but they dropped out too late to get another one. Even today I am considering commissioning an “add-on” chapter about XML to send out to readers (if you want it, vote with your email to email@example.com and your wallets by purchasing the book so I can afford to pay someone. )
Name three things that are new in SQL Server 2005 that you find are the most valuable?
- Dynamic Management Objects. Wow, I can’t even begin to start on just how great these are. The best one to me is outlined in this blog entry: http://drsql.spaces.live.com/blog/cns!80677FB08B3162E4!1443.entry . Automatic index tuning is awesome. I have used the basis of this system to optimize a third party system in excellent ways using the number two feature.
- Included columns on indexes. These have really made it easier for dba’s to optimize queries by building covering indexes for queries without bloating indexes to death. What’s more, because you aren’t limited to 900 bytes, you can cover some queries in larger, more unwieldy systems that you just couldn’t do before. They aren’t the end all as they still have to be maintained, but they are certainly excellent in many situations. We have used them extensively to optimize a third party system that was very heavy on reads to cut down tremendously on full table scans.
- The new BI Paradigm. The attribute oriented nature of Analysis Services is just plain cool. It gives users this extremely rich set of possibilities for viewing data using their cubes that is just so much better than they had in 2000. It was a tremendous change in gears though, and I am very glad that the fundamentals of dimensional modeling didn’t change, so the Kimball paradigm data warehouse we had was still quite viable using either platform, just easier to use.
Name three things that you wish would be in Katmai (the next version of SQL Server)?
I should have finished this interview a few weeks earlier, and I could have spoken a bit more freely about this subject, but at this point I am under NDA for what I have heard about Katmai. What I will say is that I hope to see as much work done in the relational arena (T-SQL and the engine) as they do for the other areas like XML, or tools. The heart and soul of SQL Server should always be right there, lurking in the name: SQL. All of the other new stuff they have added over the years has made things better for us, but I would have traded many, many non relational features to have had the ROW_NUMBER() and RANK() functions in an earlier version.
Of course, this is just my very selfish view on things as a Data Architect/Relational Programmer. In this particular case, even more than normal, I particularly speak only for me.
How critical is a proper data model for a good performing database?
Freaking essential! This is like asking how important is it to know the race route to win the race. The SQL Server engine is amazingly good at serving up data, but it is up to the implementer to know what data is needed and to put the data into a format that works the way that SQL Server works.
I would also be remiss if I didn’t take this one step further. How can you build anything without design? Not one programmer would call up a contractor and say “build me a house, three bedrooms, two baths” and then just let the contractor sketch out a rough drawing and then start working would they? And while I realize that it seems that software is a lot more pliable than building a house, once you have built the foundation classes, the database, and started a UI, rework can be just as costly as running a bulldozer though a regular house, and on a particularly a large software project, far more expensive.
What are some of the biggest mistakes that people make when designing a database?
I did something this for Greg Low’s really cool SQL Down Under podcast (where you get a good idea that I am terrified of speaking publicly) but lets just boil it down to two here:
- Treating database design like application design - The reason why the concept of a distinct business layer is so intriguing to me is that we can transform the user’s views/needs from the GUI through this layer to the data layer. Each layer of the application has it’s own needs (UI, Object Layer Representation, Data Access Layer, Stored Procedure Layer, Database Structures) each very important in it’s own right. More importantly, each has very different design goals.
- Forgetting about the future – During the design phase of any project, changes are very, very cheap. So if you realize you want to have a table with a domain of values instead of a simple varchar column, it is as simple as adding a rectangle to your model and connecting rectangles. Same goes with the discovery that the cardinality between two of your tables was not what you originally thought (“oh, a person is allowed to have more than one phone number in this system.”) However, as time passes, and more code is written, and more data is added to the system, the costs go up almost exponentially. Of course when the project manager has saddled you up and is giving you the crop, demanding results (that is their job after all) it can be all too easy to simply ignore the future and start coding. Always think about Future-You and how annoyed he is going to be when he realizes that had you just spent another hour three months ago, two weeks of work (of course corresponding to Future-You’s vacation to Disney World) could have been avoided.
There are so many others, bad naming, poor constraint usage, under-using stored procedures, ignoring security, weak testing, and so on and so on. But I think I can safely say that if you start with these two concepts, a great number of these bad habits will correct themselves.
Can you list any third party tools that you find useful to have as a SQL Server developer/admin?
I have become quite enamored with the Red-Gate tools, especially the database compare tool. I use it quite often to compare databases to find changes that have been made, especially for indexes and stuff done for tuning a database. The Apex tools are pretty good too.
The one tool I have used a lot lately has been the Automatic Indexing stuff from the query optimization team: (http://blogs.msdn.com/queryoptteam/archive/2006/06/01/613516.aspx). I am working right now on optimizing a homegrown/third party “object oriented”-esque tool that has a great deal of object abstraction built in, and have found tremendous performance gains from some really esoteric indexes this thing added to some of the queries that I would have never thought of in a month of Sundays. I wrote about it in more detail here: (http://drsql.spaces.live.com/blog/cns!80677FB08B3162E4!1443.entry and yes, I know this is the second time I have mentioned it, it is that important!)
In general, I am not so much a tool guy as I am a script guy. I use the UI to script out objects quite often, but then I try to edit everything with T-SQL so I know how it is done. The people who get too comfortable in the wizards or GUI often forget or don’t learn all of the tricks that you can do with the raw SQL commands, because it would just be too hard to build the perfect UI. So they give you 50% in the wizards and 90% in the GUI, and the rest is “advanced” stuff. If you don’t know about the nuts and bolts, you might be tempted to say “SQL Server doesn’t…”.
Lastly I have always been a big fan of ERwin for modeling. It does a really good job of representing my structures, and I like their macro language for building very customized scripts to build tables/views/procedures/etc.
What is the biggest Database in GB/PB that you have ever worked with, how many rows was your biggest table ever?
It was about ten feet across, six feet high. I had just started my new job as a LAN administrator and occasionally I had to fill in for the night operator. We had these 14 inch across reels of something that looked like a reel to reel tape, but they….wait, do you mean gigabytes and petabytes? I was thinking Great Big, or Pretty Big.
Well definitely no petabytes, and only recently have we started thinking in terms of terabytes, some of which is due to the fact that we have started storing everything as Unicode, essentially doubling the size of the database. We probably have around a half of a terabyte in our OLTP system and Data warehouse combined. I am never too excited about the size of a database as I am of the number of rows in a key table. You could have 10 billion rows and a very small database because it was designed nicely, or to the contrary, a million rows and have a terabyte if you just stored all of your data in text blobs (which is a favorite kind of storage for too many people.)
The largest database I designed I never got to see come to fruition because our company went through a couple of downsizing “events”, which I was part of the next to last one. I only know it was large because a prior coworker of mine called me up one day and told me we had gone past the 2 billion row mark and they were to changing to use bigint rather than basic int types for the surrogate keys. Tons of rows, but really they were quite small rows. We were keeping information on Internet equipment that the company had deployed. It was a pretty cool system that I wish I could see in action today (if it is still being used. The concept was neat, but as the implementers we were never 100% sure if the data would be useful later in the process.)
When did you start blogging and why?
Blogging is possibly the biggest time sink on earth really. When I am working a “normal” 40 or so hour week in my day (read: paying) job, it is easy to spend an hour or so a day blogging, but when my job squeezes my time, it gets really hard.
But the most exciting part of blogging is the incredible active reference it gives me, in a format that really works for me. I can also use more of a shotgun approach and just write about what I want to write about on any topic, SQL or otherwise, rather than just writing about database design exclusively.
Over the years, I have tried creating libraries of code on my laptop, but that never has seemed to work for me. As I have aged (a little) I have started to realize that commenting, blogging, writing, etc is not for you, it is for Future Me and other people. I have so much code on my laptop from the past 10 years that I don’t know what the heck to do with it. I have VB6, SQL from past ages, failed article attempts, diagrams, etc all over my projects directory. So Past Me (that jerk!) is punishing me with his laziness, something I now vow not to do to Future Me. Confused? Good.
I started to realize that my blog had gotten out of control with so much information that I could no longer search and find when I found OfficeLive.com, a Microsoft site that allows you to create a free website for your small business. So I created drsql.org as a reference for my books, presentations, and my blog. On there you will find a decent enough index to my blog entries, usually within a month of being up to date, as well as a list of about ten articles that I can get to from the front page just about instantly.
About the book:
By Louis Davidson, Kevin Kline, Kurt Windisch
Published: May 2006
Table Of Contents
CHAPTER 1 Introduction to Database Concepts
CHAPTER 2 Data Modeling
CHAPTER 3 Conceptual Data Modeling
CHAPTER 4 The Normalization Process
CHAPTER 5 Implementing the Base Table Structures
CHAPTER 6 Protecting the Integrity of Your Data
CHAPTER 7 Securing Access to the Data
CHAPTER 8 Table Structures and Indexing
CHAPTER 9 Coding for Concurrency
CHAPTER 10 Code-Level Architectural Decisions
CHAPTER 11 Database Interoperability
APPENDIX A Codd’s 12 Rules for an RDBMS
APPENDIX B Datatype Reference
Ch. 02 - Data Modeling