Basics of Database Design

Author

Matthew Jannetti

Published

August 22, 2023

Goal

This portion of the workshop is to give you some grounding in database design principles and describe how to accomplish them.

Outline

  1. Normalization
  2. Keys
  3. Constraints
  4. Data Defintion and Maniuplation Language
  5. Views

Normalization

The process of turning a flat dataset into a series of tables. The primary goal of normalization is to reduce redunancy in the data and increase consistency.

One big advantage of this consistency is that whenever you need to change the data you only need to update the data once in one place.

Let’s take a look at an example of how this might work.

Example

In our database we have tract information and weather information. If we have that information in a flat file prior to processing it for the database it might look something like the table below.

GeoId LowPop PhiladelphiaTract Year Month AvgTemp MinTemp MaxTemp
0 42101000100 0 1 2006 April 14.880681 9.247916 20.515867
1 42101000100 0 1 2006 August 26.294310 21.673769 30.915272
2 42101000100 0 1 2006 December 6.918378 2.775594 11.061740
3 42101000100 0 1 2006 February 3.441863 -0.572194 7.456766
4 42101000100 0 1 2006 January 5.330633 1.094335 9.568929
5 42101000100 0 1 2006 July 27.479420 22.684456 32.275227
6 42101000100 0 1 2006 June 23.696087 18.939714 28.453455
7 42101000100 0 1 2006 March 8.066606 3.266680 12.867531
8 42101000100 0 1 2006 May 18.726263 13.191011 24.262665
9 42101000100 0 1 2006 November 10.846342 6.812098 14.881163
10 42101000100 0 1 2006 October 14.163409 9.614525 18.713562
11 42101000100 0 1 2006 September 20.172976 16.086805 24.260143
12 42101000100 0 1 2007 April 10.428053 5.957427 14.899525
13 42101000100 0 1 2007 August 24.970825 20.111498 29.831573
14 42101000100 0 1 2007 December 3.296346 0.263161 6.331954
15 42101000100 0 1 2007 February -2.388883 -6.133591 1.355403
16 42101000100 0 1 2007 January 3.623262 -0.261454 7.508401
17 42101000100 0 1 2007 July 25.410351 20.846806 29.975315
18 42101000100 0 1 2007 June 23.762081 18.822567 28.702440
19 42101000100 0 1 2007 March 6.773147 1.507581 12.041290
20 42101000100 0 1 2007 May 19.042379 13.317002 24.769484
21 42101000100 0 1 2007 November 8.053829 4.026049 12.082455
22 42101000100 0 1 2007 October 18.499706 13.852238 23.148750
23 42101000100 0 1 2007 September 22.557858 17.526993 27.589718

This is the weather information for one census tract in the years 2006 and 2007. The weather information is defined at the month level and the tract information is defined at the tract level. Because there are two different levels, the tract information is repeated for every month.

If we store the data in this way, changing the tract information can become onerous. Let’s say that tract 42101000100 was misclassifed, and that the value of LowPop for that census tract should be 1 instead of 0. To update that information in this table we would need to update each row. Missing even a single row would add inconsistency to the data.

This becomes more difficult the more data you have. Let’s add the crime data to our example.

GeoId LowPop PhiladelphiaTract Year CTVIOLCN CTPROPCN
0 42101000100 0 1 2006 158 327
1 42101000100 0 1 2007 126 342

The crime data is at the year level. And so now, someone updates the tract information in the weather data but not the crime data, we have an inconsistency.

The normalizaton process will try to solve this problem.

There are different levels of normalization. The more normalized the data the more consistent your data becomes but go too far and the database becomes unweidly to use. Most databases target the “3rd Normal Form” which essentially means that every important “level” is attached to a key.

Each “level” is defined by it’s ability to uniquely identify data (known as candidate keys).

In our example, the candidate keys are as follows: 1. GeoId for the tract data 2. GeoId, Year, and Month for the weather data 3. GeoId and year for the Crime data

So we split this up into 3 different tables, each with an ID that can be used to join with other tables

Tract Table

TractId GeoId LowPop PhiladelphiaTract Year
0 24203 42101000100 0 1 2010

Weather Table

WeatherId TractId Year Month AvgTemp MinTemp MaxTemp
0 36964 24203 2006 April 14.880681 9.247916 20.515867
1 36968 24203 2006 August 26.294310 21.673769 30.915272
2 36972 24203 2006 December 6.918378 2.775594 11.061740
3 36962 24203 2006 February 3.441863 -0.572194 7.456766
4 36961 24203 2006 January 5.330633 1.094335 9.568929
5 36967 24203 2006 July 27.479420 22.684456 32.275227
6 36966 24203 2006 June 23.696087 18.939714 28.453455
7 36963 24203 2006 March 8.066606 3.266680 12.867531
8 36965 24203 2006 May 18.726263 13.191011 24.262665
9 36971 24203 2006 November 10.846342 6.812098 14.881163
10 36970 24203 2006 October 14.163409 9.614525 18.713562
11 36969 24203 2006 September 20.172976 16.086805 24.260143
12 57748 24203 2007 April 10.428053 5.957427 14.899525
13 57752 24203 2007 August 24.970825 20.111498 29.831573
14 57756 24203 2007 December 3.296346 0.263161 6.331954
15 57746 24203 2007 February -2.388883 -6.133591 1.355403
16 57745 24203 2007 January 3.623262 -0.261454 7.508401
17 57751 24203 2007 July 25.410351 20.846806 29.975315
18 57750 24203 2007 June 23.762081 18.822567 28.702440
19 57747 24203 2007 March 6.773147 1.507581 12.041290
20 57749 24203 2007 May 19.042379 13.317002 24.769484
21 57755 24203 2007 November 8.053829 4.026049 12.082455
22 57754 24203 2007 October 18.499706 13.852238 23.148750
23 57753 24203 2007 September 22.557858 17.526993 27.589718

Crime Table

CrimeDataId TractId Year CTVIOLCN CTPROPCN
0 22838 24203 2006 158 327
1 22839 24203 2007 126 342

Now, if you need to change tract information or the number of property crimes in a tract and year, you only need to change the data in one place and when people run queries, that updated data will propogate throughout all future datasets derived from the database automatically.

Keys

Keys are sort of like identifiers. We’ve already briefly touched on one such key, the candidate key.

A candidate key is defined as the minimum amount of information that you need to uniquely identify an entity in a table. So to identify each unique instance of weather data in the weather table we need to know the Tract, Year, and Month of the weather data.

A table can have multiple candidate keys. For instance if you collect data about individual people and you collect SSN and email, you might be able use either of those to identify the indiviuals uniquely.

There are two other remaining types of keys:

  1. Primary Keys
  2. Foreign Keys

Primary Keys

Primary keys are just a candidate key that you select to be a table identifier in your database. If you have both SSN and email, they are both candidate keys. When you define your table, you select SSN to be the primary key. There can only be 1 primary key per table. A primary key cannot contain NULL values and must be unique.

Foreign Keys

Foreign keys are a type of constraint, a limit on the data that can appear in a given column. A foreign key is a column that references another column (most commonly of another table). Values that appear in the foreign key column must also be in the reference columen (usually the primary key of another table)

The combination of Primary Keys and Foriegn keys form the basis of relationships between tables in a database. Looking for these in a database can tell you a lot about the structure of the database and how you should query it.

Constraints

One other big component of database design are constraints. As mentioned earlier, constraints are limitations that you can place on your data. All the data in a given column or columns must satisfy this constraint. If not you will not be able to insert that data.

There are several types of constraints in SQL, here are some common ones:

Constraint Definition
NOT NULL Column cannot contain NULL values
DEFAULT If there is no value for a column provided on an insert, the default value will be inserted
UNIQUE There can be no repeat values a column or combination of columns
CHECK Defines specific values that are allowed to be in the column. No other values are allowed

The Schema

The schema is everything that defines the structure and data that appears in the database. Everything that we talked about above is part of the schema.

One last part of the schema to discuss is the column definitions.

Columns

When you create a table and it’s columns, you will provide it with a data type. This data type defines how the data is treated and what data can be inserted into it. A numeric column cannot contain string data. You can find the various data types for SQL Server here.

Data Definition and Manipulation Language

Data Defintion (DDL) and Data Maniuplation (DML) Language are two subsets of SQL that focus on creating, inserting, deleting and updating data. Whereas, Data Query Language (DQL), the subset focused on querying and pulling data, is what the first two sessions of this workshop foucused on.

DDL

Creates, deletes, or alters tables.

The most common DDL commands are DROP TABLE and CREATE TABLE

DROP TABLE deletes or removes a table from the database. The syntax is very simple:

DROP TABLE [tablename]

Depending on your database settings, this table will not be recoverable after deletion. Also the drop will fail if doing so will violate a constraint, like a foreign key.

CREATE TABLE is more complicated. It defines the table name and columns. I have supplied an example of two create table statements below that provide examples of numerous constraints and data types.

CREATE TABLE Address
    (
       AddressId               INTEGER PRIMARY KEY IDENTITY NOT NULL,
       StreetNum               INTEGER NOT NULL,
       Street                  NVARCHAR(60) NOT NULL,
       Unit                    NVARCHAR(25),
       MultiUnit               BIT NOT NULL,
       RandomUnit              BIT DEFAULT 0,
       UnitChosen              NVARCHAR(25),
       TotalUnits              INTEGER,
       NeedsReturn             BIT DEFAULT 1 NOT NULL,
       Flag                    NVARCHAR(200),
       ReturnedLetterCode      NVARCHAR(2),
       FinalOutcome            INTEGER,
       FinalOutcomeCompleted   BIT DEFAULT 0,
       RemovedBeforeMailing    BIT,
       SamplingRound           INTEGER,
       Zone                    INTEGER,
       Route                   INTEGER,
       AM                      BIT NOT NULL DEFAULT 0,
       PM                      BIT NOT NULL DEFAULT 0,
       SAT                     BIT NOT NULL DEFAULT 0,
       Anytime                 BIT,
       Wave1_Visited           BIT,
       Wave1_Targeted          BIT,
       Wave1_Survey            BIT,
       Wave2_Visited           BIT,
       Wave2_Targeted          BIT,
       Wave2_Survey            BIT,
       cwk_id                  NVARCHAR(50),
       valid_sample            BIT DEFAULT 1,
       In_PN                   BIT DEFAULT 1,
       UniqueAddrIdx           AS CONCAT(StreetNum, LOWER(Street), LOWER(ISNULL(UnitChosen, ISNULL(Unit, 'unit num')))),

        CHECK ((RandomUnit=0 AND UnitChosen IS NULL)
                OR (RandomUnit=1 AND UnitChosen IS NOT NULL)),
        CHECK (ReturnedLetterCode IN ('VA', 'NDAA', 'ANK', 'NSN', 'STUDENT', 'REF', 'No Apt Num', NULL)),
        CHECK ((Unit IS NULL AND MultiUnit=0) OR (Unit IS NOT NULL AND MultiUnit=1))
    );

    CREATE UNIQUE INDEX UQ_address_idx ON Address (UniqueAddrIdx);

CREATE TABLE Visit
    (VisitId INTEGER PRIMARY KEY IDENTITY NOT NULL,
    AddressId INTEGER NOT NULL,
    DateVisited DATETIME NOT NULL,
    Surveyor1Id INTEGER NOT NULL,
    Surveyor2Id INTEGER NOT NULL,
    Outcome NVARCHAR(50) NOT NULL,
    DidScreener BIT NOT NULL DEFAULT 0,
    Kids NVARCHAR(50) NOT NULL,
    Notes NVARCHAR(300),
    DateEntered DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    Scheduled DATETIME,
    NeedsReview BIT NOT NULL DEFAULT 0,
    AddedBy NVARCHAR(50),
    Attempt BIT NOT NULL DEFAULT 1,
    FinishedReview BIT,

    CONSTRAINT FK_Address_Visit FOREIGN KEY(AddressId) REFERENCES Address (AddressId),
    CONSTRAINT FK_Surveyor1_Visit FOREIGN KEY(Surveyor1Id) REFERENCES Surveyor (SurveyorId),
    CONSTRAINT FK_Surveyor2_Visit FOREIGN KEY(Surveyor2Id) REFERENCES Surveyor (SurveyorId),

    CONSTRAINT CK_Visit_DateVisited CHECK(DateVisited <= CURRENT_TIMESTAMP),
    CONSTRAINT CK_Visit_ScheduledOutcome CHECK   (Scheduled IS NOT NULL AND Outcome = 'Scheduled Survey' OR (Scheduled IS NULL AND Outcome <> 'Scheduled Survey')),
    CONSTRAINT CK_Visit_KidsOutcome CHECK ((Kids = 'Did not speak with anyone'
                                                  AND Outcome IN ('Not Home', 'Non-Residential', 'Non-Existent', 'Housing Type Ineligible', 'No Matching Unit', 'Partial Address', 'Unable to Access', 'Vacant',
                                                                        'Follow Up Letter', 'Missing Visit Info', 'Letter Response', 'Random Selection Form'))
                                                  OR (Kids = 'Spoke with someone but did not confirm'
                                                      AND Outcome IN ('Not Interested', 'No Adult English Speaker', 'No Caregiver Available', 'Follow Up Letter', 'Missing Visit Info', 'Letter Response'))
                                                  OR (Kids = 'Confirmed no kids live in home'
                                                      AND Outcome IN ('Not Eligible', 'No Caregiver Available', 'No Adult English Speaker', 'Missing Visit Info', 'Housing Type Ineligible', 'Letter Response'))
                                                  OR (Kids = 'Confirmed kids live in home'
                                                      AND Outcome IN ('Did Survey', 'Not Interested', 'No Caregiver Available', 'No Adult English Speaker', 'Unable to Complete Survey', 'Follow Up Letter',
                                                                            'Scheduled Survey', 'Missing Visit Info', 'Letter Response'))
                                                  OR (Kids = 'Missing Visit Info')
                                                  OR (Outcome = 'Missing Visit Info')),
    CONSTRAINT CK_Visit_Outcome CHECK (Outcome IN ('Did Survey', 'Unable to Complete Survey', 'Scheduled Survey', 'Not Interested', 'Not Eligible', 'No Caregiver Available', 'No Adult English Speaker',
                                                               'Not Home', 'Non-Residential', 'Housing Type Ineligible', 'Non-Existent', 'Vacant', 'Unable to Access', 'Partial Address', 'No Matching Unit Number',
                                                               'Follow Up Letter', 'Missing Visit Info', 'Letter Response', 'Random Selection Form')),
    CONSTRAINT CK_Visit_Kids CHECK (Kids IN ('Confirmed kids live in home', 'Confirmed no kids live in home', 'Spoke with someone but did not confirm', 'Did not speak with anyone', 'Missing Visit Info')),
    CONSTRAINT CK_Visit_Surveyors CHECK (Surveyor1Id <> Surveyor2Id),
    CONSTRAINT CK_Visit_Scheduled CHECK (CONVERT(DATETIME, Scheduled) > CONVERT(DATETIME, DateVisited)),
    CONSTRAINT UQ_Visit_AddressId_DateVisited UNIQUE(AddressId, DateVisited));
GO

DML

Adds, edits, or deletes rows of data within tables

The most common commands are INSERT, UPDATE, and DELETE.

INSERT reads data into a database table:

INSERT INTO Tract (GeoId, LowPop, PhiladelphiaTract)
VALUES ('42101000100', 0, 1)

DELETE removes data from a table:

DELETE FROM tract
WHERE GeoId='42101000100'

UPDATE edits data in a table

UPDATE Tract
SET LowPop=0
WHERE GeoId='42101000100'

Views

Views are a predefined query that you can interact with like a table. They are often used for more complicated queries or for queries that are run partciulary often.

CREATE VIEW ReviewVisits
AS
   SELECT a.AddressId,
          a.StreetNum,
          a.Street,
          a.Unit,
          v.VisitId,
          v.DateVisited,
          CONCAT_WS(' ', s.FirstName, s.LastName) AS Surveyor1,
          CONCAT_WS(' ', s2.FirstName, s2.LastName) AS Surveyor2,
          v.Outcome,
          v.Kids,
          v.Notes,
          v.DateEntered,
          v.Scheduled,
          v.AddedBy
     FROM Address a
          INNER JOIN Visit v ON v.AddressId = a.AddressId
          INNER JOIN Surveyor s ON v.Surveyor1Id = s.SurveyorId
          INNER JOIN Surveyor s2 ON v.Surveyor2Id = s2.SurveyorId
    WHERE v.NeedsReview = 1 AND (v.FinishedReview <> 1 or v.FinishedReview IS NULL);

Then once the view is created you can run the query by just doing

SELECT * FROM ReviewVisits

You can even use the view in joins.