Basics of Database Design

Author

Matthew Jannetti

Published

August 22, 2023

Goal

This portion of the workshop is to give you some grounding in database design principles and describe how to accomplish them.

Outline

Normalization
Keys
Constraints
Data Defintion and Maniuplation Language
Views

Normalization

The process of turning a flat dataset into a series of tables. The primary goal of normalization is to reduce redunancy in the data and increase consistency.

One big advantage of this consistency is that whenever you need to change the data you only need to update the data once in one place.

Let’s take a look at an example of how this might work.

Example

In our database we have tract information and weather information. If we have that information in a flat file prior to processing it for the database it might look something like the table below.

	GeoId	PhiladelphiaTract	Year	Month	AvgTemp	MinTemp	MaxTemp
0	42101000100	1	2006	April	14.880681	9.247916	20.515867
1	42101000100	1	2006	August	26.294310	21.673769	30.915272
2	42101000100	1	2006	December	6.918378	2.775594	11.061740
3	42101000100	1	2006	February	3.441863	-0.572194	7.456766
4	42101000100	1	2006	January	5.330633	1.094335	9.568929
5	42101000100	1	2006	July	27.479420	22.684456	32.275227
6	42101000100	1	2006	June	23.696087	18.939714	28.453455
7	42101000100	1	2006	March	8.066606	3.266680	12.867531
8	42101000100	1	2006	May	18.726263	13.191011	24.262665
9	42101000100	1	2006	November	10.846342	6.812098	14.881163
10	42101000100	1	2006	October	14.163409	9.614525	18.713562
11	42101000100	1	2006	September	20.172976	16.086805	24.260143
12	42101000100	1	2007	April	10.428053	5.957427	14.899525
13	42101000100	1	2007	August	24.970825	20.111498	29.831573
14	42101000100	1	2007	December	3.296346	0.263161	6.331954
15	42101000100	1	2007	February	-2.388883	-6.133591	1.355403
16	42101000100	1	2007	January	3.623262	-0.261454	7.508401
17	42101000100	1	2007	July	25.410351	20.846806	29.975315
18	42101000100	1	2007	June	23.762081	18.822567	28.702440
19	42101000100	1	2007	March	6.773147	1.507581	12.041290
20	42101000100	1	2007	May	19.042379	13.317002	24.769484
21	42101000100	1	2007	November	8.053829	4.026049	12.082455
22	42101000100	1	2007	October	18.499706	13.852238	23.148750
23	42101000100	1	2007	September	22.557858	17.526993	27.589718

This is the weather information for one census tract in the years 2006 and 2007. The weather information is defined at the month level and the tract information is defined at the tract level. Because there are two different levels, the tract information is repeated for every month.

If we store the data in this way, changing the tract information can become onerous. Let’s say that tract 42101000100 was misclassifed, and that the value of LowPop for that census tract should be 1 instead of 0. To update that information in this table we would need to update each row. Missing even a single row would add inconsistency to the data.

This becomes more difficult the more data you have. Let’s add the crime data to our example.

	GeoId	LowPop	PhiladelphiaTract	Year	CTVIOLCN	CTPROPCN
0	42101000100	0	1	2006	158	327
1	42101000100	0	1	2007	126	342

The crime data is at the year level. And so now, someone updates the tract information in the weather data but not the crime data, we have an inconsistency.

The normalizaton process will try to solve this problem.

There are different levels of normalization. The more normalized the data the more consistent your data becomes but go too far and the database becomes unweidly to use. Most databases target the “3rd Normal Form” which essentially means that every important “level” is attached to a key.

Each “level” is defined by it’s ability to uniquely identify data (known as candidate keys).

In our example, the candidate keys are as follows: 1. GeoId for the tract data 2. GeoId, Year, and Month for the weather data 3. GeoId and year for the Crime data

So we split this up into 3 different tables, each with an ID that can be used to join with other tables

Tract Table

	TractId	GeoId	LowPop	PhiladelphiaTract	Year
0	24203	42101000100	0	1	2010

Weather Table

	WeatherId	TractId	Year	Month	AvgTemp	MinTemp	MaxTemp
0	36964	24203	2006	April	14.880681	9.247916	20.515867
1	36968	24203	2006	August	26.294310	21.673769	30.915272
2	36972	24203	2006	December	6.918378	2.775594	11.061740
3	36962	24203	2006	February	3.441863	-0.572194	7.456766
4	36961	24203	2006	January	5.330633	1.094335	9.568929
5	36967	24203	2006	July	27.479420	22.684456	32.275227
6	36966	24203	2006	June	23.696087	18.939714	28.453455
7	36963	24203	2006	March	8.066606	3.266680	12.867531
8	36965	24203	2006	May	18.726263	13.191011	24.262665
9	36971	24203	2006	November	10.846342	6.812098	14.881163
10	36970	24203	2006	October	14.163409	9.614525	18.713562
11	36969	24203	2006	September	20.172976	16.086805	24.260143
12	57748	24203	2007	April	10.428053	5.957427	14.899525
13	57752	24203	2007	August	24.970825	20.111498	29.831573
14	57756	24203	2007	December	3.296346	0.263161	6.331954
15	57746	24203	2007	February	-2.388883	-6.133591	1.355403
16	57745	24203	2007	January	3.623262	-0.261454	7.508401
17	57751	24203	2007	July	25.410351	20.846806	29.975315
18	57750	24203	2007	June	23.762081	18.822567	28.702440
19	57747	24203	2007	March	6.773147	1.507581	12.041290
20	57749	24203	2007	May	19.042379	13.317002	24.769484
21	57755	24203	2007	November	8.053829	4.026049	12.082455
22	57754	24203	2007	October	18.499706	13.852238	23.148750
23	57753	24203	2007	September	22.557858	17.526993	27.589718

Crime Table

	CrimeDataId	TractId	Year	CTVIOLCN	CTPROPCN
0	22838	24203	2006	158	327
1	22839	24203	2007	126	342

Now, if you need to change tract information or the number of property crimes in a tract and year, you only need to change the data in one place and when people run queries, that updated data will propogate throughout all future datasets derived from the database automatically.

Keys

Keys are sort of like identifiers. We’ve already briefly touched on one such key, the candidate key.

A candidate key is defined as the minimum amount of information that you need to uniquely identify an entity in a table. So to identify each unique instance of weather data in the weather table we need to know the Tract, Year, and Month of the weather data.

A table can have multiple candidate keys. For instance if you collect data about individual people and you collect SSN and email, you might be able use either of those to identify the indiviuals uniquely.

There are two other remaining types of keys:

Primary Keys
Foreign Keys

Primary Keys

Primary keys are just a candidate key that you select to be a table identifier in your database. If you have both SSN and email, they are both candidate keys. When you define your table, you select SSN to be the primary key. There can only be 1 primary key per table. A primary key cannot contain NULL values and must be unique.

Foreign Keys

Foreign keys are a type of constraint, a limit on the data that can appear in a given column. A foreign key is a column that references another column (most commonly of another table). Values that appear in the foreign key column must also be in the reference columen (usually the primary key of another table)

The combination of Primary Keys and Foriegn keys form the basis of relationships between tables in a database. Looking for these in a database can tell you a lot about the structure of the database and how you should query it.

Constraints

One other big component of database design are constraints. As mentioned earlier, constraints are limitations that you can place on your data. All the data in a given column or columns must satisfy this constraint. If not you will not be able to insert that data.

There are several types of constraints in SQL, here are some common ones:

Constraint	Definition
NOT NULL	Column cannot contain NULL values
DEFAULT	If there is no value for a column provided on an insert, the default value will be inserted
UNIQUE	There can be no repeat values a column or combination of columns
CHECK	Defines specific values that are allowed to be in the column. No other values are allowed

The Schema

The schema is everything that defines the structure and data that appears in the database. Everything that we talked about above is part of the schema.

One last part of the schema to discuss is the column definitions.

Columns

When you create a table and it’s columns, you will provide it with a data type. This data type defines how the data is treated and what data can be inserted into it. A numeric column cannot contain string data. You can find the various data types for SQL Server here.

Data Definition and Manipulation Language

Data Defintion (DDL) and Data Maniuplation (DML) Language are two subsets of SQL that focus on creating, inserting, deleting and updating data. Whereas, Data Query Language (DQL), the subset focused on querying and pulling data, is what the first two sessions of this workshop foucused on.

DDL

Creates, deletes, or alters tables.

The most common DDL commands are DROP TABLE and CREATE TABLE

DROP TABLE deletes or removes a table from the database. The syntax is very simple:

DROP TABLE [tablename]

Depending on your database settings, this table will not be recoverable after deletion. Also the drop will fail if doing so will violate a constraint, like a foreign key.

CREATE TABLE is more complicated. It defines the table name and columns. I have supplied an example of two create table statements below that provide examples of numerous constraints and data types.

CREATE TABLE Address
    (
       AddressId               INTEGER PRIMARY KEY IDENTITY NOT NULL,
       StreetNum               INTEGER NOT NULL,
       Street                  NVARCHAR(60) NOT NULL,
       Unit                    NVARCHAR(25),
       MultiUnit               BIT NOT NULL,
       RandomUnit              BIT DEFAULT 0,
       UnitChosen              NVARCHAR(25),
       TotalUnits              INTEGER,
       NeedsReturn             BIT DEFAULT 1 NOT NULL,
       Flag                    NVARCHAR(200),
       ReturnedLetterCode      NVARCHAR(2),
       FinalOutcome            INTEGER,
       FinalOutcomeCompleted   BIT DEFAULT 0,
       RemovedBeforeMailing    BIT,
       SamplingRound           INTEGER,
       Zone                    INTEGER,
       Route                   INTEGER,
       AM                      BIT NOT NULL DEFAULT 0,
       PM                      BIT NOT NULL DEFAULT 0,
       SAT                     BIT NOT NULL DEFAULT 0,
       Anytime                 BIT,
       Wave1_Visited           BIT,
       Wave1_Targeted          BIT,
       Wave1_Survey            BIT,
       Wave2_Visited           BIT,
       Wave2_Targeted          BIT,
       Wave2_Survey            BIT,
       cwk_id                  NVARCHAR(50),
       valid_sample            BIT DEFAULT 1,
       In_PN                   BIT DEFAULT 1,
       UniqueAddrIdx           AS CONCAT(StreetNum, LOWER(Street), LOWER(ISNULL(UnitChosen, ISNULL(Unit, 'unit num')))),

        CHECK ((RandomUnit=0 AND UnitChosen IS NULL)
                OR (RandomUnit=1 AND UnitChosen IS NOT NULL)),
        CHECK (ReturnedLetterCode IN ('VA', 'NDAA', 'ANK', 'NSN', 'STUDENT', 'REF', 'No Apt Num', NULL)),
        CHECK ((Unit IS NULL AND MultiUnit=0) OR (Unit IS NOT NULL AND MultiUnit=1))
    );

    CREATE UNIQUE INDEX UQ_address_idx ON Address (UniqueAddrIdx);

CREATE TABLE Visit
    (VisitId INTEGER PRIMARY KEY IDENTITY NOT NULL,
    AddressId INTEGER NOT NULL,
    DateVisited DATETIME NOT NULL,
    Surveyor1Id INTEGER NOT NULL,
    Surveyor2Id INTEGER NOT NULL,
    Outcome NVARCHAR(50) NOT NULL,
    DidScreener BIT NOT NULL DEFAULT 0,
    Kids NVARCHAR(50) NOT NULL,
    Notes NVARCHAR(300),
    DateEntered DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    Scheduled DATETIME,
    NeedsReview BIT NOT NULL DEFAULT 0,
    AddedBy NVARCHAR(50),
    Attempt BIT NOT NULL DEFAULT 1,
    FinishedReview BIT,

    CONSTRAINT FK_Address_Visit FOREIGN KEY(AddressId) REFERENCES Address (AddressId),
    CONSTRAINT FK_Surveyor1_Visit FOREIGN KEY(Surveyor1Id) REFERENCES Surveyor (SurveyorId),
    CONSTRAINT FK_Surveyor2_Visit FOREIGN KEY(Surveyor2Id) REFERENCES Surveyor (SurveyorId),

    CONSTRAINT CK_Visit_DateVisited CHECK(DateVisited <= CURRENT_TIMESTAMP),
    CONSTRAINT CK_Visit_ScheduledOutcome CHECK   (Scheduled IS NOT NULL AND Outcome = 'Scheduled Survey' OR (Scheduled IS NULL AND Outcome <> 'Scheduled Survey')),
    CONSTRAINT CK_Visit_KidsOutcome CHECK ((Kids = 'Did not speak with anyone'
                                                  AND Outcome IN ('Not Home', 'Non-Residential', 'Non-Existent', 'Housing Type Ineligible', 'No Matching Unit', 'Partial Address', 'Unable to Access', 'Vacant',
                                                                        'Follow Up Letter', 'Missing Visit Info', 'Letter Response', 'Random Selection Form'))
                                                  OR (Kids = 'Spoke with someone but did not confirm'
                                                      AND Outcome IN ('Not Interested', 'No Adult English Speaker', 'No Caregiver Available', 'Follow Up Letter', 'Missing Visit Info', 'Letter Response'))
                                                  OR (Kids = 'Confirmed no kids live in home'
                                                      AND Outcome IN ('Not Eligible', 'No Caregiver Available', 'No Adult English Speaker', 'Missing Visit Info', 'Housing Type Ineligible', 'Letter Response'))
                                                  OR (Kids = 'Confirmed kids live in home'
                                                      AND Outcome IN ('Did Survey', 'Not Interested', 'No Caregiver Available', 'No Adult English Speaker', 'Unable to Complete Survey', 'Follow Up Letter',
                                                                            'Scheduled Survey', 'Missing Visit Info', 'Letter Response'))
                                                  OR (Kids = 'Missing Visit Info')
                                                  OR (Outcome = 'Missing Visit Info')),
    CONSTRAINT CK_Visit_Outcome CHECK (Outcome IN ('Did Survey', 'Unable to Complete Survey', 'Scheduled Survey', 'Not Interested', 'Not Eligible', 'No Caregiver Available', 'No Adult English Speaker',
                                                               'Not Home', 'Non-Residential', 'Housing Type Ineligible', 'Non-Existent', 'Vacant', 'Unable to Access', 'Partial Address', 'No Matching Unit Number',
                                                               'Follow Up Letter', 'Missing Visit Info', 'Letter Response', 'Random Selection Form')),
    CONSTRAINT CK_Visit_Kids CHECK (Kids IN ('Confirmed kids live in home', 'Confirmed no kids live in home', 'Spoke with someone but did not confirm', 'Did not speak with anyone', 'Missing Visit Info')),
    CONSTRAINT CK_Visit_Surveyors CHECK (Surveyor1Id <> Surveyor2Id),
    CONSTRAINT CK_Visit_Scheduled CHECK (CONVERT(DATETIME, Scheduled) > CONVERT(DATETIME, DateVisited)),
    CONSTRAINT UQ_Visit_AddressId_DateVisited UNIQUE(AddressId, DateVisited));
GO

DML

Adds, edits, or deletes rows of data within tables

The most common commands are INSERT, UPDATE, and DELETE.

INSERT reads data into a database table:

INSERT INTO Tract (GeoId, LowPop, PhiladelphiaTract)
VALUES ('42101000100', 0, 1)

DELETE removes data from a table:

DELETE FROM tract
WHERE GeoId='42101000100'

UPDATE edits data in a table

UPDATE Tract
SET LowPop=0
WHERE GeoId='42101000100'

Views

Views are a predefined query that you can interact with like a table. They are often used for more complicated queries or for queries that are run partciulary often.

CREATE VIEW ReviewVisits
AS
   SELECT a.AddressId,
          a.StreetNum,
          a.Street,
          a.Unit,
          v.VisitId,
          v.DateVisited,
          CONCAT_WS(' ', s.FirstName, s.LastName) AS Surveyor1,
          CONCAT_WS(' ', s2.FirstName, s2.LastName) AS Surveyor2,
          v.Outcome,
          v.Kids,
          v.Notes,
          v.DateEntered,
          v.Scheduled,
          v.AddedBy
     FROM Address a
          INNER JOIN Visit v ON v.AddressId = a.AddressId
          INNER JOIN Surveyor s ON v.Surveyor1Id = s.SurveyorId
          INNER JOIN Surveyor s2 ON v.Surveyor2Id = s2.SurveyorId
    WHERE v.NeedsReview = 1 AND (v.FinishedReview <> 1 or v.FinishedReview IS NULL);

Then once the view is created you can run the query by just doing

SELECT * FROM ReviewVisits

You can even use the view in joins.