GeoId | LowPop | PhiladelphiaTract | Year | Month | AvgTemp | MinTemp | MaxTemp | |
---|---|---|---|---|---|---|---|---|
0 | 42101000100 | 0 | 1 | 2006 | April | 14.880681 | 9.247916 | 20.515867 |
1 | 42101000100 | 0 | 1 | 2006 | August | 26.294310 | 21.673769 | 30.915272 |
2 | 42101000100 | 0 | 1 | 2006 | December | 6.918378 | 2.775594 | 11.061740 |
3 | 42101000100 | 0 | 1 | 2006 | February | 3.441863 | -0.572194 | 7.456766 |
4 | 42101000100 | 0 | 1 | 2006 | January | 5.330633 | 1.094335 | 9.568929 |
5 | 42101000100 | 0 | 1 | 2006 | July | 27.479420 | 22.684456 | 32.275227 |
6 | 42101000100 | 0 | 1 | 2006 | June | 23.696087 | 18.939714 | 28.453455 |
7 | 42101000100 | 0 | 1 | 2006 | March | 8.066606 | 3.266680 | 12.867531 |
8 | 42101000100 | 0 | 1 | 2006 | May | 18.726263 | 13.191011 | 24.262665 |
9 | 42101000100 | 0 | 1 | 2006 | November | 10.846342 | 6.812098 | 14.881163 |
10 | 42101000100 | 0 | 1 | 2006 | October | 14.163409 | 9.614525 | 18.713562 |
11 | 42101000100 | 0 | 1 | 2006 | September | 20.172976 | 16.086805 | 24.260143 |
12 | 42101000100 | 0 | 1 | 2007 | April | 10.428053 | 5.957427 | 14.899525 |
13 | 42101000100 | 0 | 1 | 2007 | August | 24.970825 | 20.111498 | 29.831573 |
14 | 42101000100 | 0 | 1 | 2007 | December | 3.296346 | 0.263161 | 6.331954 |
15 | 42101000100 | 0 | 1 | 2007 | February | -2.388883 | -6.133591 | 1.355403 |
16 | 42101000100 | 0 | 1 | 2007 | January | 3.623262 | -0.261454 | 7.508401 |
17 | 42101000100 | 0 | 1 | 2007 | July | 25.410351 | 20.846806 | 29.975315 |
18 | 42101000100 | 0 | 1 | 2007 | June | 23.762081 | 18.822567 | 28.702440 |
19 | 42101000100 | 0 | 1 | 2007 | March | 6.773147 | 1.507581 | 12.041290 |
20 | 42101000100 | 0 | 1 | 2007 | May | 19.042379 | 13.317002 | 24.769484 |
21 | 42101000100 | 0 | 1 | 2007 | November | 8.053829 | 4.026049 | 12.082455 |
22 | 42101000100 | 0 | 1 | 2007 | October | 18.499706 | 13.852238 | 23.148750 |
23 | 42101000100 | 0 | 1 | 2007 | September | 22.557858 | 17.526993 | 27.589718 |
Basics of Database Design
Goal
This portion of the workshop is to give you some grounding in database design principles and describe how to accomplish them.
Outline
- Normalization
- Keys
- Constraints
- Data Defintion and Maniuplation Language
- Views
Normalization
The process of turning a flat dataset into a series of tables. The primary goal of normalization is to reduce redunancy in the data and increase consistency.
One big advantage of this consistency is that whenever you need to change the data you only need to update the data once in one place.
Let’s take a look at an example of how this might work.
Example
In our database we have tract information and weather information. If we have that information in a flat file prior to processing it for the database it might look something like the table below.
This is the weather information for one census tract in the years 2006 and 2007. The weather information is defined at the month level and the tract information is defined at the tract level. Because there are two different levels, the tract information is repeated for every month.
If we store the data in this way, changing the tract information can become onerous. Let’s say that tract 42101000100
was misclassifed, and that the value of LowPop for that census tract should be 1 instead of 0. To update that information in this table we would need to update each row. Missing even a single row would add inconsistency to the data.
This becomes more difficult the more data you have. Let’s add the crime data to our example.
GeoId | LowPop | PhiladelphiaTract | Year | CTVIOLCN | CTPROPCN | |
---|---|---|---|---|---|---|
0 | 42101000100 | 0 | 1 | 2006 | 158 | 327 |
1 | 42101000100 | 0 | 1 | 2007 | 126 | 342 |
The crime data is at the year level. And so now, someone updates the tract information in the weather data but not the crime data, we have an inconsistency.
The normalizaton process will try to solve this problem.
There are different levels of normalization. The more normalized the data the more consistent your data becomes but go too far and the database becomes unweidly to use. Most databases target the “3rd Normal Form” which essentially means that every important “level” is attached to a key.
Each “level” is defined by it’s ability to uniquely identify data (known as candidate keys).
In our example, the candidate keys are as follows: 1. GeoId for the tract data 2. GeoId, Year, and Month for the weather data 3. GeoId and year for the Crime data
So we split this up into 3 different tables, each with an ID that can be used to join with other tables
Tract Table
TractId | GeoId | LowPop | PhiladelphiaTract | Year | |
---|---|---|---|---|---|
0 | 24203 | 42101000100 | 0 | 1 | 2010 |
Weather Table
WeatherId | TractId | Year | Month | AvgTemp | MinTemp | MaxTemp | |
---|---|---|---|---|---|---|---|
0 | 36964 | 24203 | 2006 | April | 14.880681 | 9.247916 | 20.515867 |
1 | 36968 | 24203 | 2006 | August | 26.294310 | 21.673769 | 30.915272 |
2 | 36972 | 24203 | 2006 | December | 6.918378 | 2.775594 | 11.061740 |
3 | 36962 | 24203 | 2006 | February | 3.441863 | -0.572194 | 7.456766 |
4 | 36961 | 24203 | 2006 | January | 5.330633 | 1.094335 | 9.568929 |
5 | 36967 | 24203 | 2006 | July | 27.479420 | 22.684456 | 32.275227 |
6 | 36966 | 24203 | 2006 | June | 23.696087 | 18.939714 | 28.453455 |
7 | 36963 | 24203 | 2006 | March | 8.066606 | 3.266680 | 12.867531 |
8 | 36965 | 24203 | 2006 | May | 18.726263 | 13.191011 | 24.262665 |
9 | 36971 | 24203 | 2006 | November | 10.846342 | 6.812098 | 14.881163 |
10 | 36970 | 24203 | 2006 | October | 14.163409 | 9.614525 | 18.713562 |
11 | 36969 | 24203 | 2006 | September | 20.172976 | 16.086805 | 24.260143 |
12 | 57748 | 24203 | 2007 | April | 10.428053 | 5.957427 | 14.899525 |
13 | 57752 | 24203 | 2007 | August | 24.970825 | 20.111498 | 29.831573 |
14 | 57756 | 24203 | 2007 | December | 3.296346 | 0.263161 | 6.331954 |
15 | 57746 | 24203 | 2007 | February | -2.388883 | -6.133591 | 1.355403 |
16 | 57745 | 24203 | 2007 | January | 3.623262 | -0.261454 | 7.508401 |
17 | 57751 | 24203 | 2007 | July | 25.410351 | 20.846806 | 29.975315 |
18 | 57750 | 24203 | 2007 | June | 23.762081 | 18.822567 | 28.702440 |
19 | 57747 | 24203 | 2007 | March | 6.773147 | 1.507581 | 12.041290 |
20 | 57749 | 24203 | 2007 | May | 19.042379 | 13.317002 | 24.769484 |
21 | 57755 | 24203 | 2007 | November | 8.053829 | 4.026049 | 12.082455 |
22 | 57754 | 24203 | 2007 | October | 18.499706 | 13.852238 | 23.148750 |
23 | 57753 | 24203 | 2007 | September | 22.557858 | 17.526993 | 27.589718 |
Crime Table
CrimeDataId | TractId | Year | CTVIOLCN | CTPROPCN | |
---|---|---|---|---|---|
0 | 22838 | 24203 | 2006 | 158 | 327 |
1 | 22839 | 24203 | 2007 | 126 | 342 |
Now, if you need to change tract information or the number of property crimes in a tract and year, you only need to change the data in one place and when people run queries, that updated data will propogate throughout all future datasets derived from the database automatically.
Keys
Keys are sort of like identifiers. We’ve already briefly touched on one such key, the candidate key.
A candidate key is defined as the minimum amount of information that you need to uniquely identify an entity in a table. So to identify each unique instance of weather data in the weather table we need to know the Tract, Year, and Month of the weather data.
A table can have multiple candidate keys. For instance if you collect data about individual people and you collect SSN and email, you might be able use either of those to identify the indiviuals uniquely.
There are two other remaining types of keys:
- Primary Keys
- Foreign Keys
Primary Keys
Primary keys are just a candidate key that you select to be a table identifier in your database. If you have both SSN and email, they are both candidate keys. When you define your table, you select SSN to be the primary key. There can only be 1 primary key per table. A primary key cannot contain NULL values and must be unique.
Foreign Keys
Foreign keys are a type of constraint, a limit on the data that can appear in a given column. A foreign key is a column that references another column (most commonly of another table). Values that appear in the foreign key column must also be in the reference columen (usually the primary key of another table)
The combination of Primary Keys and Foriegn keys form the basis of relationships between tables in a database. Looking for these in a database can tell you a lot about the structure of the database and how you should query it.
Constraints
One other big component of database design are constraints. As mentioned earlier, constraints are limitations that you can place on your data. All the data in a given column or columns must satisfy this constraint. If not you will not be able to insert that data.
There are several types of constraints in SQL, here are some common ones:
Constraint | Definition |
---|---|
NOT NULL | Column cannot contain NULL values |
DEFAULT | If there is no value for a column provided on an insert, the default value will be inserted |
UNIQUE | There can be no repeat values a column or combination of columns |
CHECK | Defines specific values that are allowed to be in the column. No other values are allowed |
The Schema
The schema is everything that defines the structure and data that appears in the database. Everything that we talked about above is part of the schema.
One last part of the schema to discuss is the column definitions.
Columns
When you create a table and it’s columns, you will provide it with a data type. This data type defines how the data is treated and what data can be inserted into it. A numeric column cannot contain string data. You can find the various data types for SQL Server here.
Data Definition and Manipulation Language
Data Defintion (DDL) and Data Maniuplation (DML) Language are two subsets of SQL that focus on creating, inserting, deleting and updating data. Whereas, Data Query Language (DQL), the subset focused on querying and pulling data, is what the first two sessions of this workshop foucused on.
DDL
Creates, deletes, or alters tables.
The most common DDL commands are DROP TABLE
and CREATE TABLE
DROP TABLE
deletes or removes a table from the database. The syntax is very simple:
DROP TABLE [tablename]
Depending on your database settings, this table will not be recoverable after deletion. Also the drop will fail if doing so will violate a constraint, like a foreign key.
CREATE TABLE
is more complicated. It defines the table name and columns. I have supplied an example of two create table statements below that provide examples of numerous constraints and data types.
CREATE TABLE Address
(INTEGER PRIMARY KEY IDENTITY NOT NULL,
AddressId INTEGER NOT NULL,
StreetNum NVARCHAR(60) NOT NULL,
Street NVARCHAR(25),
Unit NOT NULL,
MultiUnit BIT DEFAULT 0,
RandomUnit BIT NVARCHAR(25),
UnitChosen INTEGER,
TotalUnits DEFAULT 1 NOT NULL,
NeedsReturn BIT NVARCHAR(200),
Flag NVARCHAR(2),
ReturnedLetterCode INTEGER,
FinalOutcome DEFAULT 0,
FinalOutcomeCompleted BIT
RemovedBeforeMailing BIT,INTEGER,
SamplingRound Zone INTEGER,
INTEGER,
Route NOT NULL DEFAULT 0,
AM BIT NOT NULL DEFAULT 0,
PM BIT NOT NULL DEFAULT 0,
SAT BIT
Anytime BIT,
Wave1_Visited BIT,
Wave1_Targeted BIT,
Wave1_Survey BIT,
Wave2_Visited BIT,
Wave2_Targeted BIT,
Wave2_Survey BIT,NVARCHAR(50),
cwk_id DEFAULT 1,
valid_sample BIT DEFAULT 1,
In_PN BIT AS CONCAT(StreetNum, LOWER(Street), LOWER(ISNULL(UnitChosen, ISNULL(Unit, 'unit num')))),
UniqueAddrIdx
CHECK ((RandomUnit=0 AND UnitChosen IS NULL)
OR (RandomUnit=1 AND UnitChosen IS NOT NULL)),
CHECK (ReturnedLetterCode IN ('VA', 'NDAA', 'ANK', 'NSN', 'STUDENT', 'REF', 'No Apt Num', NULL)),
CHECK ((Unit IS NULL AND MultiUnit=0) OR (Unit IS NOT NULL AND MultiUnit=1))
);
CREATE UNIQUE INDEX UQ_address_idx ON Address (UniqueAddrIdx);
CREATE TABLE Visit
INTEGER PRIMARY KEY IDENTITY NOT NULL,
(VisitId INTEGER NOT NULL,
AddressId NOT NULL,
DateVisited DATETIME INTEGER NOT NULL,
Surveyor1Id INTEGER NOT NULL,
Surveyor2Id NVARCHAR(50) NOT NULL,
Outcome NOT NULL DEFAULT 0,
DidScreener BIT NVARCHAR(50) NOT NULL,
Kids NVARCHAR(300),
Notes NOT NULL DEFAULT CURRENT_TIMESTAMP,
DateEntered DATETIME
Scheduled DATETIME,NOT NULL DEFAULT 0,
NeedsReview BIT NVARCHAR(50),
AddedBy NOT NULL DEFAULT 1,
Attempt BIT
FinishedReview BIT,
CONSTRAINT FK_Address_Visit FOREIGN KEY(AddressId) REFERENCES Address (AddressId),
CONSTRAINT FK_Surveyor1_Visit FOREIGN KEY(Surveyor1Id) REFERENCES Surveyor (SurveyorId),
CONSTRAINT FK_Surveyor2_Visit FOREIGN KEY(Surveyor2Id) REFERENCES Surveyor (SurveyorId),
CONSTRAINT CK_Visit_DateVisited CHECK(DateVisited <= CURRENT_TIMESTAMP),
CONSTRAINT CK_Visit_ScheduledOutcome CHECK (Scheduled IS NOT NULL AND Outcome = 'Scheduled Survey' OR (Scheduled IS NULL AND Outcome <> 'Scheduled Survey')),
CONSTRAINT CK_Visit_KidsOutcome CHECK ((Kids = 'Did not speak with anyone'
AND Outcome IN ('Not Home', 'Non-Residential', 'Non-Existent', 'Housing Type Ineligible', 'No Matching Unit', 'Partial Address', 'Unable to Access', 'Vacant',
'Follow Up Letter', 'Missing Visit Info', 'Letter Response', 'Random Selection Form'))
OR (Kids = 'Spoke with someone but did not confirm'
AND Outcome IN ('Not Interested', 'No Adult English Speaker', 'No Caregiver Available', 'Follow Up Letter', 'Missing Visit Info', 'Letter Response'))
OR (Kids = 'Confirmed no kids live in home'
AND Outcome IN ('Not Eligible', 'No Caregiver Available', 'No Adult English Speaker', 'Missing Visit Info', 'Housing Type Ineligible', 'Letter Response'))
OR (Kids = 'Confirmed kids live in home'
AND Outcome IN ('Did Survey', 'Not Interested', 'No Caregiver Available', 'No Adult English Speaker', 'Unable to Complete Survey', 'Follow Up Letter',
'Scheduled Survey', 'Missing Visit Info', 'Letter Response'))
OR (Kids = 'Missing Visit Info')
OR (Outcome = 'Missing Visit Info')),
CONSTRAINT CK_Visit_Outcome CHECK (Outcome IN ('Did Survey', 'Unable to Complete Survey', 'Scheduled Survey', 'Not Interested', 'Not Eligible', 'No Caregiver Available', 'No Adult English Speaker',
'Not Home', 'Non-Residential', 'Housing Type Ineligible', 'Non-Existent', 'Vacant', 'Unable to Access', 'Partial Address', 'No Matching Unit Number',
'Follow Up Letter', 'Missing Visit Info', 'Letter Response', 'Random Selection Form')),
CONSTRAINT CK_Visit_Kids CHECK (Kids IN ('Confirmed kids live in home', 'Confirmed no kids live in home', 'Spoke with someone but did not confirm', 'Did not speak with anyone', 'Missing Visit Info')),
CONSTRAINT CK_Visit_Surveyors CHECK (Surveyor1Id <> Surveyor2Id),
CONSTRAINT CK_Visit_Scheduled CHECK (CONVERT(DATETIME, Scheduled) > CONVERT(DATETIME, DateVisited)),
CONSTRAINT UQ_Visit_AddressId_DateVisited UNIQUE(AddressId, DateVisited));
GO
DML
Adds, edits, or deletes rows of data within tables
The most common commands are INSERT
, UPDATE
, and DELETE
.
INSERT
reads data into a database table:
INSERT INTO Tract (GeoId, LowPop, PhiladelphiaTract)
VALUES ('42101000100', 0, 1)
DELETE
removes data from a table:
DELETE FROM tract
WHERE GeoId='42101000100'
UPDATE
edits data in a table
UPDATE Tract
SET LowPop=0
WHERE GeoId='42101000100'
Views
Views are a predefined query that you can interact with like a table. They are often used for more complicated queries or for queries that are run partciulary often.
CREATE VIEW ReviewVisits
AS
SELECT a.AddressId,
a.StreetNum,
a.Street,
a.Unit,
v.VisitId,
v.DateVisited,' ', s.FirstName, s.LastName) AS Surveyor1,
CONCAT_WS(' ', s2.FirstName, s2.LastName) AS Surveyor2,
CONCAT_WS(
v.Outcome,
v.Kids,
v.Notes,
v.DateEntered,
v.Scheduled,
v.AddedByFROM Address a
INNER JOIN Visit v ON v.AddressId = a.AddressId
INNER JOIN Surveyor s ON v.Surveyor1Id = s.SurveyorId
INNER JOIN Surveyor s2 ON v.Surveyor2Id = s2.SurveyorId
WHERE v.NeedsReview = 1 AND (v.FinishedReview <> 1 or v.FinishedReview IS NULL);
Then once the view is created you can run the query by just doing
SELECT * FROM ReviewVisits
You can even use the view in joins.