4 changes: 2 additions & 2 deletions episodes/01-relational-database.md
@@ -36,7 +36,7 @@ Databases are designed to allow efficient querying against very large tables, mo

## What is a table?

As were have noted above, a single table is very much like a spreadsheet. It has rows and it has columns. A row represents a single observation and the columns represents the various variables contained within that observation.
As we have noted above, a single table is very much like a spreadsheet. It has rows and it has columns. A row represents a single observation and the columns represent the various variables contained within that observation.
Often one or more columns in a row will be designated as a 'primary key'. This column or combination of columns can be used to uniquely identify a specific row in the table.
The columns typically have a name associated with them indicating the variable name. A column always represents the same variable for each row contained in the table. Because of this, the data in each column will always be of the same *type* of values, such as Integer or Text, for all of the rows in the table. Datatypes are discussed in the next section.

@@ -108,7 +108,7 @@ for these and use the built-in Date And Time Functions to manipulate them. We wi

## Why do tables have primary key columns?

Whenever you create a table, you will have the option of designating one of the columns as the primary key column. The main property of the primary key column is that the values contained in it must uniquely identify that particular row. That is you cannot have duplicate primary keys. This can be an advantage which adding rows to the table as you will not be allowed to add the same row (or a row with the same primary key) twice.
Whenever you create a table, you will have the option of designating one of the columns as the primary key column. The main property of the primary key column is that the values contained in it must uniquely identify that particular row. That is you cannot have duplicate primary keys. This can be an advantage when adding rows to the table as you will not be allowed to add the same row (or a row with the same primary key) twice.

The primary key column for a table is usually of type Integer although you could have Text. For example if you had a table of car information, then the "Reg\_No" column could be made the primary key as it can be used to uniquely identify a particular row in the table.
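
The ideas above can be sketched as a table definition. This is a minimal illustration only; the table name and the columns other than Reg\_No are invented for the example.

```sql
-- A sketch of a car table whose primary key is the registration number
-- (the table name and the non-Reg_No columns are illustrative).
CREATE TABLE Cars (
    Reg_No TEXT PRIMARY KEY,  -- must be unique for every row
    Make   TEXT,
    Model  TEXT
);
```

Attempting to insert two rows with the same Reg\_No value would cause the second insert to be rejected.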

6 changes: 3 additions & 3 deletions episodes/02-db-browser.md
@@ -52,13 +52,13 @@

![](fig/DB_Browser_run_2.png){alt='Data Browser Preferences'}

Towards the bottom there is a section dealing with Field colors. You will see three bars below the word Text, to the right there are in fact three invisible bars for the Background. Click in the area for the Background color for NULL. A colour selector window will open, select Red. The bar will turn Red. This is now the default background cell colour that will be used to display NULL values in you tables. We will discuss the meaning of NULL values in a table in a later episode.
Towards the bottom there is a section dealing with Field colors. You will see three bars below the word Text, to the right there are in fact three invisible bars for the Background. Click in the area for the Background color for NULL. A colour selector window will open, select Red. The bar will turn Red. This is now the default background cell colour that will be used to display NULL values in your tables. We will discuss the meaning of NULL values in a table in a later episode.

You can now close the preference window by clicking OK.

## Opening a database

For this lesson we will be making extensive use of the SQL\_SAFI database. If you do not already have a copy of this database you can download it from [here](data/SQL_SAFI.sqlite).

Check warning on line 61 in episodes/02-db-browser.md (GitHub Actions / Build markdown source files if valid): [uninformative link text]: [here](data/SQL_SAFI.sqlite)

To open the database in DB Browser do the following;

@@ -76,7 +76,7 @@
![](fig/DB_Browser_run_3.png){alt='Table Actions'}

If you select 'Browse Table', the data from the table is loaded into the 'Browse Data' pane from where it can be examined or filtered.
You can also select the table you wish to Browse directly from here.
You can also select the table you wish to browse directly from here.

There are options for 'New Record' and 'Delete Record'. As our interest is in analysing existing data not creating or deleting data, it is unlikely that you will want to use these options.

@@ -97,7 +97,7 @@
On the toolbar at the top there are eight buttons. Left to right they are:

- Open Tab (creates a new tab in the editor)
- Open SQL file (allows you to load a prepared file of SQL into the editor - the tab takes the name of he file)
- Open SQL file (allows you to load a prepared file of SQL into the editor - the tab takes the name of the file)
- Save SQL file (allows you to save the current contents of the active pane to the local file system)
- Execute SQL (Executes all of the SQL statements in the editor pane)
- Execute current line (Actually executes whatever is selected)
2 changes: 1 addition & 1 deletion episodes/03-select.md
@@ -172,7 +172,7 @@ WHERE B17_parents_liv = 'yes'
;
```

Notice that the columns being used in the `WHERE` clause do not need to returned as part of the `SELECT` clause.
Notice that the columns being used in the `WHERE` clause do not need to be returned as part of the `SELECT` clause.

You can ensure the precedence of the operators by using brackets. Judicious use of brackets can also aid readability.
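A small sketch illustrating both points: B17\_parents\_liv comes from the query above, while the table name and the other column names are assumptions made for the example.

```sql
-- The columns tested in WHERE need not appear in the SELECT list,
-- and brackets make the precedence of AND/OR explicit
-- (Farms, B16_years_liv and A11_years_farm are illustrative here).
SELECT Id
FROM Farms
WHERE (B17_parents_liv = 'yes' OR B16_years_liv > 10)
  AND A11_years_farm > 5
;
```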

6 changes: 3 additions & 3 deletions episodes/04-missing-data.md
@@ -23,7 +23,7 @@ exercises: 0
At the beginning of this lesson we noted that all database systems have the concept of a NULL value; something which is missing and about which nothing is known.

In DB Browser we can choose how we want NULLs in a table to be displayed. When we had our initial look at DB Browser,
we used the `View | Preference` option to change the background colour of cells in a table which has a `NULL` values as **red**.
we used the `View | Preference` option to change the background colour of cells in a table which has `NULL` values as **red**.
The example below, using the 'Browse data' tab, shows a section of the Farms table in the SQL\_SAFI database with column values which are `NULL`.

![](fig/SQL_04_Nulls_01.png){alt='Farms NULLs'}
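
A query of the following shape would list such rows directly; the F14\_items\_owned column name is taken from later in this episode.

```sql
-- NULL values must be tested with IS NULL, not with '= NULL'
SELECT Id, F14_items_owned
FROM Farms
WHERE F14_items_owned IS NULL
;
```

Note that `= NULL` would not work here: a comparison with `NULL` never evaluates to true, so SQL provides the special `IS NULL` test.
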
@@ -78,10 +78,10 @@ the value of `NULL` is appropriate.

## Dealing with missing data

There are several statistical techniques that can be used to allow for `NULL` values, which one you might will depend on what has caused the `NULL` value to be recorded.
There are several statistical techniques that can be used to allow for `NULL` values. Which one you might use will depend on what has caused the `NULL` value to be recorded.

You may want to change the `NULL` value to something else. For example if we knew that the `NULL` values in the `F14_items_owned` column actually meant that the Farmer had no possessions then we
might want to change the `NULL` values to '[]' to represent and empty list. We can do that in SQL with an `UPDATE` query.
might want to change the `NULL` values to '[]' to represent an empty list. We can do that in SQL with an `UPDATE` query.

The update query is shown below. We are not going to run it as it would change our data.
You need to be very sure of the effect you are going to have before you change data in this way.
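
The query being described would look something like this sketch (again, not to be run against data you wish to keep unchanged):

```sql
-- Replace missing F14_items_owned values with '[]', an empty list
UPDATE Farms
SET F14_items_owned = '[]'
WHERE F14_items_owned IS NULL
;
```
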
10 changes: 5 additions & 5 deletions episodes/05-creating-new-columns.md
@@ -61,11 +61,11 @@
## Using built-in functions to create new values

In addition to using simple arithmetic operations to create new columns, you can also use some of the SQLite built-in functions.
Full details of the available built-in functions are available from the SQLite.org website [here](https://sqlite.org/lang_corefunc.html#instr).

Check warning on line 64 in episodes/05-creating-new-columns.md (GitHub Actions / Build markdown source files if valid): [uninformative link text]: [here](https://sqlite.org/lang_corefunc.html#instr)

We will look at some of the arithmetic and statistical functions when we deal with aggregations in a later lesson.

You may have noticed in the output from are last query that the number of decimal places can change from one row to another. In order to make the output
You may have noticed in the output from our last query that the number of decimal places can change from one row to another. In order to make the output
more tidy, we may wish to always produce the same number of decimal places, e.g. 2. We can do this using the `ROUND` function.

The `ROUND` function works in a similar way as its spreadsheet equivalent: you specify the value you wish to round and the required number of decimal places.
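
A minimal, self-contained illustration of the function:

```sql
-- ROUND(value, number_of_decimal_places)
SELECT ROUND(10.0 / 3, 2);   -- returns 3.33
```
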
@@ -113,10 +113,10 @@
| substr(a,b,c) | mid(a,b,c) |
| instr(a,b) | find(a,b) |

`instr` can be used to check a character or string of characters occurs within another string.
`instr` can be used to check if a character or string of characters occurs within another string.
`substr` can be used to extract a portion of a string based on a starting position and the number of characters required.

In the Farms table, the three columns A01\_interview\_date, A04\_start and A05\_end are all recognisable as a dates with the A04\_start and A05\_end also including times.
In the Farms table, the three columns A01\_interview\_date, A04\_start and A05\_end are all recognisable as dates with the A04\_start and A05\_end also including times.
These last two are automatically generated by the eSurvey software when the data is collected, i.e. they are automatically entered. The A01\_interview\_date however is manually input.
In all three cases, however, SQLite thinks that they are all just strings of characters.
We can confirm this by selecting the `Database Structure` tab and expanding the `Farms` entry and notice that the data type for all three columns is listed as 'TEXT'
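
Because the values are just text, the string functions above can pick them apart. The sketch below assumes the dates are stored in a 'dd/mm/yyyy' style; the positions would need to be checked against the actual data first.

```sql
-- Illustrative only: the positions assume a 'dd/mm/yyyy' text format
SELECT A01_interview_date,
       substr(A01_interview_date, 7, 4) AS year_part,      -- characters 7 to 10
       instr(A01_interview_date, '/')   AS first_slash_pos -- position of first '/'
FROM Farms;
```
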
@@ -268,7 +268,7 @@
```

By default the `ORDER BY` clause will sort in ascending order, smallest to
biggest; we can make this explicit by usingthe `ASC` keyword. Or if we want to
biggest; we can make this explicit by using the `ASC` keyword. Or if we want to
sort in descending order we can use the `DESC` keyword.

```sql
@@ -296,7 +296,7 @@
;
```

There is a more general form which allows to to perform any kind of test.
There is a more general form which allows us to perform any kind of test.
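This general form is the `CASE` expression; the test and the labels in this sketch are invented for illustration.

```sql
-- One label per farm, based on an arbitrary cut-off of 10 years
SELECT Id,
       A11_years_farm,
       CASE
           WHEN A11_years_farm < 10 THEN 'newer farm'
           ELSE 'established farm'
       END AS farm_age_group
FROM Farms;
```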

## Using SQL syntax to create ‘binned' values

8 changes: 4 additions & 4 deletions episodes/06-aggregation.md
@@ -22,7 +22,7 @@

## Using built-in statistical functions

Aggregate functions are used perform some kind of mathematical or statistical calculation across a group of rows. The rows in each group are determined
Aggregate functions are used to perform some kind of mathematical or statistical calculation across a group of rows. The rows in each group are determined
by the different values in a specified column or columns. Alternatively you can aggregate across the entire table.

If we wanted to know the minimum, average and maximum values of the 'A11\_years\_farm' column across the whole Farms table, we could write a query such as this;
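
The collapsed query would be along these lines (a sketch reconstructed from the description above):

```sql
SELECT min(A11_years_farm),
       avg(A11_years_farm),
       max(A11_years_farm)
FROM Farms;
```
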
@@ -38,7 +38,7 @@
This sort of query provides us with a general view of the values for a particular column or field across the whole table.

`min` , `max` and `avg` are built-in aggregate functions in SQLite (and any other SQL database system). There are other such functions available.
A complete list can be found in the SQLite documentation [here](https://sqlite.org/lang_aggfunc.html).

Check warning on line 41 in episodes/06-aggregation.md (GitHub Actions / Build markdown source files if valid): [uninformative link text]: [here](https://sqlite.org/lang_aggfunc.html)

It is more likely that we would want to find such values for a range, or multiple ranges of rows where each range is determined by the
values of some other column in the table. Before we do this we will look at how we can find out what different values are contained in a given column.
@@ -76,7 +76,7 @@

![](fig/SQL_06_villages.png){alt='Villages'}

The problem with allowing free-form text quite obvious. Having two villages, one called 'Massequece' and the other called 'Massequese' is unlikely.
The problem with allowing free-form text may be quite obvious. Having two villages, one called 'Massequece' and the other called 'Massequese' is unlikely.

Detecting this type of problem in a large dataset can be very difficult if you are just 'eyeballing' the content. This small SQL query makes it very clear,
and in the OpenRefine lesson we provide approaches to detecting and correcting such errors. SQL is not the best tool for correcting this type of error.
@@ -110,7 +110,7 @@

## The `GROUP BY` clause to summarise data

Just knowing the combinations is of limited use. You really want to know **How many** of each of the values there are.
Just knowing the combinations is of limited use. You really want to know **how many** of each of the values there are.
To do this we use the `GROUP BY` clause.

```sql
@@ -124,7 +124,7 @@

In the first example of this episode, three aggregations were performed over the single column 'A11\_years\_farm'.
In addition to calculating multiple aggregation values over a single column, it is also possible to aggregate over multiple columns by specifying
them in all in the `SELECT` clause **and** the `GROUP BY` clause.
them all in the `SELECT` clause **and** the `GROUP BY` clause.

The grouping will take place based on the order of the columns listed in the `GROUP BY` clause. There will be one row returned for each unique combination of the columns mentioned in the `GROUP BY` clause.
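As a sketch, grouping by two columns might look like this; the column names here are invented for illustration.

```sql
-- One result row per unique (village, ward) combination
SELECT A09_village,
       A08_ward,
       count(*) AS num_farms
FROM Farms
GROUP BY A09_village, A08_ward;
```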

14 changes: 7 additions & 7 deletions episodes/07-creating-tables-views.md
@@ -77,16 +77,16 @@ If any of the datatypes are not as expected or wanted we can change them.
In this particular case DB Browser correctly selected the datatypes. Notice that the `A01_interview_date` was allocated a datatype of 'TEXT'. This isn't a problem
as we have to use the Date and Time functions to manipulate dates anyway.

Notice that the bottom pane in the Window shows the SQL DDL statement that would create the table that you modifying.
Notice that the bottom pane in the Window shows the SQL DDL statement that would create the table that you are modifying.

When you change one of the columns from TEXT to INTEGER, this is immediately reflected in the Create Table statement.
It is slightly misleading because in fact we are modifying an existing table and in SQL-speak, this would be an **Alter Table...** statement.
However it does illustrate quite well the fact that whatever you do in the GUI, it is essentially translated into an SQL statement and executed.
You could copy and paste this definition into the SQL editor and if you change the table name before you ran it, you would create a new table with that name.
You could copy and paste this definition into the SQL editor and if you changed the table name before you ran it, you would create a new table with that name.
This new table would have no data in it. This is how the insert table wizard works. It uses the header row from your data to create a `CREATE TABLE` statement which it runs.
It then transforms each of the rows of data into SQL `INSERT INTO...` statements which it also runs to get the data into the table.

In addition to changing the data types there are several other options which can be set when you are creating of modifying a table.
In addition to changing the data types there are several other options which can be set when you are creating or modifying a table.
For our tables we don't need to make use of them but for completeness we will describe what they are;

**PK** - Or Primary Key, a unique identifier for the row. In the Farms table, there is an `Id` column which uniquely identifies a Farm.
@@ -98,7 +98,7 @@ This could act as a unique identifier for the row as a whole. We could mark thi

In real datasets missing values are quite common and we have already looked at ways of dealing with them when they occur in tables. If you were to **check** this box and the data did have missing values for this column, the record from the file would be rejected and the load of the file will fail.

**U** - Or Unique. This allows you to say that the contents of the column, which is not the primary key column has to have unique values in it. Like Allow Null this is another way of providing some data validation as the data is imported. Although it doesn't really apply with the DB Browser import wizard as the data is imported before you are allowed to set this option.
**U** - Or Unique. This allows you to say that the contents of the column, which is not the primary key column, has to have unique values in it. Like Allow Null, this is another way of providing some data validation as the data is imported (although it doesn't really apply with the DB Browser import wizard as the data is imported before you are allowed to set this option).

**Default** - This is used in conjunction with 'Not Null': if a value is not provided in the dataset, then the default value for that column, if one has been set, will be used.

@@ -133,7 +133,7 @@ line added.

## Creating a table using an SQL command

You could copy and paste this definition into the SQL editor and if you change the table name before you ran it, you would create a new table with that name.
You could copy and paste this definition into the SQL editor and if you changed the table name before you ran it, you would create a new table with that name.
This new table would have no data in it. This is how the insert table wizard works. It uses the header row from your data to create a `CREATE TABLE` statement which it runs.
It then transforms each of the rows of data into SQL `INSERT INTO...` statements which it also runs to get the data into the table.

@@ -172,7 +172,7 @@ SELECT Id,
FROM Farms;
```

If we wanted to create a table from the Crops table which contains only the rows where the D\_curr\_crop value was 'rice' we could use a query like this:
If we wanted to create a table from the Crops table, which contains only the rows where the D\_curr\_crop value was 'rice' we could use a query like this:

```sql
CREATE TABLE crops_rice AS
@@ -215,7 +215,7 @@ The advantage of using Views is that it allows you to restrict how you see the d
In the example we used above it may be far easier to work with only the 6 columns that we need from the full Farms table
rather than the full table with 61 columns.

A View isn't restricted to simple `SELECT` statements it can be the result of aggregations and joins as well.
A View isn't restricted to simple `SELECT` statements. It can be the result of aggregations and joins as well.
This can help reduce the complexity of queries based on the View and so aid readability.
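As a sketch, a View over an aggregation might be created and used like this; the View name and the grouping column are invented for illustration.

```sql
-- A View wrapping a GROUP BY query
CREATE VIEW farms_per_village AS
SELECT A09_village,
       count(*) AS num_farms
FROM Farms
GROUP BY A09_village;

-- The View can then be queried just like a table
SELECT * FROM farms_per_village;
```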

:::::::::::::::::::::::::::::::::::::::: keypoints