Let's start by opening Spatialite:
- In your computer's file explorer, navigate to the folder where you downloaded your SpatiaLite install files.
- Start Spatialite by running the
spatialite_gui.exefile in this folder.
You can think of a database as a folder in which you keep tables that are related to each other. You don't want to put data in this database that isn't related to the other data (you could, but that's not the point of a database).
We'll need to make a new empty database to keep our spatial tables in:
- Click the "Creating a new (empty) SQLite Database"
Note: several buttons have similar names. DO NOT make a Memory-DB. It's not the same thing.
- Navigate to where you would like to keep your database, perhaps in the folder where you downloaded the data for this workshop. Name your database sfbay.sqlite because we'll be working with San Francisco Bay data. Yes, it should be all lowercase. Click OK.
Note that Spatialite can only connect to one database at a time, so if you want to work with another database, disconnect this one.
Let's add some data to our database:
- Click the "Load Shapefile" button.
Note: DO NOT load a virtual shapefile.
- Navigate to your workshop data folder.
- Select the Flowlines.shp file and click Open.
- In the Table Name box, type flowlines for the name of our new table.
- In the SRID box, type 3310 because the projection for this data is California Albers (NAD38) which has the SRID of 3310.
- In the Column Names section, pick "Convert to Lowercase" so all of our column names will be lowercase letters
- For DBF DATE Values, choose "as PlainText strings" so our dates will be human readable
- For Primary Key Column, choose "Automatic". This option will create an incremented pk_uid column to ensure we have a primary key for our table.
- Review the other options, but you can leave the defaults as they are. Click OK when you're done.
You can repeat this process to add the other two datasets. Call the WBDHU8_Points_SF table centroids and WBDHU8_SF table watersheds. The other parameters will be the same as for the flowlines table.
You should see your three tables listed in the panel on the left. Congratulations! You now have a database with files related to the San Francisco Bay's watersheds!
Now we're just about ready to do some analysis with our database. You may need to expand the Spatialite window by dragging the lower right corner of the window out so you can see everything.
A query is a request for information from the database. You will type your queries into the big blank box at the top right of the window.
You'll run the query by clicking the Execute SQL Statement button to the right of the query box.
The results of the query will appear in the box below. Sometimes the results will be a table; sometimes it will be a message.
A query has a structure. The most common one you'll see today is a "select statement". These start with the SELECT command, followed by the information you want to know, then the name of the table you want the information from, and finally (and optionally) other parameters that limit the results or provide some important caveats. All queries end with a semicolon.
Non-spatial queries are queries that don't involve the geometry column (the spatial information) of our table.
We'll start by investigating our flowlines data. The flowlines are linear features that carry water from one place to another. Some are natural features like rivers or streams, others are human-made like canals.
SELECT * FROM flowlines; The asterisk (*) means "everything" or "give me all the columns". You could read the query as "Select everything from the flowlines table."
The result should look very much like an attribute table you might expect to see in a graphical GIS, but with one additional column. The import process added geometry field. The geometry field contains information that allows the database tool to know where that particular object should be located in space, but unfortunately, it doesn't look like anything we understand as humans. We'll learn to deal with this column more in a little while.
NOTE: traditionally, reserved words like SELECT are writen in all caps, while table and column names are lower case, however, if you don't follow these rules, your code will still work.
SELECT * FROM flowlines WHERE ftype = 460; This query limits our results to just the rows where the number in the ftype column is 460, which corresponds to the natural rivers and streams (not canals). "Where" in this case does NOT indicate location, but rather a condition of the data.
SELECT COUNT(pk_uid) FROM flowlines WHERE ftype = 460; Here we've added the function COUNT(). So we've asked the database tool to count all of the IDs but only if they have an ftype of 460.
What if we wanted to know how many lines there were of each ftype?
SELECT ftype, COUNT(pk_uid) FROM flowlines GROUP BY ftype;Here, I've asked for a table with the columns FTYPE and the count of each pk_uid, and finally that it should summarize (group by) the FTYPE.
If I don't like the column name that it automatically generates - COUNT(pk_uid) - I can give it an alias with the AS command:
SELECT ftype, COUNT(pk_uid) AS count_lines FROM flowlines GROUP BY ftype; This is especially handy if you're making a table for people unfamiliar with your data or SQL or if you need the column name to be something specific.
We've just looked a few options for querying non-spatial data. See the Intro to SQL Workshop for more.
Now we'll learn about spatial queries. Spatial queries typically operate on the geometry column of a table. Sometimes it will be called "geom" instead of "geometry", but what matters is the data it contains.
Let's start understanding spatial queries by looking at the geometries column:
SELECT geometry FROM flowlines;That result doesn't tell us very much information since a BLOB is not human-readable. Let's look at it in plain text so we can read it:
SELECT ST_AsText(geometry) FROM flowlines;ST_AsText() is a function that operates on the geometry colum to let us see the geometry string in human-readable form. This isn't very useful most of the time, but perhaps it's comforting to know it's there. Note: You can make columns in the results tables larger by placing your mouse cursor over the edge of the column and dragging it out once the expander handle appears (it looks like two arrows pointing different directions).
Let's do an analysis that you might come across. Let's get the lengths of each of the flowlines:
SELECT pk_uid, ST_Length(geometry) FROM flowlines;What are the units of the length query? The units are meters because the units for the projection (California Albers; SRID 3310) are meters.
We just found the length of the individual flowlines. That was not a very informative query. It would be more useful to know what the total length of the lines are summed by their fcode.
SELECT fcode, SUM(ST_Length(geometry)) FROM flowlines GROUP BY fcode;First, let's look at our watershed polygon table so we know what's in it:
SELECT * FROM watersheds;Now, let's look at an example to get the area of the watershed polygons:
SELECT name, ST_Area(geometry) FROM watersheds;Remember that if you don't like the resulting column headings, you can alias them with AS. Note that this only changes the column name in the query output, not the underlying table.
SELECT name, ST_Area(geometry) AS area FROM watersheds;Because we are working with spatial data, we need to know how to handle projections.
When you imported your data into Spatialite, it asked you what the SRID (EPSG Code) was for your data. It's easy to forget to do this or to put in the wrong one if you're in a hurry. If you discover that you've made a mistake, you don't need to re-import the table; you can set the SRID to the correct projection with an update command.
To check your projection, you can ask the database to look in one of its internal tables. For our data, it looks like this:
SELECT * FROM geometry_columns
WHERE f_table_name = 'flowlines';For our data, setting the projection would look like this:
UPDATE flowlines SET geometry = SetSRID(geometry, 3310);This query replaces the contents of the geometry column with the results of the SetSRID command. In our case, it doesn't really do anything new since we had our projection set correctly, but you should know how to do this, so we did.
To change the projection of a dataset, you need to use the Transform or ST_Transform command.
Let's transform our watershed data into UTM Zone 10 North, the zone that San Francisco falls into.
First, we'll start with a query that results in a returning information (but doesn't make a new table):
SELECT pk_uid, huc8, name, geometry FROM watersheds;We have a table with a subset of the columns from the original watersheds table. Now let's work on transforming the geom column. We'll add a function around the geometry column to reproject the data. 26910 is the SRID for UTM Zone 10 N.
SELECT pk_uid, huc8, name, ST_Transform(geometry, 26910) FROM watersheds;It may look like nothing happened, but the column heading on the geometry column should have changed. Finally, we'll want do something to keep this data. Remember that a SELECT statement just returns information, it doesn't save it. We have two options. (1) We could overwrite the geometry column of the watersheds table, but that will mean the projection won't match the other data we have. (2) The other option is to make a new table with a different projection. We can do this by adding a CREATE TABLE statement to the query we already have (adding an alias to the geometry column):
CREATE TABLE watersheds_utm_10n AS
SELECT pk_uid, huc8, name, ST_Transform(geometry, 26910) AS geometry
FROM watersheds;To see the new table in the list, we'll need to refresh our database. On the left panel, right click on "User Data" and select Refresh. Now we can see the table, but it's just a table. Notice that the icon is different from the other tables. The database doesn't seem to know the table is actually polygons.
SELECT * FROM watersheds_utm_10n; Shows that all the columns we asked for, including the geometry column are there. What's going on? We need to recover the geometry column so the database will recognize the table as a spatial table.
SELECT RecoverGeometryColumn('watersheds_utm_10n', 'geometry', 26910, 'MULTIPOLYGON', 'XY');This query will return a single column and row with the number 1 in the only cell. If it returns 0, the query didn't work.
Now we need to vacuum the database (yes, that sounds a little odd). Run the command VACUUM; in the SQL Query panel. This will clean up any issues created by all the changes we just made to the database.
You may need to refresh your database list again before the icon will change.
Spatial joins allow us to combine information from one table with another based on the location associated with each record. When we write a query involving two or more tables, we need to specify which table any column names come from. We do this by giving the table name before the column - table.column . Let's see if we can figure out which watershed each of our flowlines is in:
SELECT flowlines.*, watersheds.name AS watershed_name
FROM flowlines, watersheds
WHERE ST_Contains(watersheds.geometry, flowlines.geometry);Your table should look just like your flowlines table, but we've added the NAME column from our watersheds table (but called it "Watershed_Name" because this will make more sense if we needed to use the this data later and didn't remember where this information came from).
ST_Contains tells us if a line is completely within a particular watersheds polygon. How would you change this query to identify which watershed each line intersects rather than is contained by? Hint: SpatiaLite Function Reference List
We've discussed that one of the big reasons to use a database is not needing to save a bunch of intermediate data products. You just keep your data in tables and you write queries to combine them in various ways. But what if you find yourself needing the results a particular query quite often? Do you save a table? There's another option: the view. A view is like a saved query. It behaves like a table so you can call it in other queries or view it in QGIS, but the records in the table are generated by running a query rather than storing the same data over again.
Let's make a view! Let's say we want to have a view that contains all the rivers (so, FTYPE = 460, like before) that are in the Tomales-Drake Bays watershed and let's put the watershed name into the table as well.
First, let's make a query to make sure we're getting the information we want.
SELECT flowlines.*, watersheds.name AS watershed_name
FROM flowlines, watersheds
WHERE ST_Contains(watersheds.geometry, flowlines.geometry)
AND ftype = 460
AND watersheds.name LIKE 'Tomales-Drake Bays';That query looks like what we want. Let's make it into a view. View creation sytax is really similar to the way you make a table:
CREATE VIEW rivers_tomales AS
SELECT flowlines.*, watersheds.name AS watershed_name
FROM flowlines, watersheds
WHERE ST_Contains(watersheds.geometry, flowlines.geometry)
AND ftype = 460
AND watersheds.name LIKE 'Tomales-Drake Bays';Refresh your table list to see if on the left side of the screen. This is a non-spatial view. To have the database recognize it as a spatial view, we need one more step that registers the view in the table that keeps track of which views are spatial. (If this sounds complicated, just file this away as a step that needs to get done.)
INSERT INTO views_geometry_columns
(view_name, view_geometry, view_rowid, f_table_name, f_geometry_column, read_only)
VALUES ('rivers_tomales', 'geometry', 'pk_uid', 'flowlines', 'geometry', 0);What does this query do? I'll break it down.
INSERT INTO views_geometry_columns tells the databse to put data into the views_geometry_columns table. This is a table that comes standard with Spatialite databases. It was there as soon as we made the database. It tells the rest of the databse which views are spatial. There's a similar table for listing spatial tables.
(view_name, view_geometry, view_rowid, f_table_name, f_geometry_column, read_only) is the list of columns we want to put information into in the views_geometry_columns table. We'll be making a new row of data in this table, and these are the columns where that data is going to go.
VALUES ('rivers_tomales', 'geometry', 'pk_uid', 'flowlines', 'geometry', 0)
VALUES says to the database, "here is the list of items to put in the columns I told you about in the last line", then we put the list of things in parentheses. So 'rivers_tomales' (the new table name) will go into the 'view_name' field. 'geometry' is the geometry column for the view so it goes into the 'view_geometry' field of the table. 'flowlines' is the table that our 'rivers_tomales' view inherits its spatial data from and it has a geometry column called 'geometry' as well. Finally, 'read_only' takes either a 0 to make the table read-only or 1 to make it writable. Read-only is a good choice here since we won't be adding or changing the view's contents.
Refresh your table list to see that rivers_tomales is now a spatial view.
SELECT * FROM rivers_tomales; to see your new view. It will function just like a table.
Tables and views are usefull, but this is spatial data so it might be nice to look at the data in map form. We can look at our spatial tables and views directly in QGIS. Let's look at some of our tables and views:
- Open QGIS and start a new project
- In the Browser Panel (usually on the left by default... add it if it's missing: View menu -> Panels -> Browser Panel). Scroll down to find the SpatiaLite icon. If you've previously connected SpatiaLite tables, there will be a small triangle to expland the list. If not, it's just the icon and text label.
- Right click on the Spatialite section in the Browser Panel and select New connection
- Navigate to and select your sfbay.sqlite database. Click Open.
- In the Browser Panel, expand your sfbay.sqlite database to see the tables and views. It will list the tables we created plus some default tables that come with the database that we don't need to worry about.
- Double click the tables or views to add them to the map canvas.
Note: Tables and (especially) views with a lot of records will take a while to load. If a dialog asks about the CRS transformation you want to use, select an option that makes sense for your data, then click OK. I advise adding one at a time and saving your QGIS project after each table addition just in case it crashes.
Now you can access the symbology, labels, and other standard tools in the Layer Properties for each layer, just as you would any other spatial dataset in QGIS.
NOTE: QGIS' DB Manager tool is an alternative to Spatialite that comes standard with QGIS and can connect to a number of spatial databases.
Not surprisingly, you can use a spatial database to do more than just get lengths and areas of existing geometries, or change projections. Let's go back to Spatialite and explore some more spatial analysis possibilities.
Let's find out which watershed is closest to the city of San Francisco. We could go about this a number of ways, but let's find the distance from the city's center point to the centroid of each watershed:
SELECT ST_Distance(MakePoint(37.7749, -122.4194, 3310), centroids.geometry)
FROM centroids;Here we used MakePoint() to turn a set of latitude/longitude coordinates into a format that the database tool understands in the CA Albers projection (EPSG 3310), then put the results into the Distance() function.
How could you make this table more informative? Could you add or rename some columns?
One interesting thing about SQL is that you can nest functions to do a series of functions in one query like you just saw above, but it can get more complex. For example, maybe I want to find out the area (in square kilometers) within 1 kilometer of all the flowlines.
SELECT sum(ST_Area(ST_Buffer(geometry, 1000)))/1000000 FROM flowlines;Here, we take the sum of the area of the buffer of 1000 meters, then divide the whole thing by 1,000,000 to convert square meters to square kilometers. Wow. That's pretty complicated. But I didn't have to make a bunch of intermediate files and add columns to an attribute table, then save a CSV, then sum it all up in Excel. Now which option sounds crazier? Perhaps you're starting to see some of the power of spatial SQL.