1 change: 1 addition & 0 deletions .gitignore
@@ -4,6 +4,7 @@
/site_libs/
/_freeze/
/*/*_files/
/*/*.html
/*/*.ipynb/
TODO.md
Manifest.toml
11 changes: 8 additions & 3 deletions EDA/bivariate-julia.qmd
@@ -88,7 +88,7 @@ Putting the categorical variable first presents a graphic (@fig-grouped-dotplot



Regardless of how the graphic is produced, there appears to be a difference in the centers based on the species, as would be expected -- different species have different sizes.
Regardless of how the graphic is produced, there appears to be a difference in the centers based on the species, as would be expected---different species have different sizes.



@@ -207,7 +207,7 @@ x_{1}, & x_{2}, \dots, x_{n}\\
y_{1}, & y_{2}, \dots, y_{n}
\end{align*}

Or -- to emphasize how the data is paired off -- as $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$.
Or---to emphasize how the data is paired off---as $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$.
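In `Julia`, `zip` pairs two vectors off case by case in exactly this way (a minimal sketch with made-up measurement values):

```julia
# Two measurements recorded on the same n cases (hypothetical values)
x = [1.4, 1.3, 4.7, 4.5]
y = [0.2, 0.2, 1.4, 1.5]

# zip pairs the data off as (x1, y1), (x2, y2), ..., (xn, yn)
pairs_xy = collect(zip(x, y))
```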

### Numeric summaries

@@ -448,12 +448,17 @@ lm(@formula(PetalWidth ~ PetalLength), d)

The output has more detail to be explained later. For now, we only need to know that the method `coef` will extract the coefficients (in the first column) as a vector of length 2, which we assign to the values `bhat0` and `bhat1` below:

::: {#fig-regression-jitter}

```{julia}
scatter(jitter(l), jitter(w); legend=false) # spread out values
bhat0, bhat1 = coef(res) # the coefficients
plot!(x -> bhat0 + bhat1 * x) # `predict` does this generically
```

Scatter plot with computed regression line
:::

::: {.callout-note}
##### A constant model

Expand Down Expand Up @@ -797,7 +802,7 @@ First, suppose we simply adjust the fitted lines up or down for each cluster. Th
m2 = lm(@formula(PetalLength ~ PetalWidth + Species), iris)
```

The second row in the output of `m2` has an identical interpretation as for `m1` -- it is the slope of the regression line. The first line of the output in `m1` is the $x$-intercept, which moves the line up or down. Whereas the first of `m2` is the $x$ intercept for a line that describes *just one* of the species, in this case `setosa`. (A coding for the regression model with a categorical variable chooses one reference level, in this case "setosa."). The 3rd and 4th lines are the slopes for the other two species.
The second row in the output of `m2` has an identical interpretation as for `m1`---it is the slope of the regression line. The first line of the output of `m1` is the intercept, which moves the line up or down. The first line of `m2`, by contrast, is the intercept for a line that describes *just one* of the species, in this case `setosa`. (A coding for the regression model with a categorical variable chooses one reference level, in this case "setosa.") The 3rd and 4th lines are the intercept adjustments for the other two species.

We can plot these individually, one-by-one, in a similar manner as before, however when we call `predict` we include a level for `:Species`. The result is the middle figure in @fig-iris-scatterplot-regression.
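To make the arithmetic concrete, here is a sketch using hypothetical coefficient values (not the ones `coef(m2)` actually returns): in this additive model the species coefficients shift each line up or down, while the slope is shared.

```julia
# Hypothetical coefficients, in the order the model output lists them:
b0  = 1.08   # intercept for the reference level, setosa
b1  = 1.03   # common slope for PetalWidth
b2v = 0.45   # intercept adjustment for versicolor
b2g = 0.61   # intercept adjustment for virginica

# Each species gets its own line with the same slope:
setosa(x)     = b0 + b1 * x
versicolor(x) = (b0 + b2v) + b1 * x   # shifted up by b2v
virginica(x)  = (b0 + b2g) + b1 * x   # shifted up by b2g

# The vertical gap between two lines is constant in x:
versicolor(1.5) - setosa(1.5)
```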

2 changes: 1 addition & 1 deletion EDA/categorical-data-julia.qmd
@@ -294,7 +294,7 @@ plot(p1, p2, layout = (@layout [a b]))

As seen in the left graphic of @fig-grouped-barchart, there are groups of bars for each level of the first variable (`:Sex`); the groups represent the variable passed to the `group` keyword argument. The values are looked up in the data frame with the computed column that was named `:value` through the `combine` function.

The same graphic on the left -- without the labeling -- is also made more directly with `groupedbar(freqtable(survey, :Sex, :Smoke))`
The same graphic on the left---without the labeling---is also made more directly with `groupedbar(freqtable(survey, :Sex, :Smoke))`


#### Andrews plot
12 changes: 10 additions & 2 deletions EDA/makie.qmd
@@ -72,7 +72,7 @@ Both the `mapping` and `visual` calls can be used to set attributes:

The attributes are those for the underlying plotting function. For `visual(BoxPlot)`, these can be seen at the help page for `boxplot`, displayed with the command `?boxplot`.

The `mapping` calls shows two uses of the mini language for data manipulation. The basic form is `source => function => target` and works very much like the DataFrames mini language does for `select` or `transform`, but unlike those, the function is *always* applied by row. This makes some transformations, such as $z$-scores not possible within this call -- transformations requiring the entire column need to be done within the values passed to `data`. The abbreviated forms are just `source`, as used with the `color=:species` argument; `source => function`; and `source => target`, such as `:bill_length_mm => "bill length (mm)"` used to rename the variable for labeling purposes. When the source involves more than one column selector, tuples should be used to group them.
The `mapping` call shows two uses of the mini language for data manipulation. The basic form is `source => function => target` and works very much like the DataFrames mini language does for `select` or `transform`, but unlike those, the function is *always* applied by row. This makes some transformations, such as $z$-scores, not possible within this call---transformations requiring the entire column need to be done within the values passed to `data`. The abbreviated forms are just `source`, as used with the `color=:species` argument; `source => function`; and `source => target`, such as `:bill_length_mm => "bill length (mm)"` used to rename the variable for labeling purposes. When the source involves more than one column selector, tuples should be used to group them.
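The `source => function => target` form is nothing more than nested `Pair`s built by Julia's right-associative `=>` operator, as this small illustration (using an assumed penguins column name) shows:

```julia
# => is right-associative, so this parses as :bill_length_mm => (log => "…")
spec = :bill_length_mm => log => "log bill length"

spec.first           # the source column
spec.second.first    # the row-wise function
spec.second.second   # the target label
```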

A few functions are provided to bypass the usual mapping of the data. (For example, `color` maps levels of a factor to a color ramp behind the scenes.) Among these are `nonnumeric` to pass a numeric variable to a value expecting a categorical variable and `verbatim` to avoid this mapping. The latter, `=> verbatim`, will be necessary to add when annotating a figure.

@@ -337,7 +337,9 @@ Quantile-quantile plots. The left graphic shows `QQPlot` used to compare the dis

A scatter plot shows $x$ and $y$ pairs as points; a line plot connects these points. There are numerous ways to draw lines with `AlgebraOfGraphics`, including: `visual(Lines)`, for connect-the-dots lines; `visual(LinesFill)`, for shading; `visual(HLines)` and `visual(VLines)`, for horizontal and vertical lines; and `visual(Rangebars)`, to draw vertical or horizontal line segments.

The graph of a function can be drawn using `Lines`, as in this example, where we add in different range bars to emphasize the role that the two parameters play in this function's graph:
The graph of a function can be drawn using `Lines`, as in the example shown in @fig-line-plot, where we add in different range bars to emphasize the role that the two parameters play in the function's graph.

::: {#fig-line-plot}

```{julia}
ϕ(x; μ=0, σ=1) = 1/sqrt(2*pi*σ^2) * exp(-(1/(2σ^2)) * (x - μ)^2)
@@ -358,6 +360,9 @@ c += data((x=[1/10, 1/2], y=[0, ϕ(1)], label=["μ", "σ"])) *
draw(c)
```

Density of standard normal distribution with annotations
:::

The `Rangebars` visual has a `direction` argument, used above to make a horizontal range bar.

The annotation has two subtleties: the qualification of `Makie.Text` is needed, as there is a `Text` type in base `Julia`. More idiosyncratically, the use of `verbatim` in `mapping` is needed to avoid an attempt to map the labels to a glyph, such as a pre-defined marker.
@@ -416,13 +421,16 @@ f

A corner plot, as produced by the `PairPlots` package through its `pairplot` function, is a quick plot to show pair-wise relations amongst multiple numeric values. The graphic uses the lower part of a grid to show paired scatterplots with, by default, contour lines highlighting the relationship. On the diagonal are univariate density plots.

::: {#fig-pairplot}
```{julia}
using PairPlots
nms = names(penguins, 3:5)
p = select(penguins, nms .=> replace.(nms, "_mm" => "", "_" => " ")) # adjust names
pairplot(p)
```

Corner plot produced by the `PairPlots` package
:::

### 3D scatterplots

16 changes: 9 additions & 7 deletions EDA/tabular-data-julia.qmd
@@ -33,7 +33,7 @@ There are different ways to construct a data frame.

Consider the task of the Wirecutter in trying to select the best [carry on travel bag](https://www.nytimes.com/wirecutter/reviews/best-carry-on-travel-bags/#how-we-picked-and-tested). After compiling a list of possible models by scouring travel blogs etc., they select some criteria (capacity, compartment design, aesthetics, comfort, ...) and compile data, similar to what one person collected in a
[spreadsheet](https://docs.google.com/spreadsheets/d/1fSt_sO1s7moXPHbxBCD3JIKPa8QIZxtKWYUjD6ElZ-c/edit#gid=744941088).
Here we create a much simplified spreadsheet for 3 listed bags with measurements of volume, price, laptop compatibility, loading style, and a last-checked date -- as this market improves constantly.
Here we create a much simplified spreadsheet for 3 listed bags with measurements of volume, price, laptop compatibility, loading style, and a last-checked date---as this market improves constantly.

```
product v p l loads checked
@@ -140,7 +140,7 @@ push!(d, Dict(:b => "Genius", :v => 25, :p => 228, :lap => "Y",
:load => "clamshell", :d => Date("2022-10-01")))
```

(A dictionary is a `key => value` container like a named tuple, but keys may be arbitrary `Julia` objects -- not always symbols -- so we explicitly use symbols in the above command.)
(A dictionary is a `key => value` container like a named tuple, but keys may be arbitrary `Julia` objects---not always symbols---so we explicitly use symbols in the above command.)
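A quick illustration of the difference (with made-up keys and values):

```julia
# Dict keys may be arbitrary Julia objects -- strings, symbols, numbers, ...
d1 = Dict("name" => "Genius", :volume => 25)
d1["name"]     # looked up by string key
d1[:volume]    # looked up by symbol key

# A named tuple, by contrast, always uses symbols as its keys:
nt = (name = "Genius", volume = 25)
keys(nt)
```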

::: {.callout-note}
##### The `Tables` interface
@@ -185,9 +185,10 @@ The filename may be more general. For example, it could be `download(url)` for

::: {.callout-note}
##### Read and write
The methods `read` and `write` are qualified in the above usage with the `CSV` module. In the `Julia` ecosystem, the `FileIO` package provides a common framework for reading and writing files; it uses the verbs `load` and `save`. This can also be used with `DataFrames`, though it works through the `CSVFiles` package -- and not `CSV`, as illustrated above. The read command would look like `DataFrame(load(fname))` and the write command like `save(fname, df)`. Here `fname` would have a ".csv" extension so that the type of file could be detected.
The methods `read` and `write` are qualified in the above usage with the `CSV` module. In the `Julia` ecosystem, the `FileIO` package provides a common framework for reading and writing files; it uses the verbs `load` and `save`. This can also be used with `DataFrames`, though it works through the `CSVFiles` package---and not `CSV`, as illustrated above. The read command would look like `DataFrame(load(fname))` and the write command like `save(fname, df)`. Here `fname` would have a ".csv" extension so that the type of file could be detected.
:::


| Command | Description |
|---------|-------------|
| `CSV.read(file_name, DataFrame)` | Read csv file from file with given name |
@@ -197,7 +198,8 @@
| `DataFrame(load(file_name))` | Read csv file from file with given name using `CSVFiles` |
| `save(file_name, df)` | Write data frame `df` to a csv file using `CSVFiles` |

: Basic usage to read/write `.csv` file into a data frame.
: Basic usage to read/write `.csv` file into a data frame. {#tbl-read-write-data-frame}


#### TableScraper

@@ -266,7 +268,7 @@
match "name" somewhere in the string; `r"^name"` and `r"name$"` will
match "name" at the beginning and ending of a string. Using a regular
expression will return a data frame row (when a row index is
specified) -- not a value -- as it is possible to return 0, 1 or more
specified)---not a value---as it is possible to return 0, 1 or more
columns in the selection.
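The anchoring behavior can be checked directly with `occursin`, here against a few made-up column names:

```julia
cols = ["name", "nickname", "surname", "age"]

# r"name" matches "name" anywhere in the string
anywhere = filter(c -> occursin(r"name", c), cols)

# r"^name" matches only at the beginning; r"name$" only at the end
at_start = filter(c -> occursin(r"^name", c), cols)
at_end   = filter(c -> occursin(r"name$", c), cols)
```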


@@ -395,7 +397,7 @@ For the `cars` data set, the latter can be used to extract the Volkswagen models
cars[cars.Manufacturer .== "Volkswagen", :]
```

This approach lends itself to the description "find all rows matching some value" then "extract the identified rows," -- written as two steps to emphasize there are two passes through the data. Another mental model would be loop over the rows, and keep those that match the query. This is done generically by the `filter` function for collections in `Julia` or by the `subset` function of `DataFrames`.
This approach lends itself to the description "find all rows matching some value," then "extract the identified rows"---written as two steps to emphasize there are two passes through the data. Another mental model would be to loop over the rows and keep those that match the query. This is done generically by the `filter` function for collections in `Julia` or by the `subset` function of `DataFrames`.

The `filter(predicate, collection)` function is used to identify just the values in the collection for which the predicate function returns `true`. When a data frame is used with `filter`, the iteration is over the rows, so the wrapping `eachrow` iterator is not needed. We need a predicate function to replace the `.==` above. One follows. It doesn't need `.==`, as `r` is a data frame row and access produces a value not a vector:
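The row-wise predicate idea can be sketched with plain named tuples standing in for data frame rows (made-up data, not the actual `cars` set):

```julia
rows = [(Manufacturer = "Volkswagen", Model = "Golf"),
        (Manufacturer = "Toyota",     Model = "Corolla"),
        (Manufacturer = "Volkswagen", Model = "Passat")]

# The predicate sees one row at a time, so == (not .==) compares values:
vws = filter(r -> r.Manufacturer == "Volkswagen", rows)
```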

@@ -796,7 +798,7 @@ legos.youngest_age = categorical(legos.youngest_age, ordered=true)
first(legos[:,r"age"], 2)
```

With that ordering, an expected pattern becomes clear -- kits for older users have on average more pieces -- though there are unexpected exceptions:
With that ordering, an expected pattern becomes clear---kits for older users have on average more pieces---though there are unexpected exceptions:

```{julia}
@chain legos begin