WIP use Tables.Columns instead of `columntable` by kleinschmidt · Pull Request #247 · JuliaStats/StatsModels.jl

kleinschmidt · 2021-10-15T17:16:40Z

This uses Tables.Columns as a (potentially) lightweight wrapper around input tables that does not convert them to the strongly typed NamedTuple of Vectors representation. This might make some things easier on the compiler (e.g. #220).

Requires Tables 1.6.0 since that's when Columns stopped being a lie ;)

There are some design issues to work out here still, since a generic NamedTuple could be EITHER a column table (if it contains vectors) or a single row, and a handful of methods specialize on that distinction to provide special handling (most notably modelcols(::InteractionTerm, ...)). What we PROBABLY will need to do is add parallel methods for Row in a similar fashion, but I'm not sure about that. In the meantime, merging this would be breaking, since you lose first-class support for named tuples of singletons, which is part of the current public API. There may be a way around that, but I haven't dug into it yet...
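The NamedTuple ambiguity described above can be illustrated with a minimal sketch in plain Julia. The alias `VectorColumns` and the helper `tablekind` are hypothetical names made up for this illustration; the alias is only modeled on the idea behind the column-table type and is not the definition used by Tables.jl or StatsModels.

```julia
# Simplified local alias (an assumption for illustration): a NamedTuple
# whose fields are all vectors is treated as a column table.
const VectorColumns = NamedTuple{names, T} where {names, T <: Tuple{Vararg{AbstractVector}}}

# Hypothetical helper, not part of any package API.
# The more specific method wins for NamedTuples of vectors.
tablekind(d::VectorColumns) = :columns
tablekind(d::NamedTuple) = :row

cols  = (a = [1, 2, 3], b = ["x", "y", "z"])  # every field is a vector
row   = (a = 1, b = "x")                       # "named tuple of singletons"
mixed = (a = [1, 2, 3], b = "x")               # not all fields are vectors

println(tablekind(cols))
println(tablekind(row))
println(tablekind(mixed))
```

Once the data is wrapped in another type, dispatch on the NamedTuple type alone can no longer separate these cases, which is why parallel methods for rows come up.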

nalimilan · 2023-01-27T07:08:10Z

> There are some design issues to work out here still, since a generic NamedTuple could be EITHER a column table (if it contains vectors) or a single row, and a handful of methods specialize on that distinction to provide special handling (most notably modelcols(::InteractionTerm, ...)). What we PROBABLY will need to do is add parallel methods for Row in a similar fashion, but I'm not sure about that. In the meantime, merging this would be breaking, since you lose first-class support for named tuples of singletons, which is part of the current public API. There may be a way around that, but I haven't dug into it yet...

For which situations do we need modelcols to take a row? For clarity, we could require row objects to be Tables.AbstractRow even if Tables.jl doesn't require that. Otherwise confusing things could happen (weird errors when the Tables.istable check fails for some reason...).

nalimilan · 2023-01-27T06:56:55Z

src/modelframe.jl

+    cols = termvars(formula)
+    materialize = Tables.materializer(data)
+    data = materialize(TableOperations.select(cols...)(data))
+    drop = TableOperations.narrowtypes() ∘ TableOperations.dropmissing()


AFAICT TableOperations.dropmissing operates row-wise (it calls filter). I'm afraid this is going to kill performance for data frames.

Maybe an optimized method for column tables could be added? (EDIT: That's probably doable, as we can use a faster approach than filter since we know that the condition can be computed separately for each row.) Another solution would be to define dropmissing in DataAPI, say that dropmissing(::Any) is owned by TableOperations, but have dropmissing(::DataFrame) be defined in DataFrames.

Also, narrowtypes is a much more costly operation than just doing nonmissingtype(eltype(col)), as it requires going over all entries. DataFrames's dropmissing does the latter by default; maybe TableOperations could take a similar argument.
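A rough sketch of the column-wise approach suggested in this comment, using only Base Julia. The function names `complete_rows` and `dropmissing_columns` are hypothetical, and the input is assumed to be a NamedTuple of equal-length vectors; this is not the TableOperations or DataFrames implementation.

```julia
# Build a row mask one column at a time instead of filtering row by row.
function complete_rows(cols::NamedTuple)
    keep = trues(length(first(cols)))
    for col in cols
        # Columns whose eltype excludes Missing cannot hold missings: skip them.
        Missing <: eltype(col) || continue
        keep .&= .!ismissing.(col)
    end
    return keep
end

# Drop incomplete rows, then narrow each column's eltype from the type alone.
# nonmissingtype(eltype(col)) does not scan the entries, unlike a full
# narrowing pass over the values.
function dropmissing_columns(cols::NamedTuple)
    keep = complete_rows(cols)
    return map(cols) do col
        convert(Vector{nonmissingtype(eltype(col))}, col[keep])
    end
end

data = (a = [1, missing, 3], b = ["x", "y", missing], c = [1.0, 2.0, 3.0])
clean = dropmissing_columns(data)
println(clean)
```

Only the first row of `data` has no missing value in any column, so it is the only row kept.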

nalimilan · 2023-01-27T07:02:03Z

src/schema.jl

+function schema(ts::AbstractVector{<:AbstractTerm},
+                data,
+                hints::Dict{Symbol}=Dict{Symbol,Any}())
+    data = Tables.Columns(Tables.columns(data))


What is the advantage of wrapping the result of Tables.columns in a Tables.Columns object?

nalimilan · 2023-01-27T07:02:51Z

src/schema.jl

 # if the "hint" is already an AbstractTerm, use that
 # need this specified to avoid ambiguity
-concrete_term(t::Term, d::ColumnTable, hint::AbstractTerm) = hint
+concrete_term(t::Term, d::Tables.Columns, hint::AbstractTerm) = hint


Why not just

Suggested change

-concrete_term(t::Term, d::Tables.Columns, hint::AbstractTerm) = hint
+concrete_term(t::Term, d, hint::AbstractTerm) = hint
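For context on the "avoid ambiguity" comment in the hunk above, here is a toy sketch of how a method ambiguity arises in Julia when one method specializes on the data argument and another on the hint. All names (`AbstractHint`, `UseAsIs`, `pick`) are invented for this illustration. Whether the same situation applies to the untyped suggestion depends on the other `concrete_term` methods, which are not shown here.

```julia
# Toy stand-ins; none of these names come from StatsModels.
abstract type AbstractHint end
struct UseAsIs <: AbstractHint end

# Method A: specializes on the data argument, accepts any hint.
pick(t::Symbol, d::NamedTuple, hint) = :inferred_from_data
# Method B: specializes on the hint, accepts any data.
pick(t::Symbol, d, hint::AbstractHint) = :hint_used_directly

# For (Symbol, NamedTuple, UseAsIs) neither A nor B is more specific,
# so the call throws a MethodError reporting the ambiguity.
ambiguous = try
    pick(:x, (x = [1, 2],), UseAsIs())
    false
catch err
    err isa MethodError
end

# Method C: specializing on both arguments breaks the tie.
pick(t::Symbol, d::NamedTuple, hint::AbstractHint) = :hint_used_directly
resolved = pick(:x, (x = [1, 2],), UseAsIs())

println((ambiguous, resolved))
```

If no remaining method specializes on the data argument alone, the untyped form is enough and the extra tie-breaking method is unnecessary.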

kleinschmidt added 2 commits September 13, 2021 08:52

WIP replace ColumnTable with Columns

e6a5365

WIP

d7d1283

kleinschmidt mentioned this pull request Oct 15, 2021

fit is very slow for new formulas #220

Open

kleinschmidt added 6 commits October 25, 2021 16:51

Tables does this for us

8a00097

tables compat for Columns

89adab5

use Columns

3810c48

make it say Vector

8dcbf62

need to matieralize after select before dropmissing

6032e6e

do we _really_ need to specialize on Columns?

59601d8

kleinschmidt mentioned this pull request Jan 24, 2023

roadmap to 1.0 release #271

Open

5 tasks

nalimilan reviewed Jan 27, 2023

kleinschmidt wants to merge 8 commits into master from dfk/columns
