WIP use Tables.Columns instead of columntable#247
WIP use Tables.Columns instead of columntable#247kleinschmidt wants to merge 8 commits intomasterfrom
columntable#247Conversation
For which situations do we need |
| cols = termvars(formula) | ||
| materialize = Tables.materializer(data) | ||
| data = materialize(TableOperations.select(cols...)(data)) | ||
| drop = TableOperations.narrowtypes() ∘ TableOperations.dropmissing() |
There was a problem hiding this comment.
AFAICT TableOperations.dropmissing operates row-wise (it calls filter). I'm afraid this is going to kill performance for data frames.
Maybe an optimized method for column tables could be added? (EDIT: That's probably doable, as we can use a faster approach than filter since we know that the condition can be computed separately for each row.) Another solution would be to define dropmissing in DataAPI, say that dropmissing(::Any) is owned by TableOperations, but have dropmissing(::DataFrame) be defined in DataFrames.
Also, narrowtypes is a much more costly operation that just doing nonmissingtype(eltype(col)) as it requires going over all entries. DataFrames's dropmissing does that by default, maybe TableOperations could take a similar argument.
| function schema(ts::AbstractVector{<:AbstractTerm}, | ||
| data, | ||
| hints::Dict{Symbol}=Dict{Symbol,Any}()) | ||
| data = Tables.Columns(Tables.columns(data)) |
There was a problem hiding this comment.
What is the advantage of wrapping the result of Tables.columns in a Tables.Columns object?
| # if the "hint" is already an AbstractTerm, use that | ||
| # need this specified to avoid ambiguity | ||
| concrete_term(t::Term, d::ColumnTable, hint::AbstractTerm) = hint | ||
| concrete_term(t::Term, d::Tables.Columns, hint::AbstractTerm) = hint |
There was a problem hiding this comment.
Why not just
| concrete_term(t::Term, d::Tables.Columns, hint::AbstractTerm) = hint | |
| concrete_term(t::Term, d, hint::AbstractTerm) = hint |
This uses
Tables.Columnsas a (potentially) lightweight wrapper around input tables that does not convert them to the strongly typedNamedTupleofVectors representation. This might make some things easier on the compiler (e.g. #220 ).Requires Tables 1.6.0 since that's when
Columnsstopped being a lie ;)There's some design issues to work out here still, since a generic
NamedTuplecould be EITHER a column table (if it contains vectors) or a single row, and there are a handful of methods that specialize on that to provide special handling (most notablymodelcols(::InteractionTerm, ...)). What we PROBABLY will need to do is to add parallel methods forRowin a similar fashion, but I'm not sure about that. In the mean time, merging this would be breaking since you lose first class support for named tuples of singletons, which is part of the current public API. There may be a way around that but I haven't dug into it yet...