# jemdoc: menu{MENU2}{triptych.html}
= ADA Lab @ UCSD

~~~
*Note:* This umbrella project webpage is now deprecated.
Please see the webpages of the active projects Cerebro and SortingHat.
~~~

~~~
{}{img_left}{images/triptych.jpg}{}{100px}{}{}
== Project Triptych
~~~

=== Overview

Triptych is an end-to-end /model selection management system/ (MSMS) that aims to simplify
and accelerate the process of sourcing data\/features and selecting ML models. Our guiding
principles are to exploit the semantics of the data and the ML task as far as possible,
both to reduce the data scientist's work and to cut runtimes and costs. We apply these
principles in component projects that each remove or mitigate a different bottleneck in
this end-to-end process, and we aim to eventually unify these components into an
integrated ``operating system'' for ML analytics tasks. Please refer to the ACM SIGMOD
Record paper below for more details on this vision.

=== Active Component Projects

~~~
{}{img_left}{images/cerebro.jpg}{}{80px}{}{}
[cerebro.html *Cerebro*]\n
Efficient and reproducible model selection on deep learning systems.
~~~

~~~
{}{img_left}{images/morpheus.jpg}{}{80px}{}{}
[morpheus.html *Morpheus*]\n
Integrating linear algebra and relational algebra to simplify feature engineering for ML.
~~~

~~~
{}{img_left}{images/sortinghat.jpg}{}{80px}{}{}
[sortinghat.html *SortingHat*]\n
ML schema inference and automatic data preparation.
~~~


=== Publications

- Some Damaging Delusions of Deep Learning Practice (and How to Avoid Them)\n
Arun Kumar, Supun Nakandala, and Yuhao Zhang\n
KDD 2021 Deep Learning Day | [papers/2021_DLDelusions_KDD.pdf Extended Abstract PDF]
| [papers/2021_DLDelusions_KDD_Slides.pdf Talk slides]
| [https://www.youtube.com/watch?v=UP9__WsfSuc Talk video]

- Towards an Optimized GROUP BY Abstraction for Large-Scale Machine Learning\n
Side Li and Arun Kumar\n
VLDB 2021 | [papers/2021_Kingpin_VLDB.pdf Paper PDF] | [papers/TR_2021_Kingpin.pdf TechReport] | [https://www.youtube.com/watch?v=OlTknBfBmvM Talk video] | [https://github.com/liside/Kingpin Code Release]

- Distributed Deep Learning on Data Systems: A Comparative Analysis of Approaches\n
Yuhao Zhang, Frank McQuillan, Nandish Jayaram, Nikhil Kak, Ekta Khanna, Orhan Kislal, Domino Valdano, and Arun Kumar\n
VLDB 2021 | [papers/2021_Cerebro-DS.pdf Paper PDF] | [papers/TR_2021_Cerebro-DS.pdf TechReport] | [https://youtu.be/SK9wTzO4K7M Talk video] | [https://github.com/makemebitter/cerebro-ds/ Code release]

- Intermittent Human-in-the-Loop Model Selection using Cerebro: A Demonstration\n
Liangde Li, Supun Nakandala, and Arun Kumar\n
VLDB 2021 Demo | [papers/2021_Cerebro_VLDB_Demo.pdf Paper PDF] | [papers/TR_2021_Intermittent_HIL_MS.pdf TechReport] | [https://youtu.be/K3THQy5McXc Video]

- Towards A Polyglot Framework for Factorized ML\n
David Justo, Shaoqing Yi, Lukas Stadler, Nadia Polikarpova, and Arun Kumar\n
VLDB 2021 (Industrial Track) | [papers/2021_Trinity_VLDB.pdf Paper PDF] | [papers/TR_2021_Trinity.pdf TechReport] | [https://www.youtube.com/watch?v=osvBmZs2MsM Talk video] | Code coming soon

- Towards Benchmarking Feature Type Inference for AutoML Platforms\n
Vraj Shah, Jonathan Lacanlale, Premanand Kumar, Kevin Yang, and Arun Kumar\n
ACM SIGMOD 2021 | [papers/2021_SortingHat_SIGMOD.pdf Paper PDF] | [papers/TR_2021_SortingHat.pdf TechReport] | Talk Videos: [https://youtu.be/KAs-uU59AEM Short Talk] [https://youtu.be/dpx74zQyU3k Long Talk] | [https://github.com/pvn25/ML-Data-Prep-Zoo/tree/master/MLFeatureTypeInference Data, Code, and Pre-trained Models on GitHub] | [https://github.com/pvn25/ML-Data-Prep-Zoo/tree/master/MLFeatureTypeInference/Library Python library]

- The CNN Hip Accelerometer Posture (CHAP) Method for Classifying Sitting Patterns from Hip Accelerometers: A Validation Study\n
Mikael Anne Greenwood-Hickman, Supun Nakandala, Marta M. Jankowska, Fatima Tuz-Zahra, John Bellettiere, Jordan Carlson, Paul R. Hibbing, Jingjing Zou, Andrea Z. LaCroix, Arun Kumar, and Loki Natarajan\n
Medicine and Science in Sports and Exercise Journal, 2021 | Paper PDF coming soon | [https://github.com/ADALabUCSD/DeepPostures Code]

- Application of Convolutional Neural Network Algorithms for Advancing Sedentary and Activity Bout Classification\n
 Supun Nakandala, Marta Jankowska, Fatima Tuz-Zahra, John Bellettiere, Jordan Carlson, Andrea LaCroix, Sheri Hartman, Dori Rosenberg, Jingjing Zou, Arun Kumar, and Loki Natarajan\n
 Journal for the Measurement of Physical Behaviour, 2021 | [papers/2021_JMPB_CNN.pdf Paper PDF] and [papers/2021_JMPB_CNN.txt BibTeX] | [https://github.com/ADALabUCSD/DeepPostures Code]

- Cerebro: A Layered Data Platform for Scalable Deep Learning\n
Arun Kumar, Supun Nakandala, Yuhao Zhang, Side Li, Advitya Gemawat, and Kabir Nagrecha\n
CIDR 2021 (Vision paper) | [papers/2021_Cerebro_CIDR.pdf Paper PDF] and [papers/2021_Cerebro_CIDR.txt BibTeX] | [https://www.youtube.com/watch?v=8QfMvdlmdic Talk video]

- Cerebro: A Data System for Optimized Deep Learning Model Selection\n
Supun Nakandala, Yuhao Zhang, and Arun Kumar\n
VLDB 2020 | [papers/2020_Cerebro_VLDB.pdf Paper PDF] and [papers/2020_Cerebro_VLDB.txt BibTeX] | [papers/2020_Cerebro_VLDB_Errata.pdf Errata] | [papers/TR_2020_Cerebro.pdf TechReport]
 | Talk videos: [https://www.youtube.com/watch?v=8PJic5FStGs YouTube] [https://www.bilibili.com/video/av329339128?p=198 Bilibili]
 | [https://adalabucsd.github.io/research-blog/cerebro.html Blog post] | [https://databricks.com/session_na20/resource-efficient-deep-learning-model-selection-on-apache-spark SAIS Talk video]
| [https://adalabucsd.github.io/cerebro-system/ Source code and documentation]

- Enabling and Optimizing Non-linear Feature Interactions in Factorized Linear Algebra\n
Side Li, Lingjiao Chen, and Arun Kumar\n
ACM SIGMOD 2019 | [papers/2019_MorpheusFI_SIGMOD.pdf Paper PDF] and [papers/2019_MorpheusFI_SIGMOD.txt BibTeX] | [https://github.com/liside/MorpheusFI Code and Data on GitHub]

- Tuple-oriented Compression for Large-scale Mini-batch Stochastic Gradient Descent\n
Fengan Li, Lingjiao Chen, Yijing Zeng, Arun Kumar, Jeffrey Naughton, Jignesh Patel, and Xi Wu\n
ACM SIGMOD 2019 | [papers/2019_TOC_SIGMOD.pdf Paper PDF] | [papers/TR_2019_TOC.pdf TechReport] | [https://github.com/fenganli/toc-release-code Code on GitHub]

- Model-based Pricing for Machine Learning in a Data Marketplace\n
Lingjiao Chen, Paraschos Koutris, and Arun Kumar\n
ACM SIGMOD 2019 | [papers/2019_Nimbus_SIGMOD.pdf Paper PDF] | [papers/TR_2018_Nimbus.pdf TechReport] | Code and Data coming soon

- Cerebro: Efficient and Reproducible Model Selection on Deep Learning Systems\n
Supun Nakandala, Yuhao Zhang, and Arun Kumar\n
ACM SIGMOD 2019 DEEM Workshop | [papers/2019_Cerebro_DEEM.pdf Paper PDF] and [papers/2019_Cerebro_DEEM.txt BibTeX] | [papers/TR_2019_Cerebro.pdf TechReport]
 | [https://adalabucsd.github.io/research-blog/cerebro.html Blog post]

- The ML Data Prep Zoo: Towards Semi-Automatic Data Preparation for ML\n
Vraj Shah and Arun Kumar\n
ACM SIGMOD 2019 DEEM Workshop | [papers/2019_DataPrepZoo_DEEM.pdf Paper PDF] and [papers/2019_SortingHat_SIGMOD.txt BibTeX] | [papers/TR_2019_DataPrepZoo.pdf TechReport]
 | [https://adalabucsd.github.io/research-blog/research/2019/06/21/mldataprepzoo.html Blog post]
 | [https://github.com/pvn25/ML-Data-Prep-Zoo Data Prep Zoo Repository on GitHub]

- Demonstration of Nimbus: Model-based Pricing for Machine Learning in a Data Marketplace\n
Lingjiao Chen, Hongyi Wang, Leshang Chen, Paraschos Koutris, and Arun Kumar\n
ACM SIGMOD 2019 Demo | [papers/2019_NimbusDemo_SIGMOD.pdf Paper PDF] | Video coming soon

- A Comparative Evaluation of Systems for Scalable Linear Algebra-based Analytics\n
Anthony Thomas and Arun Kumar\n
VLDB 2018\/2019 | [papers/2019_SLAB_VLDB.pdf Paper PDF] |
[papers/TR_2018_SLAB.pdf TechReport] | [slab.html Code and Data]

- Are Key-Foreign Key Joins Safe to Avoid when Learning High-Capacity Classifiers?\n
Vraj Shah, Arun Kumar, and Xiaojin Zhu\n
VLDB 2018 |
[papers/2018_Hamlet_VLDB.pdf Paper PDF] and [papers/2018_Hamlet_VLDB.txt BibTeX] |
[papers/TR_2017_HamletPlusPlus.pdf TechReport] |
[hamlet.html Code and Data]

- Towards Linear Algebra over Normalized Data\n
Lingjiao Chen, Arun Kumar, Jeffrey Naughton, and Jignesh Patel\n
VLDB 2017 |
[papers/2017_Morpheus_VLDB.pdf Paper PDF] |
[papers/TR_2017_Morpheus.pdf TechReport] |
[morpheus.html Code and Data]

- Model-based Pricing: Do Not Pay for More than What You Learn!\n
Lingjiao Chen, Paraschos Koutris, and Arun Kumar\n
ACM SIGMOD 2017 DEEM Workshop |
[papers/2017_Nimbus_DEEM.pdf Paper PDF]

- Cerebro: A System to Manage Deep Learning for Relational Data Analytics\n
Arun Kumar\n
CIDR 2017 Abstract |
[papers/2017_Cerebro_CIDR.pdf Paper PDF]

- To Join or Not to Join? Thinking Twice about Joins before Feature Selection\n
Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu\n
ACM SIGMOD 2016 |
[papers/2016_Hamlet_SIGMOD.pdf Paper PDF] and [papers/2016_Hamlet_SIGMOD.txt BibTeX] |
[papers/TR_2016_Hamlet.pdf TechReport] |
[hamlet.html Code and Data]

- Model Selection Management Systems: The Next Frontier of Advanced Analytics\n
Arun Kumar, Robert McCann, Jeffrey Naughton, and Jignesh M. Patel\n
ACM SIGMOD Record Dec 2015 Vision Track |
[papers/2015_MSMS_SIGMODRecord.pdf Paper PDF]

=== Technical Reports

- How do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses\n
Vraj Shah, Thomas Parashos, and Arun Kumar\n
Under submission | [papers/TR_2021_CategDedup.pdf TechReport]

- Nautilus: An Optimized System for Deep Transfer Learning over Evolving Training Datasets\n
Supun Nakandala and Arun Kumar\n
Under submission | [papers/TR_2021_Nautilus.pdf TechReport]

- SystemX: A Scalable and Optimized Data System for Large Multi-Model Deep Learning\n
Kabir Nagrecha and Arun Kumar\n
Under submission | [papers/TR_2021_SystemX.pdf TechReport]

- Improving Feature Type Inference Accuracy of TFDV with SortingHat\n
Vraj Shah, Kevin Yang, and Arun Kumar\n
[papers/TR_2020_TFDV.pdf TechReport]

=== Past Projects

~~~
{}{img_left}{images/hamlet.jpg}{}{80px}{}{}
[hamlet.html *Hamlet*]\n
Exploiting database schema information to simplify data sourcing.
~~~

~~~
{}{img_left}{images/nimbus.jpg}{}{80px}{}{}
[nimbus.html *Nimbus*]\n
Enabling the first ML-aware cloud-based commodity market for the new black gold: training data.
~~~

~~~
{}{img_left}{images/slab.jpg}{}{80px}{}{}
[slab.html *SLAB*]\n
The first comprehensive benchmark comparison of scalable linear algebra systems.
~~~