-
Notifications
You must be signed in to change notification settings - Fork 1
Expand file tree
/
Copy pathRDM_IntroRDM.qmd
More file actions
517 lines (332 loc) · 20.3 KB
/
RDM_IntroRDM.qmd
File metadata and controls
517 lines (332 loc) · 20.3 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
---
title: "Introduction to Research Data Management (RMD)"
subtitle: "A primer for researchers"
author:
- name: Daniel Manrique-Castano, Ph.D
email: daniel.manrique-castano@alliancecan.ca
affiliations:
- name: Digital Research Alliance of Canada
date: last-modified
date-format: full
format:
html:
embed-resources: true
mermaid:
theme: forest
revealjs:
footer: "Introduction to Research Data Management (RMD) - Daniel Manrique-Castano, Ph.D"
logo: "images/alliance_logo.png"
theme: white
transition: fade
slide-number: true
show-slide-number: all
preview-links: auto
mermaid:
theme: forest
mermaid-format: svg
filters:
- reveal-auto-agenda
auto-agenda:
bullets: numbered
heading: Agenda
gfm:
mermaid-format: svg
engine: knitr
css: styles.css
bibliography: references.bib
editor: source
---
# Contextualizaing Research Data Management (RDM)
## Why “bother” with RDM?{.center}
[Data]{style="color:green;"} is the [foundation of scientific research]{style="text-decoration: underline;"}. We want to best manage this source for modeling, workflow automation, quality control, and data sharing with our field to produce meaningful scientific work.
::: callout-tip
## The 80/20 rule
Researchers spend [80% ]{style="color:green;"} of their time finding, producing, cleaning, and organizing data.
:::
## We want our research de be reproducile{.center}
::: {style="text-align: center; font-size: 70%"}
{fig-align="center" width="1200" height="500"}
:::
## Some well-known causes
::: {style="text-align: center; font-size: 70%"}
{fig-align="center"}
:::
## Creating reproducible science{.center}
{{< bi file-earmark-text >}} [Document data provenance]{style="color:green;"} (samples, software, instruments, other tools).
{{< bi gear >}} Define [standard procedures and methods]{style="color:green;"} (i.e., how measurements were made).
{{< bi funnel >}} [Quality control]{style="color:green;"} measures (data evaluation, organization and cleaning).
{{< bi file-earmark-text >}} [Comprehensive documentation]{style="color:green;"} to enable understanding and reuse of data.
## Therefore, research needs to move towards{.center}
- [Competent]{style="color:green;"} researchers in [RDM]{style="text-decoration: underline;"} and data analysis.
- [Standardized approaches]{style="color:green;"} to [sharing]{style="text-decoration: underline;"} raw data and analysis code to support research findings.
- Researchers with a [commitment to transparency]{style="color:green;"} and best scientific practice practices to ensure [research integrity]{style="text-decoration: underline;"}.
::: callout-tip
## RDM competences
Think of RDM competencies as a matter of professional responsibility, ethical conduct, and commitment to best scientific practices.
:::
## Tri-Agency Research Data Management Policy{.center}
The [Goverment of Canada](https://science.gc.ca/site/science/en/interagency-research-funding/policies-and-guidelines/research-data-management) promotes RDM in its [Tri-Agency Research Data Management Policy](https://science.gc.ca/site/science/en/interagency-research-funding/policies-and-guidelines/research-data-management/tri-agency-research-data-management-policy-frequently-asked-questions).
Through its federal funding agencies, the the Government of Canada seeks to implement [data management plans (DMPs)]{style="text-decoration: underline;"} and [sharing]{style="text-decoration: underline;"} of research data to maximize the benefits to society.
## We can define RDM as: {.center}
[RDM]{style="color:green;"} encompasses the organization/handling, storage, sharing and preservation of research data, including physical objects/files, documents, images, audio, digital databases, etc.
# Understanding RDM
## Where does data end up?{.center}
::::: {layout-ncol="4"}
:::: {#first-column}
::: {style="text-align: center;font-size: 60%"}
{fig-align="center" width="100%"}
:::
::::
:::: {#second-column}
::: {style="text-align: center;font-size: 60%"}
{fig-align="center" width="100%"}
:::
::::
:::: {#third-column}
::: {style="text-align: center;font-size: 60%"}
{fig-align="center" width="100%"}
:::
::::
:::: {#forth-column}
::: {style="text-align: center;font-size: 60%"}
{fig-align="center" width="100%"}
:::
::::
:::::
## Considerations for storing data{.center}
{{< bi clock >}} In most cases, data must be retained for [10 years]{style="color:green;"}.
{{< bi safe >}} What do we need to [store]{style="color:green;"}?
{{< bi file-earmark-text >}} What [data formats]{style="color:green;"} provide access?
{{< bi gear >}} What [technical requirements]{style="color:green;"} do I need?
{{< bi unlock >}} How [secure]{style="color:green;"} is the data and how can I access it?
::: callout-tip
These are all aspects of good scientific practice!
:::
## What research data we need/could store?{.center}
{{< bi thermometer-snow >}} [Experimental samples]{style="color:green;"} (physical samples and files, and the methods for generating them).
{{< bi robot >}} [Scientific instruments]{style="color:green;"} (model, version, setup).
{{< bi journals >}} [Experimental data]{style="color:green;"} (raw and processed data, plots, tables and figures).
{{< bi code >}} [Code and analysis pipelines]{style="color:green;"} (processing and analysis workflows, software and libraries, machine learning classifiers, statistical models).
## {.center}
::: {style="text-align: center; font-size: 70%"}
{fig-align="center" width="700" height="350"}
:::
## {.center}
::: {style="text-align: left;font-size: 80%"}
{{< bi search>}} [**Findable:**]{style="color:green;"} Data with [identifiers and metadata]{style="text-decoration: underline;"} that facilitate its discovery.
{{< bi unlock>}} [**Accessible:**]{style="color:orange;"} To the community and research under clear [permissions]{style="text-decoration: underline;"} (i.e [Creative Commons Licenses](https://creativecommons.org/share-your-work/cclicenses/)).
{{< bi upc-scan>}} [**Interoperable:**]{style="color:blue;"} Data conforming to [specific standards]{style="text-decoration: underline;"} (i.e. standardized vocabularies) and [formats]{style="text-decoration: underline;"} (i.e. CSV, TIFF).
{{< bi recycle>}} [**Reusable:**]{style="color:magenta;"} Data [sufficiently described]{style="text-decoration: underline;"} (and licensed) to allow reuse by interested parties.
:::
::: callout-tip
Additional information and guidelines can be found [here](https://www.go-fair.org/fair-principles/)
:::
# Research project life cycle
## Research project life cycle{.center}
::: {style="text-align: center; font-size: 70%"}
{fig-align="center" width="600" height="400"}
:::
## Planning phase{.center}
:::: {style="text-align: left;font-size: 70%"}
Planning includes [clear perspectives]{style="color:green;"} on how researchers will [manage, store, and share]{style="text-decoration: underline;"} their data. This step may include:
{{< bi clipboard-data>}} **Identify data needs:** Determine what [type of data]{style="text-decoration: underline;"} is needed or available to answer the research question.
{{< bi file-earmark-text>}} **Create a Data Management Plan (DMP):** Outline how data will be [collected, documented, and preserved]{style="text-decoration: underline;"}, file formats, appropriate [metadata standards]{style="text-decoration: underline;"}, and long-term data [sharing and preservation]{style="text-decoration: underline;"}.
::: callout-tip
## DMP assistants
The [DMP Assistant](https://alliancecan.ca/en/services/research-data-management/dmp-assistant) from the [Digital Research Alliance of Canada]{style="color:orange;"} is an example of a DMP tool.
:::
{{< bi file-person>}} **Ethical considerations:** Address any ethical requirements related to the data [privacy, consent, and security]{style="text-decoration: underline;"}.
::::
## Collection phase{.center}
:::: {style="text-align: left;font-size: 80%"}
Researchers focus on [collecting research data]{style="text-decoration: underline;"} from:
- **Experiments:** laboratory procedures or clinical trials.
- **Fieldwork:** population surveys or environmental measurements.
- **Existing datasets:** secondary data from repositories or historical archives).
::: callout-tip
## Consider:
- Standards for data collection.
- Technical elements (what data is stored and how).
- Adequate documentation and relevant metadata.
:::
::::
## Storing phase {.center}
:::: {style="text-align: left;font-size: 80%"}
Properly store research data to prevent loss, corruption, or unauthorized access. Some [best practices]{style="color:green;"} include:
{{< bi database>}} **Systematic backups:** Having [multiple copies]{style="color:green;"} of the data in multiple physical and digital locations.
{{< bi file-lock2>}} **Security:** Sensitive data must be [protected]{style="color:green;"} from unauthorized access by encrypting files or restricting access to authorized personnel.
{{< bi bar-chart-steps>}} **File organization:** Appropriate and consistent [naming conventions]{style="color:green;"} to facilitate the location and understanding of records.
::::
::: callout-tip
## Active RDM platforms
Consider using active data management platforms suah as the [Open Science Framework](https://osf.io/) (OSF) or [OpenBis](https://openbis.ch/).
:::
## {.center}
::::: {layout-ncol="2" layout-valign="center"}
:::: {#first-column}
:::{style="text-align: left;font-size: 80%"}
For example, [TIER 4.0](https://www.projecttier.org/tier-protocol/protocol-4-0/root/) is [systemic template]{style="color:green;"} to standardize an appropriate folder structure for research data.
You can download the template [here](https://github.com/Alliance-RDM-GDR/RDM_IntroRDM/tree/main/resources).
:::
::::
:::: {#second-column}
::: {style="text-align: center;font-size: 50%"}
{width="35%"}
:::
::::
:::::
## Analyze phase{.center}
:::: {style="text-align: left;font-size: 80%"}
Analyzing data to draw scientific conclusions. This may include:
{{< bi clipboard-data>}} **Data cleaning:** Using [reproducible workflows]{style="color:green;"} to handle and clean data, removing possible errors, duplicates, and inconsistencies.
{{< bi code-square>}} **Statistical analysis:** Use theoretically motivated [statistical models]{style="color:green;"} to make scientific inference and identify patterns or relationships in the data.
{{< bi image>}} **Visualization:** Create visualizations to [present scientific results]{style="color:green;"}.
::::
::: callout-tip
## Analysis notebooks
Approximations using open source [Jupyter](https://jupyter.org/) (Python) or [Quarto](https://quarto.org/) (R and Python) notebooks are recommended for replication purposes.
:::
## Partners to handle analysis workflows{.center}
::::: {layout-ncol="2" layout-valign="center"}
:::: {#first-column}
### {{< bi code-square>}} R-studio/Quarto (R + Python)
::: {style="text-align: center;font-size: 100%"}
{fig-align="center" width="500" height="300"}
:::
::::
:::: {#second-column}
### {{< bi github>}} GitHub (Version control)
::: {style="text-align: center;font-size: 100%"}
{fig-align="center" width="500" height="300"}
:::
::::
:::::
## With R-studio (R and Python) you can {.center}
::::: {layout-ncol="2" layout-valign="center"}
:::: {#first-column}
### {{< bi code-square>}} R-studio/Quarto (R + Python)
::: {style="text-align: center;font-size: 100%"}
{fig-align="center" width="500" height="300"}
:::
::::
:::: {#second-column}
:::{style="text-align: left;font-size: 75%"}
{{< bi file-earmark-spreadsheet-fill>}} Handle [data tables and variables](https://github.com/elalilab/Stroke_PDGFR-B_Reactivity/blob/main/Widefield_10x_ROIs_CD31-Pdgfrb_Coloc.qmd) using the R [Tidyverse](https://www.tidyverse.org/).
{{< bi images>}} Analyze [images](https://github.com/elalilab/Stroke_PDGFR-B_Reactivity/blob/main/Confocal_40x_ROIs_Pdgfrb_Morpho_BatchScript.qmd) using Python [skimage](https://scikit-image.org/).
{{< bi wallet-fill>}} Process [Flow cytometry](https://github.com/elalilab/Stroke_PDGFR-B_Reactivity/blob/main/FlowCytometry_Pdgfrb_Processing.qmd) files/data using R [FlowCore](https://bioconductor.org/packages/release/bioc/html/flowCore.html) from [BioConductor](https://bioconductor.org/).
{{< bi bandaid>}} Analyze [RNS-seq](https://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) data using R [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) from [BioConductor](https://bioconductor.org/).
{{< bi bar-chart-line-fill>}} Do state-of-the-art [statistical modeling](https://github.com/elalilab/Stroke_PDGFR-B_Reactivity/blob/main/Widefield_5x_Ipsilateral_Gfap_IntDen.qmd) using [brms](https://paulbuerkner.com/brms/).
And anything else you can think of...
:::
::::
:::::
## Description phase{.center}
::: {style="text-align: left;font-size: 80%"}
Document the [content of the dataset]{style="text-decoration: underline;"} to make it easier to [understand and reuse]{style="text-decoration: underline;"}. This comprises the generation of a readme file containing [Metadata]{style="color:green;"} schemes (information about the data).
:::
::::: {layout-ncol="2" layout-valign="center"}
:::: {#first-column}
::: {style="text-align: center; font-size: 50%"}
{fig-align="center" width="700" height="300"}
:::
::::
:::: {#second-column}
::: {style="font-size: 70%;"}
There are templates/resources to guide the generation of readme files:
- [Creating a README file](https://blog.datadryad.org/2023/10/18/for-authors-creating-a-readme-for-rapid-data-publication/)
- FRDR readme file [template](https://www.frdr-dfdr.ca/docs/txt/README.txt).
- [Readme.so](https://readme.so/)
- [Readme.ai](https://readme-ai.streamlit.app/)
:::
::::
:::::
## Contents of a readme file{.center}
:::{.r-fit-text}
In general, a dataset readme file shows:
{{< bi database-fill >}} [Dataset identifier]{style="color:green;"} showing information such as title, authors, data collection date, geographic information.
{{< bi folder-fill >}} A [map of files/folders]{style="color:green;"} defining the contents and hierarchy of folders and subfolders, along with naming conventions.
{{< bi gear-fill >}} [Methodological information]{style="color:green;"} presenting methods for data collection/generation, analysis, and experimental conditions.
{{< bi sd-card-fill >}} A set of [instructions and software]{style="color:green;"} for opening, handling and reproducing research and analysis workflows.
{{< bi credit-card-2-front-fill >}} [Sharing and access information]{style="color:green;"} detailing permissions and terms of use.
:::
::: {.callout-caution collapse="true"}
A dataset is a [standalone object]{style="text-decoration: underline;"}. Methodological information [MUST NOT]{style="color:red;"} be relegated to associated research articles.
:::
## Archiving and Sharing Phase{.center}
::: {style="text-align: left;font-size: 90%"}
Archiving and sharing research data involves its long-term preservation in trusted data repositories, ensuring that the data are:
{{< bi search>}} **Accessible:** Data is [findable and accessible]{style="color:green;"} for reuse.
{{< bi database>}} **Preserved:** To [prevent data loss]{style="color:green;"} due to software or hardware obsolescence. Repositories have systematic process to ensure that data is always accessible.
{{< bi safe2>}} **Licensed:** to define how others can [use the data]{style="color:green;"}.
{{< bi link-45deg>}} **Cited:** with a [DOI]{style="color:green;"} or other associated unique identifier.
:::
## We can do better science{.center}
::::: {layout-ncol="2" layout-valign="center"}
:::: {#first-column}
::: {style="text-align: left; font-size: 70%"}
**Data availability statement**
"The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation."
:::
::::
:::: {#second-column}
::: {style="text-align: center; font-size: 70%"}
{fig-align="center" width="600" height="100"}
::::
:::::
::::::
::: callout-tip
We need to be more aware of our responsibilities and duties as scientists!
:::
## Sharing data (in repositories){.center}
::: {style="text-align: left;font-size: 80%"}
When you share data, [make sure]{style="color:green;"} that it meets these characteristics:
{{< bi diagram-2-fill >}} [Folders and files]{style="color:green;"} are organized in a [structured way]{style="text-decoration: underline;"}: Use [standardized file formats]{style="color:green;"} (e.g., CSV, TIFF) and check for consistency in [naming conventions]{style="color:green;"}.
{{< bi file-earmark-text-fill >}} [The metadata/readme]{style="color:green;"} allow the dataset to be understood as a standalone object, providing [collection methods, processing steps, and relevant context]{style="text-decoration: underline;"}.
{{< bi code-square >}} It is [desirable]{style="color:green;"} that it contains reproducible workflows used to process and generate the [research results]{style="text-decoration: underline;"}.
:::
## {.center}
Be aware that datasets are research elements that [serve the public and the scientific community]{style="color:green;"}, and that can be used (and cited) [independently]{style="color:green;"} of a research article.
::: callout-tip
## Why not?
Think of research articles as additions to your dataset!
:::
## Reuse
::: {style="text-align: left;font-size: 70%"}
Researchers and the public can reuse data to:
- **Avoid unnecessary or costly experiments** by using previous research results.
- **Validate research findings:** [Independent verification]{style="color:green;"} of scientific results and conclusions (by replicating research workflows).
- **Repurpose data:** Use the data for [new research questions]{style="color:green;"} or in combination with other datasets. They are also extremely valuable as [educational resources]{style="color:green;"}.
- **Build upon previous work:** to [accelerate scientific discovery]{style="color:green;"} and meta-analysis by avoiding duplication of efforts or reliance on [irreproducible research]{style="color:red;"}.
:::
::: callout-tip
## Discovering services
Use portals like [FRDR](https://www.frdr-dfdr.ca/repo/) or [Lunaris](https://www.lunaris.ca/) to discover and reuse research data.
:::
# Data as the new oil
## Data is a valuable resource{.center}
{{< bi columns >}} Simulation-based approaches, machine learning and artificial intelligence.
{{< bi gear >}} Build workflows based on existing data and pipelines.
{{< bi share-fill >}} Share and reuse data with the scientific community to build collaborations.
## How do we get there?
- [FAIR]{style="color:green;"} data: Making our data useful to others.
- Adequate and comprehensive [documentation]{style="color:green;"} that contextualizes and describes our data.
- Sharing [reproducible workflows]{style="color:green;"} that ensure the reproducibility of results (protocols, open file formats, analysis workflows).
- Incorporate an [open science culture]{style="color:green;"} into research environments, focusing on transparent reporting of research results and open data sharing.
## Find more supporting material{.center}
- FRDR [guide](https://www.frdr-dfdr.ca/docs/en/depositing_data/) for deposit research data.
- Guidance on depositing existing data in [public repositories](https://ethics.gc.ca/eng/depositing_depots.html)
- [RDMkit](https://rdmkit.elixir-europe.org/sharing)
::::: {layout-ncol="2" layout-valign="center"}
::::{#first-column}
::: {style="text-align: center;font-size: 50%"}
{width="90%"}
:::
::::
::::{#second-column}
::: {style="text-align: left;font-size: 80%"}
Visit us at
https://www.frdr-dfdr.ca/repo/
or [contact us](https://www.frdr-dfdr.ca/repo/contactus)
:::
::::
:::::