Using SQL with a DataFrame.html
733 lines (567 loc) · 172 KB
<!DOCTYPE html>
<html>
<head>
<meta name="databricks-html-version" content="1">
<title>Using SQL with a DataFrame - Databricks</title>
<meta charset="utf-8">
<meta name="google" content="notranslate">
<meta name="robots" content="nofollow">
<meta http-equiv="Content-Language" content="en">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link rel="stylesheet"
href="https://fonts.googleapis.com/css?family=Source+Code+Pro:400,700">
<!--
// Databricks notebook source
// MAGIC %md
// MAGIC <div style="text-align: center; line-height: 0; padding-top: 9px;">
// MAGIC <img src="https://cdn2.hubspot.net/hubfs/438089/docs/training/dblearning-banner.png" alt="Databricks Learning" width="555" height="64">
// MAGIC </div>
// COMMAND ----------
// MAGIC %md
// MAGIC # Notes to Brian
// MAGIC Mark "Read:" once read. Or "Agreed:" or "Let's Discuss:" or any other tag.
// MAGIC
// MAGIC ## Comments on notebook
// MAGIC * We are going to have a major problem hyperlinking between notebooks in a portable way.
// MAGIC * **(Doug)** This is a real challenge we'll have to address.
// MAGIC * See https://databricks.atlassian.net/browse/FEATURE-1590
// MAGIC * The workaround, for now, is to simply tell people where to find the next notebook or challenge exercise.
// MAGIC * Updated the format for Prerequisites
// MAGIC * (Rejected) Search for `val ssaNamesDF = 0 // <<FILL_IN>>` and decide if it's acceptable. It makes it possible to run_all.
// MAGIC * (Brian) Why is "run all" a priority? You can always run the instructor notebook, which has only the answers. I don't see the value to "run all" in the student notebook. Is it because we want "run all" in the master notebook? Personally, I don't see that as a high priority.
// MAGIC * **(Doug)** I agree. If we make it easy to compile and upload notebooks (that script we discussed) then I'm game for not making run-all a priority. It is nice though.
// MAGIC * (Rejected) This notebook can now be detached and reattached and run_all to make sure all tests pass. But at the cost listed in the bullet above.
// MAGIC * (Brian) Again, that can just be done with the instructor or answers notebook, too.
// MAGIC * **(Doug)** Agreed. As long as we have nice tooling to ensure everything runs, I'm happy.
// MAGIC * (Agreed) Notice the added learning objective `%sql`. Does it make sense to include in this notebook?
// MAGIC * (Agreed) Notice the new Review section at the end that recaps the main takeaways from the learning objectives.
// MAGIC * If you can think of any further reading to add at the end (perhaps links to API doc and Databricks guide) let's prototype how that section should look.
// MAGIC * **(Doug)** Can you think of any further reading for your lesson?
// MAGIC * **(Brian)** Yes. Working on it.
// MAGIC * For videos, let's use `// VIDEO "URL"`, and then we have flexibility for how to mark it up later when they add `%html`. (Actually instead of `%html` we'll get `%md-unsanitized`.)
// MAGIC * (Brian) That requires tooling updates to remove the `//VIDEO "URL"` comment for now.
// MAGIC * **(Doug)** Fair point. Which will take longer: editing a bunch of notebooks when we have a style change or editing the tool?
// MAGIC * **(Doug)** You've expressed desire to redo the tool in Scala. Would that take a week? We can probably do it after our HP delivery.
// MAGIC * (Agreed) At some point we'll need to decide how to do "verbose" version for eLearning and "concise" version for the teacher delivering instructor lead training.
// MAGIC * (Brian) This comment needs to be recorded some place more permanent.
// MAGIC * **(Doug)** Agreed. Want to create a doc somewhere in drive under /curriculum? If you create the doc then you'll know where to find it and can own it.
// MAGIC * (Agreed) For challenge exercises, we'll include them as links at the end. Either in a "Challenges" section or in the "Further Reading" section. Challenges in such a case should actually be as real world related as possible.
// MAGIC * (Brian) Maybe a better approach is to include it in the "Further reading" section, as a link to a separate "Challenge" notebook.
// MAGIC * **(Doug)** I like that solution.
// MAGIC * (Rejected) It would be nice if there was a way to add a table of contents to the top. This doesn't seem possible.
// MAGIC * (Brian) `%html` would be nice, too. As I told Jacob today, though, I think content is a higher priority than slick presentation-related additions. It's theoretically possible to modify the build tool to generate a TOC, _but_ URL fragments (\#header, for instance) don't seem to work very well in the notebook environment. (I tried it).
// MAGIC * **(Doug)** Agreed, we're going to have to let go of this request, at least for now.
// MAGIC * (Rejected) Notice at the end of the challenge Doug prototyped a different method of doing tests, using `str(...)="..."` thinking it might be easier for less knowledgeable people to understand. It may not actually be any better. Doug is curious what is Brian's opinion.
// MAGIC * (Brian) My view of the asserts is that they're opaque. Students don't need to understand them. They just need to trust that they accurately test the solution. And using a `toString` approach worries me, as the string representation (of a Row, or any other data structure) can change, from release to release, without invalidating any working code. Testing a `toString` representation makes the notebooks more fragile.
// MAGIC * **(Doug)** Agreed.
// MAGIC * Made the tag at the top "STANDARD_HEADER" since that name is more generic. However it breaks the tool. So we need to decide if it really is better and worth updating the tool.
// MAGIC * (Brian) I don't like `STANDARD_HEADER`. That sounds, to me, too much like, "Oh, Databricks has a standard header?" I prefer having the word "training" in there somewhere, hence `TRAINING_HEADING`
// MAGIC * **(Doug)** What about when a MOOC or 3rd party university wants to include a header?
// MAGIC * **(Doug)** I don't feel strongly either way. Just asking the question. You're right `STANDARD` isn't the right word. Maybe `MODULE_HEADER` or `NOTEBOOK_HEADER`?
// MAGIC * **(Brian)** I prefer "heading" to "header", but a more generic term is probably better. `NOTEBOOK_HEADER` works for me. I can change the build tooling easily enough, once we agree on the term we want to use.
// MAGIC * (Agreed) It would be nice to add a copyright notice to the NOTEBOOK_HEADER too. (C) Databricks, Inc 2017, all rights reserved.
// MAGIC * (Brian) I'm fine with the copyright.
// MAGIC
// MAGIC ## Comments on the video
// MAGIC * We can ask Val to provide a standard "title sequence" at the start that labels it Databricks. I think we do that for Summit videos and Blog videos.
// MAGIC * (Brian) That's fine.
// MAGIC * Do we want in the video to show people where to find the notebook? Probably doesn't make sense because they're already in the notebook if watching the video. Although they could in theory be watching the video from a webpage too.
// MAGIC * (Brian) If we embed the video _outside_ the notebook, the URL should be in that page. That's how the MOOCs worked.
// MAGIC * **(Doug)** Agreed.
// MAGIC * When recording video, set the computer screen resolution to 1024x640. Not everyone has a high-res monitor (India, etc.), and it can be hard to see even if full-screen. Full screen is possible once they add `%md-unsanitized`. We could alternatively include a hyperlink for people to watch the video in another tab instead of embedding it. I also noticed that even when I make the window bigger, it doesn't actually make the video bigger.
// MAGIC * (Brian) This should be recorded somewhere.
// MAGIC * **(Doug)** Agreed. As style owner, can you create a style guide?
// MAGIC * Apparently Vimeo has features for tracking users that Youtube doesn't have. Let's plan to check that out during phase 2 of the project.
// MAGIC * **(Doug)** In the video and lesson we make reference to the DataFrame API before the SQL API. We should discuss which we want to teach first... SQL or the DataFrame query API.
// MAGIC * **(Doug)** And this raises a really important point... we need to agree and document really clearly what can be assumed going into a lesson if we're going to create these lessons in parallel.
// MAGIC * **(Doug)** In the video we also show how to createTable and why it is bad. Let's discuss as a style issue how much tangential stuff we want to include. I'm on the fence.
// MAGIC
// MAGIC ## Comments on notebook style
// MAGIC * Style guide (Brian owns style, Doug is just making suggestions that Brian can reject.)
// MAGIC * For markdown cells use `--` for comments. [Looks better than `##` and `//` (`##` is a markdown macro).] **(Brian) This is inconsistent with most of our other notebooks. Personally, I find them _both_ ugly, and I have no real preference.**
// MAGIC * For `// TODO` include a space between // and TODO
// MAGIC * Have `%scala` at the top of scala only cells, even if it's the notebook type.
// MAGIC * Put `<<FILL_IN>>` in comments to keep the notebook runnable. **(Brian) I'm not sure how practical this is. `x = # <<FILL_IN>>` is still not runnable.**
// MAGIC * Alternate `%scala` and `%python` cells, so that each cell has 2 versions side by side.
// MAGIC * Format code in `%md` with `code`
// MAGIC * For asserts in Scala, prefer toList to toSeq as it's easier for lay people to understand. The performance difference is minor: after a collect() there's an array under the hood either way, whether or not the expected value is an explicitly defined Seq.
// MAGIC * Later, for contractors, we can provide a sample lesson notebook that teaches the style best practices. For now let's just keep a doc somewhere that is our style guide.
// COMMAND ----------
// MAGIC %md
// MAGIC # Using SQL with a DataFrame
// MAGIC
// MAGIC Apache Spark allows you to use SQL or the programmatic DataFrame API to construct your queries.
// MAGIC
// MAGIC ## In this lesson you:
// MAGIC * Query an existing DataFrame using Spark SQL.
// MAGIC * Use the Databricks %sql feature.
// MAGIC
// MAGIC ## Audience
// MAGIC * General audiences, including Analysts, Engineers, and Data Scientists
// MAGIC
// MAGIC ## Prerequisites
// MAGIC * Lesson: Introduction to Scala or Python [link] (Let's point to an online tutorial instead.)
// MAGIC * Lesson: Introduction to DataFrames [link]
// MAGIC * Concept: The Databricks built-in `display(...)` function [link]
// MAGIC * Concept: The Spark `dataframe.printSchema()` function [link]
// COMMAND ----------
// MAGIC %md Please run the following cell by placing your cursor inside the cell and typing Ctrl-Enter or Shift-Enter. Doing so will display an embedded video. Watch the video and follow along in this notebook.
// COMMAND ----------
// MAGIC %md
// MAGIC ## Lesson
// MAGIC Watch the following video.<br/><br/>
// MAGIC
// MAGIC <a href="https://www.youtube-nocookie.com/embed/836_VC_14rQ?rel=0" target="video" style="display: flex; flex-flow: column; justify-content: center; align-items: center; width: 432px; height: 243px; background: black; color: white; text-decoration: none; font-weight: bold; border-radius: 2px;">
// MAGIC <img src="https://cdn2.hubspot.net/hubfs/438089/docs/training/notebook-dblearning-thumbnail.png" alt="Play Video" width="290" height="92" style="margin-bottom: 8px;">
// MAGIC Using SQL with DataFrames
// MAGIC </a>
// COMMAND ----------
// MAGIC %scala
// MAGIC val peopleDF = spark.read.parquet("dbfs:/mnt/training/dataframes/people-10m.parquet")
// COMMAND ----------
// MAGIC %python
// MAGIC people_df = spark.read.parquet("dbfs:/mnt/training/dataframes/people-10m.parquet")
// COMMAND ----------
// MAGIC %md
// MAGIC We can easily query this DataFrame using the programmatic DataFrame API. For instance, the following query selects people who were born in 1990 and whose first names start with "A".
// COMMAND ----------
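For readers new to SQL's pattern syntax: in `LIKE 'A%'` the `%` is a wildcard, so the pattern matches any string beginning with "A". The semantics can be checked without a Spark cluster using Python's standard-library sqlite3, a stand-in engine used here purely for illustration (the table and names below are made up):

```python
import sqlite3

# In-memory database with a tiny invented table of first names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (firstName TEXT)")
conn.executemany("INSERT INTO names VALUES (?)",
                 [("Amanda",), ("Aaron",), ("Brian",)])

# LIKE 'A%' keeps only names that start with "A".
matches = [row[0] for row in conn.execute(
    "SELECT firstName FROM names WHERE firstName LIKE 'A%' ORDER BY firstName")]
print(matches)  # ['Aaron', 'Amanda']
```

The same pattern string works unchanged in Spark SQL and in the DataFrame API's `like(...)` method.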
// MAGIC %scala
// MAGIC import org.apache.spark.sql.functions._
// MAGIC
// MAGIC display(
// MAGIC peopleDF
// MAGIC .select($"firstName", $"lastName", $"birthDate")
// MAGIC .filter(year($"birthDate") === 1990)
// MAGIC .filter($"firstName" like "A%")
// MAGIC .orderBy("firstName", "lastName")
// MAGIC )
// COMMAND ----------
// MAGIC %python
// MAGIC from pyspark.sql.functions import *
// MAGIC
// MAGIC display(
// MAGIC people_df
// MAGIC .select(col("firstName"), col("lastName"), col("birthDate"))
// MAGIC .filter(year(col("birthDate")) == 1990)
// MAGIC .filter(col("firstName").like("A%"))
// MAGIC .orderBy("firstName", "lastName")
// MAGIC )
// COMMAND ----------
// MAGIC %md
// MAGIC But, what if we want to issue the same query in SQL?
// MAGIC
// MAGIC First, we need to give the DataFrame a table name. We _could_ simply save the DataFrame as a Hive table, using
// MAGIC `df.write.saveAsTable("people")`, but doing so incurs an output penalty: it writes the _entire_ DataFrame to the
// MAGIC distributed file system.
// MAGIC
// MAGIC Wouldn't it be easier if we could just query the _in-memory_ DataFrame as if it were a table?
// MAGIC
// MAGIC We can do exactly that, using `createOrReplaceTempView()`.
// COMMAND ----------
// MAGIC %scala
// MAGIC peopleDF.createOrReplaceTempView("people")
// COMMAND ----------
// MAGIC %python
// MAGIC people_df.createOrReplaceTempView("people")
// COMMAND ----------
// MAGIC %md
// MAGIC If we just want to display the output of our query, we can use the special Databricks `%sql` cell:
// COMMAND ----------
// MAGIC %sql SELECT firstName, lastName, birthDate FROM people WHERE year(birthDate) = 1990 AND firstName LIKE 'A%' ORDER BY firstName, lastName
// MAGIC -- ALL_NOTEBOOKS
// COMMAND ----------
// MAGIC %md
// MAGIC The temporary view is just a table alias, a _name_ that refers to the in-memory DataFrame. It is not backed by a real, on-disk table. The name lasts until the SparkSession goes away.
// MAGIC
// MAGIC We can also save the results of our query in a _new_ DataFrame, in case we want to do additional queries against it.
// COMMAND ----------
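The point that a view stores a query rather than a copy of the data holds in any SQL engine, not just Spark. Here is a minimal sketch using Python's standard-library sqlite3, chosen only so the idea can be run without a cluster (the table and rows are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (firstName TEXT, birthYear INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Alice", 1990), ("Amanda", 1990), ("Bob", 1985)])

# A view is just a named query: creating it copies no rows and writes nothing
# to disk, exactly analogous to createOrReplaceTempView() on a DataFrame.
conn.execute("CREATE TEMP VIEW people_1990 AS "
             "SELECT firstName FROM people WHERE birthYear = 1990")

rows = [row[0] for row in conn.execute(
    "SELECT firstName FROM people_1990 ORDER BY firstName")]
print(rows)  # ['Alice', 'Amanda']
```

As with Spark's temp views, the view name here lives only as long as the session (connection) that created it.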
// MAGIC %scala
// MAGIC val peopleDF2 = spark.sql("SELECT firstName, lastName, birthDate FROM people WHERE year(birthDate) = 1990 AND firstName LIKE 'A%' ORDER BY firstName, lastName")
// COMMAND ----------
// MAGIC %scala
// MAGIC display(
// MAGIC peopleDF2.filter($"lastName" like "Z%")
// MAGIC )
// COMMAND ----------
// MAGIC %python
// MAGIC people_df2 = spark.sql("SELECT firstName, lastName, birthDate FROM people WHERE year(birthDate) = 1990 AND firstName LIKE 'A%' ORDER BY firstName, lastName")
// COMMAND ----------
// MAGIC %python
// MAGIC display(
// MAGIC people_df2.filter(col("lastName").like("Z%"))
// MAGIC )
// COMMAND ----------
// MAGIC %md
// MAGIC <a href="foo"></a>
// MAGIC ## Exercise 1
// MAGIC
// MAGIC Given the following DataFrame, use SQL to find all zip codes for Santa Monica, California. Store the result in a DataFrame called `smzips`. The DataFrame should contain only the `zipCode` column.
// COMMAND ----------
// MAGIC %scala
// MAGIC val zips = spark.read.json("dbfs:/mnt/training/zips.json").withColumnRenamed("_id", "zipCode")
// COMMAND ----------
// MAGIC %python
// MAGIC zips = spark.read.json("dbfs:/mnt/training/zips.json").withColumnRenamed("_id", "zipCode")
// COMMAND ----------
// MAGIC %md
// MAGIC -- SCALA_ONLY
// MAGIC
// MAGIC **Hints**
// MAGIC
// MAGIC * You'll need to examine the data first.
// MAGIC * You'll need to use some of the functions in [org.apache.spark.sql.functions](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$)
// COMMAND ----------
// MAGIC %md
// MAGIC -- PYTHON_ONLY
// MAGIC
// MAGIC
// MAGIC **Hints**
// MAGIC
// MAGIC * You'll need to examine the data first.
// MAGIC * You'll need to use some of the functions in [pyspark.sql.functions](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions)
// COMMAND ----------
// MAGIC %scala
// MAGIC // TODO
// MAGIC //
// MAGIC // Replace <<FILL_IN>> with your code.
// MAGIC zips.<<FILL_IN>>
// MAGIC val smzips = <<FILL_IN>>
// COMMAND ----------
// MAGIC %python
// MAGIC # TODO
// MAGIC #
// MAGIC # Replace <<FILL_IN>> with your code.
// MAGIC zips.<<FILL_IN>>
// MAGIC smzips = <<FILL_IN>>
// COMMAND ----------
// MAGIC %scala
// MAGIC // ANSWER
// MAGIC zips.createOrReplaceTempView("zips")
// MAGIC val smzips = spark.sql("""SELECT zipCode FROM zips WHERE state = 'CA' and city = 'SANTA MONICA'""")
// COMMAND ----------
// MAGIC %python
// MAGIC # ANSWER
// MAGIC zips.createOrReplaceTempView("zips")
// MAGIC smzips = spark.sql("""SELECT zipCode FROM zips WHERE state = 'CA' and city = 'SANTA MONICA'""")
// COMMAND ----------
// MAGIC %md
// MAGIC Run the following cell to test your answer.
// COMMAND ----------
// MAGIC %scala
// MAGIC // TEST
// MAGIC // Run this test to verify you are on the right track.
// MAGIC assert(smzips.as[String].collect().toSet == Set("90401", "90402", "90403", "90404", "90405"))
// COMMAND ----------
// MAGIC %python
// MAGIC # TEST
// MAGIC # Run this test to verify you are on the right track.
// MAGIC
// MAGIC assert(set([row.zipCode for row in smzips.collect()]) == {u"90401", u"90402", u"90403", u"90404", u"90405"})
// COMMAND ----------
// MAGIC %md
// MAGIC ## Exercise 2
// MAGIC
// MAGIC Load the following file into a DataFrame. The file consists of data from the U.S. Social Security Administration. Each row contains:
// MAGIC
// MAGIC * a first name
// MAGIC * a birth year
// MAGIC * a count of the number of babies born in that year with that first name
// MAGIC
// MAGIC Once you've loaded it into a DataFrame, use SQL (not the DataFrame API) to find the 5 most popular girls' first names in 1990.
// COMMAND ----------
// MAGIC %scala
// MAGIC // TODO
// MAGIC val ssaNamesFile = "dbfs:/mnt/training/ssn/names.parquet"
// MAGIC val ssaNamesDF = 0 // <<FILL_IN>>
// COMMAND ----------
// MAGIC %python
// MAGIC # TODO
// MAGIC ssaNamesFile = "dbfs:/mnt/training/ssn/names.parquet"
// MAGIC ssaNamesDF = 0 # <<FILL_IN>>
// COMMAND ----------
// MAGIC %scala
// MAGIC // ANSWER
// MAGIC val ssaNamesFile = "dbfs:/mnt/training/ssn/names.parquet"
// MAGIC val ssaNamesDF = spark.read.parquet(ssaNamesFile)
// COMMAND ----------
// MAGIC %python
// MAGIC # ANSWER
// MAGIC ssaNamesFile = "dbfs:/mnt/training/ssn/names.parquet"
// MAGIC ssaNamesDF = spark.read.parquet(ssaNamesFile)
// COMMAND ----------
// MAGIC %scala
// MAGIC // TEST
// MAGIC // Run this test to verify you are on the right track.
// MAGIC assert(
// MAGIC ssaNamesDF.as[(String, String, Int, Int)].limit(5).collect.toList ==
// MAGIC List(("Jennifer","F",54336,1983), ("Jessica","F",45278,1983), ("Amanda","F",33752,1983), ("Ashley","F",33292,1983), ("Sarah","F",27228,1983))
// MAGIC )
// COMMAND ----------
// MAGIC %python
// MAGIC # TEST
// MAGIC # Run this test to verify you are on the right track.
// MAGIC assert(
// MAGIC [tuple(row) for row in ssaNamesDF.limit(5).collect()] ==
// MAGIC [(u'Jennifer', u'F', 54336, 1983), (u'Jessica', u'F', 45278, 1983), (u'Amanda', u'F', 33752, 1983), (u'Ashley', u'F', 33292, 1983), (u'Sarah', u'F', 27228, 1983)]
// MAGIC )
// COMMAND ----------
// MAGIC %md
// MAGIC **Step 2a:** Use the built-in Databricks `display(...)` function to view and explore `ssaNamesDF`.
// COMMAND ----------
// MAGIC %scala
// MAGIC // TODO
// COMMAND ----------
// MAGIC %python
// MAGIC # TODO
// COMMAND ----------
// MAGIC %scala
// MAGIC // ANSWER
// MAGIC display(ssaNamesDF)
// COMMAND ----------
// MAGIC %python
// MAGIC # ANSWER
// MAGIC display(ssaNamesDF)
// COMMAND ----------
// MAGIC %md
// MAGIC **Step 2b:** Print the schema of `ssaNamesDF`.
// MAGIC *Hint:* Since this is our first time viewing a schema, try: `ssaNamesDF.printSchema()`
// COMMAND ----------
// MAGIC %scala
// MAGIC // TODO
// COMMAND ----------
// MAGIC %python
// MAGIC # TODO
// COMMAND ----------
// MAGIC %scala
// MAGIC // ANSWER
// MAGIC ssaNamesDF.printSchema
// COMMAND ----------
// MAGIC %python
// MAGIC # ANSWER
// MAGIC ssaNamesDF.printSchema()
// COMMAND ----------
// MAGIC %md
// MAGIC **Step 3:** Create a temp view named `ssa_names` so you can query the DataFrame using SQL.
// COMMAND ----------
// MAGIC %scala
// MAGIC // TODO
// COMMAND ----------
// MAGIC %python
// MAGIC # TODO
// COMMAND ----------
// MAGIC %scala
// MAGIC // ANSWER
// MAGIC ssaNamesDF.createOrReplaceTempView("ssa_names")
// COMMAND ----------
// MAGIC %python
// MAGIC # ANSWER
// MAGIC ssaNamesDF.createOrReplaceTempView("ssa_names")
// COMMAND ----------
// MAGIC %scala
// MAGIC // TEST
// MAGIC // Run this test to verify you are on the right track.
// MAGIC assert(
// MAGIC spark.sql("SELECT * FROM ssa_names LIMIT 5").as[(String, String, Int, Int)].collect.toList ==
// MAGIC List(("Jennifer","F",54336,1983), ("Jessica","F",45278,1983), ("Amanda","F",33752,1983), ("Ashley","F",33292,1983), ("Sarah","F",27228,1983))
// MAGIC )
// COMMAND ----------
// MAGIC %python
// MAGIC # TEST
// MAGIC # Run this test to verify you are on the right track.
// MAGIC assert(
// MAGIC [tuple(row) for row in spark.sql("SELECT * FROM ssa_names LIMIT 5").collect()] ==
// MAGIC [(u'Jennifer', u'F', 54336, 1983), (u'Jessica', u'F', 45278, 1983), (u'Amanda', u'F', 33752, 1983), (u'Ashley', u'F', 33292, 1983), (u'Sarah', u'F', 27228, 1983)]
// MAGIC )
// COMMAND ----------
// MAGIC %md
// MAGIC **Step 4a:** Use SQL to find the 5 most popular girls' first names in 1990, and save the results as a DataFrame named `popularDF`.
// COMMAND ----------
// MAGIC %scala
// MAGIC // TODO
// MAGIC val popularDF = 0 // <<FILL_IN>>
// COMMAND ----------
// MAGIC %python
// MAGIC # TODO
// MAGIC popularDF = 0 # <<FILL_IN>>
// COMMAND ----------
// MAGIC %scala
// MAGIC // ANSWER
// MAGIC val sql = "SELECT firstName, total FROM ssa_names WHERE year = 1990 AND gender = 'F' ORDER BY total DESC LIMIT 5"
// MAGIC val popularDF = spark.sql(sql)
// COMMAND ----------
// MAGIC %python
// MAGIC # ANSWER
// MAGIC sql = "SELECT firstName, total FROM ssa_names WHERE year = 1990 AND gender = 'F' ORDER BY total DESC LIMIT 5"
// MAGIC popularDF = spark.sql(sql)
// COMMAND ----------
// MAGIC %scala
// MAGIC // Run this test case to verify your answer to the exercise above is correct.
// MAGIC assert(
// MAGIC popularDF.as[(String, Int)].collect().toSet ==
// MAGIC Set(("Jessica", 46466), ("Samantha", 25864), ("Ashley" ,45549), ("Brittany", 36535), ("Amanda", 34406))
// MAGIC )
// COMMAND ----------
// MAGIC %python
// MAGIC # Run this test case to verify your answer to the exercise above is correct.
// MAGIC assert(
// MAGIC set([(row.firstName, row.total) for row in popularDF.collect()]) ==
// MAGIC set([("Jessica", 46466), ("Samantha", 25864), ("Ashley" ,45549), ("Brittany", 36535), ("Amanda", 34406)])
// MAGIC )
// COMMAND ----------
// MAGIC %md
// MAGIC **Step 4b**: Use Databricks' `display(...)` function to view the contents of `popularDF`.
// COMMAND ----------
// MAGIC %scala
// MAGIC // TODO
// COMMAND ----------
// MAGIC %python
// MAGIC # TODO
// COMMAND ----------
// MAGIC %scala
// MAGIC // ANSWER
// MAGIC display(popularDF)
// COMMAND ----------
// MAGIC %python
// MAGIC # ANSWER
// MAGIC display(popularDF)
// COMMAND ----------
// MAGIC %md
// MAGIC <a name="Exercise2.5"></a>
// MAGIC **Step 5:** Run the same SQL statement as above, but this time use the %sql cell type in Databricks.
// COMMAND ----------
// MAGIC %sql
// MAGIC -- ALL_NOTEBOOKS
// MAGIC -- TODO
// COMMAND ----------
// MAGIC %sql
// MAGIC -- ALL_NOTEBOOKS
// MAGIC -- ANSWER
// MAGIC SELECT firstName, total FROM ssa_names WHERE year = 1990 AND gender = 'F' ORDER BY total DESC LIMIT 5
// COMMAND ----------
// MAGIC %md
// MAGIC ## Review
// MAGIC **Q:** How do you make a DataFrame accessible via SQL?
// MAGIC **A:** `myDataFrame.createOrReplaceTempView("my_table")`
// MAGIC
// MAGIC **Q:** How do you execute SQL within Python or Scala code?
// MAGIC **A:** `spark.sql("SELECT * FROM my_table")`
// MAGIC
// MAGIC **Q:** What's an easy way to execute SQL from within the Databricks environment?
// MAGIC **A:** `%sql SELECT * FROM my_table`
// COMMAND ----------
// MAGIC %md
// MAGIC ## Next Steps
// MAGIC * Link to Next Lesson (if multiple courses share a notebook, we'd have multiple choices here)
// MAGIC * Link to Challenge Exercise
// COMMAND ----------
// MAGIC %md
// MAGIC ## Additional Resources
// MAGIC * Q: Where can I read more about temp views...
// MAGIC A: Link...
// COMMAND ----------
// MAGIC %md
// MAGIC ## Challenge 1
// MAGIC
// MAGIC Repeat exercise 2, but this time find the most popular girl's first name in 1885, 1915, 1945, 1975, and 2005 using a single query, sorted by ascending year. Store the result in `historicNamesDF` and also display it. You may need to introduce additional variables along the way. No hints will be provided. If you recently completed exercise 2 above, you are free to assume `ssa_names` is already registered as a SQL view.
// COMMAND ----------
// MAGIC %scala
// MAGIC // TODO
// MAGIC val ssaNamesFile = "dbfs:/mnt/training/ssn/names.parquet"
// MAGIC val historicNamesDF = 0 // <<FILL_IN>>
// COMMAND ----------
// MAGIC %python
// MAGIC # TODO
// MAGIC ssaNamesFile = "dbfs:/mnt/training/ssn/names.parquet"
// MAGIC historicNamesDF = 0 # <<FILL_IN>>
// COMMAND ----------
// MAGIC %scala
// MAGIC // ANSWER
// MAGIC val historicNamesSql = """
// MAGIC SELECT firstName, year, total
// MAGIC FROM ssa_names
// MAGIC NATURAL INNER JOIN (
// MAGIC SELECT year, gender, MAX(total) AS total
// MAGIC FROM ssa_names
// MAGIC GROUP BY year, gender
// MAGIC ) as max_names
// MAGIC WHERE gender='F' AND year IN (1885, 1915, 1945, 1975, 2005)
// MAGIC ORDER BY year
// MAGIC """
// MAGIC val historicNamesDF = spark.sql(historicNamesSql)
// MAGIC display(historicNamesDF)
// COMMAND ----------
// MAGIC %python
// MAGIC # ANSWER
// MAGIC historicNamesSql = """
// MAGIC SELECT firstName, year, total
// MAGIC FROM ssa_names
// MAGIC NATURAL INNER JOIN (
// MAGIC SELECT year, gender, MAX(total) AS total
// MAGIC FROM ssa_names
// MAGIC GROUP BY year, gender
// MAGIC ) as max_names
// MAGIC WHERE gender='F' AND year IN (1885, 1915, 1945, 1975, 2005)
// MAGIC ORDER BY year
// MAGIC """
// MAGIC historicNamesDF = spark.sql(historicNamesSql)
// MAGIC display(historicNamesDF)
// COMMAND ----------
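The answer above uses a classic SQL pattern: join a table against a per-group aggregate of itself so that only each group's maximum row survives. The same pattern can be exercised with Python's standard-library sqlite3 (a stand-in engine used for illustration; the handful of rows below are invented, loosely echoing the SSA data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ssa_names "
             "(firstName TEXT, gender TEXT, total INTEGER, year INTEGER)")
conn.executemany("INSERT INTO ssa_names VALUES (?, ?, ?, ?)", [
    ("Mary", "F", 9128, 1885),
    ("Anna", "F", 8000, 1885),
    ("Jennifer", "F", 58185, 1975),
    ("Amy", "F", 40000, 1975),
])

# NATURAL JOIN matches on the shared columns (year, gender, total), so a row
# survives only if its total equals the maximum total for its (year, gender).
query = """
SELECT firstName, year, total
FROM ssa_names
NATURAL INNER JOIN (
    SELECT year, gender, MAX(total) AS total
    FROM ssa_names
    GROUP BY year, gender
) AS max_names
WHERE gender = 'F' AND year IN (1885, 1975)
ORDER BY year
"""
top = list(conn.execute(query))
print(top)  # [('Mary', 1885, 9128), ('Jennifer', 1975, 58185)]
```

The notebook's Spark SQL answer is the same query, just run through `spark.sql(...)` against the `ssa_names` temp view.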
// MAGIC %scala
// MAGIC // TEST
// MAGIC // Run this test case to verify your answer to the exercise above is correct.
// MAGIC assert(
// MAGIC historicNamesDF.as[(String, Int, Int)].collect().toList ==
// MAGIC List(("Mary",1885,9128), ("Mary",1915,58187), ("Mary",1945,59284), ("Jennifer",1975,58185), ("Emily",2005,23928))
// MAGIC )
// MAGIC
// MAGIC // Alternate approach to doing tests. It may be easier for a lay person to understand, but also less elegant.
// MAGIC assert(
// MAGIC historicNamesDF.collect().toList.toString ==
// MAGIC "List([Mary,1885,9128], [Mary,1915,58187], [Mary,1945,59284], [Jennifer,1975,58185], [Emily,2005,23928])"
// MAGIC )
// COMMAND ----------
// MAGIC %python
// MAGIC # TEST
// MAGIC # Run this test case to verify your answer to the exercise above is correct.
// MAGIC assert(
// MAGIC [(row.firstName, row.year, row.total) for row in historicNamesDF.collect()] ==
// MAGIC [(u'Mary', 1885, 9128), (u'Mary', 1915, 58187), (u'Mary', 1945, 59284), (u'Jennifer', 1975, 58185), (u'Emily', 2005, 23928)]
// MAGIC )
// MAGIC
// MAGIC # Alternate approach to doing tests. It may be easier for a lay person to understand, but also less elegant.
// MAGIC assert(
// MAGIC str(historicNamesDF.collect()) ==
// MAGIC "[Row(firstName=u'Mary', year=1885, total=9128), Row(firstName=u'Mary', year=1915, total=58187), Row(firstName=u'Mary', year=1945, total=59284), Row(firstName=u'Jennifer', year=1975, total=58185), Row(firstName=u'Emily', year=2005, total=23928)]"
// MAGIC )
-->
<link rel="stylesheet" type="text/css" href="https://databricks-prod-cloudfront.cloud.databricks.com/static/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855/lib/css/bootstrap.min.css">
<link rel="stylesheet" type="text/css" href="https://databricks-prod-cloudfront.cloud.databricks.com/static/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855/lib/jquery-ui-bundle/jquery-ui.min.css">
<link rel="stylesheet" type="text/css" href="https://databricks-prod-cloudfront.cloud.databricks.com/static/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855/css/main.css">
<link rel="stylesheet" href="https://databricks-prod-cloudfront.cloud.databricks.com/static/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855/css/print.css" media="print">
<link rel="icon" type="image/png" href="https://databricks-prod-cloudfront.cloud.databricks.com/static/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855/img/favicon.ico"/>
<script>window.settings = {"enableNotebookNotifications":true,"enableSshKeyUI":true,"defaultInteractivePricePerDBU":0.4,"enableOnDemandClusterType":true,"enableAutoCompleteAsYouType":[],"devTierName":"Community Edition","enableJobsPrefetching":true,"workspaceFeaturedLinks":[{"linkURI":"https://docs.databricks.com/index.html","displayName":"Documentation","icon":"question"},{"linkURI":"https://docs.databricks.com/release-notes/product/index.html","displayName":"Release Notes","icon":"code"},{"linkURI":"https://docs.databricks.com/spark/latest/training/index.html","displayName":"Training & Tutorials","icon":"graduation-cap"}],"enableClearStateFeature":true,"enableJobsAclsV2InUI":false,"dbcForumURL":"http://forums.databricks.com/","enableProtoClusterInfoDeltaPublisher":true,"enableAttachExistingCluster":true,"resetJobListOnConnect":true,"serverlessDefaultSparkVersion":"latest-stable-scala2.11","maxCustomTags":45,"serverlessDefaultMaxWorkers":20,"enableInstanceProfilesUIInJobs":true,"nodeInfo":{"node_types":[{"support_ssh":false,"spark_heap_memory":4800,"instance_type_id":"r3.2xlarge","spark_core_oversubscription_factor":8.0,"node_type_id":"class-node","description":"Class Node","support_cluster_tags":false,"container_memory_mb":6000,"node_instance_type":{"instance_type_id":"r3.2xlarge","provider":"AWS","local_disk_size_gb":160,"compute_units":26.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":3.64,"gpus":0,"memory_mb":62464,"num_cores":8,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":6144,"is_hidden":false,"category":"Community 
Edition","num_cores":0.88,"support_port_forwarding":false,"support_ebs_volumes":false,"is_deprecated":false},{"display_order":1,"support_ssh":true,"num_gpus":0,"spark_heap_memory":20396,"instance_type_id":"r3.xlarge","node_type_id":"r3.xlarge","description":"r3.xlarge","support_cluster_tags":true,"container_memory_mb":25495,"node_instance_type":{"instance_type_id":"r3.xlarge","provider":"AWS","local_disk_size_gb":80,"compute_units":13.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":3.64,"gpus":0,"memory_mb":31232,"num_cores":4,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":31232,"is_hidden":false,"category":"Memory Optimized","num_cores":4.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":1,"support_ssh":true,"num_gpus":0,"spark_heap_memory":44632,"instance_type_id":"r3.2xlarge","node_type_id":"r3.2xlarge","description":"r3.2xlarge","support_cluster_tags":true,"container_memory_mb":55790,"node_instance_type":{"instance_type_id":"r3.2xlarge","provider":"AWS","local_disk_size_gb":160,"compute_units":26.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":3.64,"gpus":0,"memory_mb":62464,"num_cores":8,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":62464,"is_hidden":false,"category":"Memory 
Optimized","num_cores":8.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":1,"support_ssh":true,"num_gpus":0,"spark_heap_memory":93104,"instance_type_id":"r3.4xlarge","node_type_id":"r3.4xlarge","description":"r3.4xlarge","support_cluster_tags":true,"container_memory_mb":116380,"node_instance_type":{"instance_type_id":"r3.4xlarge","provider":"AWS","local_disk_size_gb":320,"compute_units":52.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":3.64,"gpus":0,"memory_mb":124928,"num_cores":16,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":124928,"is_hidden":false,"category":"Memory Optimized","num_cores":16.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":1,"support_ssh":true,"num_gpus":0,"spark_heap_memory":190048,"instance_type_id":"r3.8xlarge","node_type_id":"r3.8xlarge","description":"r3.8xlarge","support_cluster_tags":true,"container_memory_mb":237560,"node_instance_type":{"instance_type_id":"r3.8xlarge","provider":"AWS","local_disk_size_gb":320,"compute_units":104.0,"number_of_ips":4,"local_disks":2,"reserved_compute_units":3.64,"gpus":0,"memory_mb":249856,"num_cores":32,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":249856,"is_hidden":false,"category":"Memory 
Optimized","num_cores":32.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":20396,"instance_type_id":"r4.xlarge","node_type_id":"r4.xlarge","description":"r4.xlarge","support_cluster_tags":true,"container_memory_mb":25495,"node_instance_type":{"instance_type_id":"r4.xlarge","provider":"AWS","local_disk_size_gb":0,"compute_units":13.5,"number_of_ips":4,"local_disks":0,"reserved_compute_units":3.64,"gpus":0,"memory_mb":31232,"num_cores":4,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":31232,"is_hidden":false,"category":"Memory Optimized","num_cores":4.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":44632,"instance_type_id":"r4.2xlarge","node_type_id":"r4.2xlarge","description":"r4.2xlarge","support_cluster_tags":true,"container_memory_mb":55790,"node_instance_type":{"instance_type_id":"r4.2xlarge","provider":"AWS","local_disk_size_gb":0,"compute_units":27.0,"number_of_ips":4,"local_disks":0,"reserved_compute_units":3.64,"gpus":0,"memory_mb":62464,"num_cores":8,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":62464,"is_hidden":false,"category":"Memory 
Optimized","num_cores":8.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":93104,"instance_type_id":"r4.4xlarge","node_type_id":"r4.4xlarge","description":"r4.4xlarge","support_cluster_tags":true,"container_memory_mb":116380,"node_instance_type":{"instance_type_id":"r4.4xlarge","provider":"AWS","local_disk_size_gb":0,"compute_units":53.0,"number_of_ips":4,"local_disks":0,"reserved_compute_units":3.64,"gpus":0,"memory_mb":124928,"num_cores":16,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":124928,"is_hidden":false,"category":"Memory Optimized","num_cores":16.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":190048,"instance_type_id":"r4.8xlarge","node_type_id":"r4.8xlarge","description":"r4.8xlarge","support_cluster_tags":true,"container_memory_mb":237560,"node_instance_type":{"instance_type_id":"r4.8xlarge","provider":"AWS","local_disk_size_gb":0,"compute_units":99.0,"number_of_ips":4,"local_disks":0,"reserved_compute_units":3.64,"gpus":0,"memory_mb":249856,"num_cores":32,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":249856,"is_hidden":false,"category":"Memory 
Optimized","num_cores":32.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":1,"support_ssh":true,"num_gpus":0,"spark_heap_memory":8079,"instance_type_id":"c3.2xlarge","node_type_id":"c3.2xlarge","description":"c3.2xlarge","support_cluster_tags":true,"container_memory_mb":10099,"node_instance_type":{"instance_type_id":"c3.2xlarge","provider":"AWS","local_disk_size_gb":80,"compute_units":28.0,"number_of_ips":4,"local_disks":2,"reserved_compute_units":3.64,"gpus":0,"memory_mb":15360,"num_cores":8,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":15360,"is_hidden":false,"category":"Compute Optimized","num_cores":8.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":1,"support_ssh":true,"num_gpus":0,"spark_heap_memory":19998,"instance_type_id":"c3.4xlarge","node_type_id":"c3.4xlarge","description":"c3.4xlarge","support_cluster_tags":true,"container_memory_mb":24998,"node_instance_type":{"instance_type_id":"c3.4xlarge","provider":"AWS","local_disk_size_gb":160,"compute_units":55.0,"number_of_ips":4,"local_disks":2,"reserved_compute_units":3.64,"gpus":0,"memory_mb":30720,"num_cores":16,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":30720,"is_hidden":false,"category":"Compute 
Optimized","num_cores":16.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":1,"support_ssh":true,"num_gpus":0,"spark_heap_memory":43837,"instance_type_id":"c3.8xlarge","node_type_id":"c3.8xlarge","description":"c3.8xlarge","support_cluster_tags":true,"container_memory_mb":54796,"node_instance_type":{"instance_type_id":"c3.8xlarge","provider":"AWS","local_disk_size_gb":320,"compute_units":108.0,"number_of_ips":4,"local_disks":2,"reserved_compute_units":3.64,"gpus":0,"memory_mb":61440,"num_cores":32,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":61440,"is_hidden":false,"category":"Compute Optimized","num_cores":32.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":8079,"instance_type_id":"c4.2xlarge","node_type_id":"c4.2xlarge","description":"c4.2xlarge","support_cluster_tags":true,"container_memory_mb":10099,"node_instance_type":{"instance_type_id":"c4.2xlarge","provider":"AWS","local_disk_size_gb":0,"compute_units":31.0,"number_of_ips":4,"local_disks":0,"reserved_compute_units":3.64,"gpus":0,"memory_mb":15360,"num_cores":8,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":15360,"is_hidden":false,"category":"Compute 
Optimized","num_cores":8.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":19998,"instance_type_id":"c4.4xlarge","node_type_id":"c4.4xlarge","description":"c4.4xlarge","support_cluster_tags":true,"container_memory_mb":24998,"node_instance_type":{"instance_type_id":"c4.4xlarge","provider":"AWS","local_disk_size_gb":0,"compute_units":62.0,"number_of_ips":4,"local_disks":0,"reserved_compute_units":3.64,"gpus":0,"memory_mb":30720,"num_cores":16,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":30720,"is_hidden":false,"category":"Compute Optimized","num_cores":16.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":43837,"instance_type_id":"c4.8xlarge","node_type_id":"c4.8xlarge","description":"c4.8xlarge","support_cluster_tags":true,"container_memory_mb":54796,"node_instance_type":{"instance_type_id":"c4.8xlarge","provider":"AWS","local_disk_size_gb":0,"compute_units":132.0,"number_of_ips":4,"local_disks":0,"reserved_compute_units":3.64,"gpus":0,"memory_mb":61440,"num_cores":36,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":61440,"is_hidden":false,"category":"Compute 
Optimized","num_cores":36.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":1,"support_ssh":true,"num_gpus":0,"spark_heap_memory":20396,"instance_type_id":"i2.xlarge","node_type_id":"i2.xlarge","description":"i2.xlarge","support_cluster_tags":true,"container_memory_mb":25495,"node_instance_type":{"instance_type_id":"i2.xlarge","provider":"AWS","local_disk_size_gb":800,"compute_units":14.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":3.64,"gpus":0,"memory_mb":31232,"num_cores":4,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":31232,"is_hidden":false,"category":"Storage Optimized","num_cores":4.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":1,"support_ssh":true,"num_gpus":0,"spark_heap_memory":44632,"instance_type_id":"i2.2xlarge","node_type_id":"i2.2xlarge","description":"i2.2xlarge","support_cluster_tags":true,"container_memory_mb":55790,"node_instance_type":{"instance_type_id":"i2.2xlarge","provider":"AWS","local_disk_size_gb":800,"compute_units":27.0,"number_of_ips":4,"local_disks":2,"reserved_compute_units":3.64,"gpus":0,"memory_mb":62464,"num_cores":8,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":62464,"is_hidden":false,"category":"Storage 
Optimized","num_cores":8.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":1,"support_ssh":true,"num_gpus":0,"spark_heap_memory":93104,"instance_type_id":"i2.4xlarge","node_type_id":"i2.4xlarge","description":"i2.4xlarge","support_cluster_tags":true,"container_memory_mb":116380,"node_instance_type":{"instance_type_id":"i2.4xlarge","provider":"AWS","local_disk_size_gb":800,"compute_units":53.0,"number_of_ips":4,"local_disks":4,"reserved_compute_units":3.64,"gpus":0,"memory_mb":124928,"num_cores":16,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":124928,"is_hidden":false,"category":"Storage Optimized","num_cores":16.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":1,"support_ssh":true,"num_gpus":0,"spark_heap_memory":190048,"instance_type_id":"i2.8xlarge","node_type_id":"i2.8xlarge","description":"i2.8xlarge","support_cluster_tags":true,"container_memory_mb":237560,"node_instance_type":{"instance_type_id":"i2.8xlarge","provider":"AWS","local_disk_size_gb":800,"compute_units":104.0,"number_of_ips":4,"local_disks":8,"reserved_compute_units":3.64,"gpus":0,"memory_mb":249856,"num_cores":32,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":249856,"is_hidden":false,"category":"Storage 
Optimized","num_cores":32.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":20396,"instance_type_id":"i3.xlarge","node_type_id":"i3.xlarge","description":"i3.xlarge","support_cluster_tags":true,"container_memory_mb":25495,"node_instance_type":{"instance_type_id":"i3.xlarge","provider":"AWS","local_disk_size_gb":950,"compute_units":13.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":3.64,"gpus":0,"memory_mb":31232,"num_cores":4,"local_disk_type":"NVME","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":31232,"is_hidden":false,"category":"Storage Optimized","num_cores":4.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":44632,"instance_type_id":"i3.2xlarge","node_type_id":"i3.2xlarge","description":"i3.2xlarge","support_cluster_tags":true,"container_memory_mb":55790,"node_instance_type":{"instance_type_id":"i3.2xlarge","provider":"AWS","local_disk_size_gb":1900,"compute_units":27.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":3.64,"gpus":0,"memory_mb":62464,"num_cores":8,"local_disk_type":"NVME","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":62464,"is_hidden":false,"category":"Storage 
Optimized","num_cores":8.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":93104,"instance_type_id":"i3.4xlarge","node_type_id":"i3.4xlarge","description":"i3.4xlarge","support_cluster_tags":true,"container_memory_mb":116380,"node_instance_type":{"instance_type_id":"i3.4xlarge","provider":"AWS","local_disk_size_gb":1900,"compute_units":53.0,"number_of_ips":4,"local_disks":2,"reserved_compute_units":3.64,"gpus":0,"memory_mb":124928,"num_cores":16,"local_disk_type":"NVME","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":124928,"is_hidden":false,"category":"Storage Optimized","num_cores":16.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":190048,"instance_type_id":"i3.8xlarge","node_type_id":"i3.8xlarge","description":"i3.8xlarge","support_cluster_tags":true,"container_memory_mb":237560,"node_instance_type":{"instance_type_id":"i3.8xlarge","provider":"AWS","local_disk_size_gb":1900,"compute_units":99.0,"number_of_ips":4,"local_disks":4,"reserved_compute_units":3.64,"gpus":0,"memory_mb":249856,"num_cores":32,"local_disk_type":"NVME","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":249856,"is_hidden":false,"category":"Storage Optimized","num_cores":32.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":1,"spark_heap_memory":44632,"instance_type_id":"p2.xlarge","node_type_id":"p2.xlarge","description":"p2.xlarge 
(beta)","support_cluster_tags":true,"container_memory_mb":55790,"node_instance_type":{"instance_type_id":"p2.xlarge","provider":"AWS","local_disk_size_gb":0,"compute_units":12.0,"number_of_ips":4,"local_disks":0,"reserved_compute_units":3.64,"gpus":1,"memory_mb":62464,"num_cores":4,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":62464,"is_hidden":false,"category":"GPU Accelerated","num_cores":4.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":8,"spark_heap_memory":383936,"instance_type_id":"p2.8xlarge","node_type_id":"p2.8xlarge","description":"p2.8xlarge (beta)","support_cluster_tags":true,"container_memory_mb":479920,"node_instance_type":{"instance_type_id":"p2.8xlarge","provider":"AWS","local_disk_size_gb":0,"compute_units":94.0,"number_of_ips":4,"local_disks":0,"reserved_compute_units":3.64,"gpus":8,"memory_mb":499712,"num_cores":32,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":499712,"is_hidden":false,"category":"GPU Accelerated","num_cores":32.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":16,"spark_heap_memory":577824,"instance_type_id":"p2.16xlarge","node_type_id":"p2.16xlarge","description":"p2.16xlarge 
(beta)","support_cluster_tags":true,"container_memory_mb":722280,"node_instance_type":{"instance_type_id":"p2.16xlarge","provider":"AWS","local_disk_size_gb":0,"compute_units":188.0,"number_of_ips":4,"local_disks":0,"reserved_compute_units":3.64,"gpus":16,"memory_mb":749568,"num_cores":64,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":749568,"is_hidden":false,"category":"GPU Accelerated","num_cores":64.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":8873,"instance_type_id":"m4.xlarge","node_type_id":"m4.xlarge","description":"m4.xlarge","support_cluster_tags":true,"container_memory_mb":11092,"node_instance_type":{"instance_type_id":"m4.xlarge","provider":"AWS","local_disk_size_gb":0,"compute_units":13.0,"number_of_ips":4,"local_disks":0,"reserved_compute_units":3.64,"gpus":0,"memory_mb":16384,"num_cores":4,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":16384,"is_hidden":false,"category":"General 
Purpose","num_cores":4.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":21587,"instance_type_id":"m4.2xlarge","node_type_id":"m4.2xlarge","description":"m4.2xlarge","support_cluster_tags":true,"container_memory_mb":26984,"node_instance_type":{"instance_type_id":"m4.2xlarge","provider":"AWS","local_disk_size_gb":0,"compute_units":26.0,"number_of_ips":4,"local_disks":0,"reserved_compute_units":3.64,"gpus":0,"memory_mb":32768,"num_cores":8,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":32768,"is_hidden":false,"category":"General Purpose","num_cores":8.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":47015,"instance_type_id":"m4.4xlarge","node_type_id":"m4.4xlarge","description":"m4.4xlarge","support_cluster_tags":true,"container_memory_mb":58769,"node_instance_type":{"instance_type_id":"m4.4xlarge","provider":"AWS","local_disk_size_gb":0,"compute_units":53.5,"number_of_ips":4,"local_disks":0,"reserved_compute_units":3.64,"gpus":0,"memory_mb":65536,"num_cores":16,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":65536,"is_hidden":false,"category":"General 
Purpose","num_cores":16.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":123299,"instance_type_id":"m4.10xlarge","node_type_id":"m4.10xlarge","description":"m4.10xlarge","support_cluster_tags":true,"container_memory_mb":154124,"node_instance_type":{"instance_type_id":"m4.10xlarge","provider":"AWS","local_disk_size_gb":0,"compute_units":124.5,"number_of_ips":4,"local_disks":0,"reserved_compute_units":3.64,"gpus":0,"memory_mb":163840,"num_cores":40,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":163840,"is_hidden":false,"category":"General Purpose","num_cores":40.0,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"support_ssh":false,"spark_heap_memory":23800,"instance_type_id":"r3.2xlarge","node_type_id":"memory-optimized","description":"Memory Optimized (legacy)","support_cluster_tags":false,"container_memory_mb":28000,"node_instance_type":{"instance_type_id":"r3.2xlarge","provider":"AWS","local_disk_size_gb":160,"compute_units":26.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":3.64,"gpus":0,"memory_mb":62464,"num_cores":8,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":30720,"is_hidden":true,"category":"Memory Optimized","num_cores":4.0,"support_port_forwarding":false,"support_ebs_volumes":false,"is_deprecated":true},{"support_ssh":false,"spark_heap_memory":9702,"instance_type_id":"c3.4xlarge","node_type_id":"compute-optimized","description":"Compute Optimized 
(legacy)","support_cluster_tags":false,"container_memory_mb":12128,"node_instance_type":{"instance_type_id":"c3.4xlarge","provider":"AWS","local_disk_size_gb":160,"compute_units":55.0,"number_of_ips":4,"local_disks":2,"reserved_compute_units":3.64,"gpus":0,"memory_mb":30720,"num_cores":16,"local_disk_type":"AHCI","max_attachable_disks":10,"supported_disk_types":[{"ebs_volume_type":"GENERAL_PURPOSE_SSD"},{"ebs_volume_type":"THROUGHPUT_OPTIMIZED_HDD"}],"reserved_memory_mb":4800},"memory_mb":15360,"is_hidden":true,"category":"Compute Optimized","num_cores":8.0,"support_port_forwarding":false,"support_ebs_volumes":false,"is_deprecated":true}],"default_node_type_id":"i3.xlarge"},"sqlAclsDisabledMap":{"spark.databricks.acl.enabled":"false","spark.databricks.acl.sqlOnly":"false"},"enableDatabaseSupportClusterChoice":true,"enableClusterAcls":true,"notebookRevisionVisibilityHorizon":0,"serverlessClusterProductName":"Serverless Pool","maxEbsVolumesPerInstance":10,"isAdmin":true,"deltaProcessingBatchSize":1000,"timerUpdateQueueLength":100,"sqlAclsEnabledMap":{"spark.databricks.acl.enabled":"true","spark.databricks.acl.sqlOnly":"true"},"enableLargeResultDownload":true,"maxElasticDiskCapacityGB":5000,"enableManageAccountTab":true,"serverlessDefaultMinWorkers":2,"zoneInfos":[{"id":"us-west-2a","isDefault":true},{"id":"us-west-2c","isDefault":false},{"id":"us-west-2b","isDefault":false}],"enableCustomSpotPricingUIByTier":true,"serverlessClustersEnabled":true,"enableFindAndReplace":true,"enableEBSVolumesUIForJobs":true,"enablePublishNotebooks":true,"enableBitbucketCloud":false,"enableMaxConcurrentRuns":true,"enableJobAclsConfig":false,"enableFullTextSearch":true,"enableElasticSparkUI":true,"enableNewClustersCreate":true,"clusters":true,"allowRunOnPendingClusters":true,"useAutoscalingByDefault":true,"enableAzureToolbar":false,"fileStoreBase":"FileStore","enableRLibraries":false,"enableSshKeyUIInJobs":true,"enableDetachAndAttachSubMenu":true,"configurableSparkOptionsSpec":[{"keyPatte
rn":"spark\\.kryo(\\.[^\\.]+)+","valuePattern":".*","keyPatternDisplay":"spark.kryo.*","valuePatternDisplay":"*","description":"Configuration options for Kryo serialization"},{"keyPattern":"spark\\.io\\.compression\\.codec","valuePattern":"(lzf|snappy|org\\.apache\\.spark\\.io\\.LZFCompressionCodec|org\\.apache\\.spark\\.io\\.SnappyCompressionCodec)","keyPatternDisplay":"spark.io.compression.codec","valuePatternDisplay":"snappy|lzf","description":"The codec used to compress internal data such as RDD partitions, broadcast variables and shuffle outputs."},{"keyPattern":"spark\\.serializer","valuePattern":"(org\\.apache\\.spark\\.serializer\\.JavaSerializer|org\\.apache\\.spark\\.serializer\\.KryoSerializer)","keyPatternDisplay":"spark.serializer","valuePatternDisplay":"org.apache.spark.serializer.JavaSerializer|org.apache.spark.serializer.KryoSerializer","description":"Class to use for serializing objects that will be sent over the network or need to be cached in serialized form."},{"keyPattern":"spark\\.rdd\\.compress","valuePattern":"(true|false)","keyPatternDisplay":"spark.rdd.compress","valuePatternDisplay":"true|false","description":"Whether to compress serialized RDD partitions (e.g. for StorageLevel.MEMORY_ONLY_SER). 
Can save substantial space at the cost of some extra CPU time."},{"keyPattern":"spark\\.speculation","valuePattern":"(true|false)","keyPatternDisplay":"spark.speculation","valuePatternDisplay":"true|false","description":"Whether to use speculation (recommended off for streaming)"},{"keyPattern":"spark\\.es(\\.[^\\.]+)+","valuePattern":".*","keyPatternDisplay":"spark.es.*","valuePatternDisplay":"*","description":"Configuration options for ElasticSearch"},{"keyPattern":"es(\\.([^\\.]+))+","valuePattern":".*","keyPatternDisplay":"es.*","valuePatternDisplay":"*","description":"Configuration options for ElasticSearch"},{"keyPattern":"spark\\.(storage|shuffle)\\.memoryFraction","valuePattern":"0?\\.0*([1-9])([0-9])*","keyPatternDisplay":"spark.(storage|shuffle).memoryFraction","valuePatternDisplay":"(0.0,1.0)","description":"Fraction of Java heap to use for Spark's shuffle or storage"},{"keyPattern":"spark\\.streaming\\.backpressure\\.enabled","valuePattern":"(true|false)","keyPatternDisplay":"spark.streaming.backpressure.enabled","valuePatternDisplay":"true|false","description":"Enables or disables Spark Streaming's internal backpressure mechanism (since 1.5). This enables the Spark Streaming to control the receiving rate based on the current batch scheduling delays and processing times so that the system receives only as fast as the system can process. Internally, this dynamically sets the maximum receiving rate of receivers. This rate is upper bounded by the values `spark.streaming.receiver.maxRate` and `spark.streaming.kafka.maxRatePerPartition` if they are set."},{"keyPattern":"spark\\.streaming\\.receiver\\.maxRate","valuePattern":"^([0-9]{1,})$","keyPatternDisplay":"spark.streaming.receiver.maxRate","valuePatternDisplay":"numeric","description":"Maximum rate (number of records per second) at which each receiver will receive data. Effectively, each stream will consume at most this number of records per second. 
Setting this configuration to 0 or a negative number will put no limit on the rate. See the deployment guide in the Spark Streaming programing guide for mode details."},{"keyPattern":"spark\\.streaming\\.kafka\\.maxRatePerPartition","valuePattern":"^([0-9]{1,})$","keyPatternDisplay":"spark.streaming.kafka.maxRatePerPartition","valuePatternDisplay":"numeric","description":"Maximum rate (number of records per second) at which data will be read from each Kafka partition when using the Kafka direct stream API introduced in Spark 1.3. See the Kafka Integration guide for more details."},{"keyPattern":"spark\\.streaming\\.kafka\\.maxRetries","valuePattern":"^([0-9]{1,})$","keyPatternDisplay":"spark.streaming.kafka.maxRetries","valuePatternDisplay":"numeric","description":"Maximum number of consecutive retries the driver will make in order to find the latest offsets on the leader of each partition (a default value of 1 means that the driver will make a maximum of 2 attempts). Only applies to the Kafka direct stream API introduced in Spark 1.3."},{"keyPattern":"spark\\.streaming\\.ui\\.retainedBatches","valuePattern":"^([0-9]{1,})$","keyPatternDisplay":"spark.streaming.ui.retainedBatches","valuePatternDisplay":"numeric","description":"How many batches the Spark Streaming UI and status APIs remember before garbage collecting."},{"keyPattern":"\\.*","valuePattern":"\\.*","keyPatternDisplay":"","valuePatternDisplay":"","description":""}],"enableReactNotebookComments":true,"bitbucketCloudBaseApiV1Url":"https://api.bitbucket.org/1.0","enableAdminPasswordReset":false,"checkBeforeAddingAadUser":false,"enableResetPassword":true,"maxClusterTagValueLength":255,"enableJobsSparkUpgrade":true,"perClusterAutoterminationEnabled":true,"enableNotebookCommandNumbers":true,"sparkVersions":[{"key":"1.6.3-db2-hadoop2-scala2.10","displayName":"Spark 1.6.3-db2 (Hadoop 2, Scala 
2.10)","packageLabel":"spark-image-aba860a0ffce4f3471fb14aefdcb1d768ac66a53a5ad884c48745ef98aeb9d67","upgradable":true,"deprecated":false,"customerVisible":true},{"key":"2.1.1-db5-scala2.11","displayName":"Spark 2.1.1-db5 (Scala 2.11)","packageLabel":"spark-image-08d9fc1551087e0876236f19640c4a83116b1649f15137427d21c9056656e80e","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"1.6.x-ubuntu15.10","displayName":"Spark 1.6.x (Hadoop 1)","packageLabel":"spark-image-8cea23fb9094e174bf5815d79009f4a8e383eb86cf2909cf6c6434ed8da2a16a","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"1.4.x-ubuntu15.10","displayName":"Spark 1.4.1 (Hadoop 1, deprecated)","packageLabel":"spark-image-f710650fb8aaade8e4e812368ea87c45cd8cd0b5e6894ca6c94f3354e8daa6dc","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"2.2.x-scala2.11","displayName":"3.0 (includes Apache Spark 2.2.0, Scala 2.11)","packageLabel":"spark-image-67ab3a06d1e83d5b60df7063245eb419a2e9fe329aeeb7e7d9713332c669bb17","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"2.1.1-db6-scala2.10","displayName":"Spark 2.1.1-db6 (Scala 2.10)","packageLabel":"spark-image-177f3f02a6a3432d30068332dc857b9161345bdd2ee8a2d2de05bb05cb4b0f4c","upgradable":true,"deprecated":false,"customerVisible":true},{"key":"2.1.0-db2-scala2.11","displayName":"Spark 2.1.0-db2 (Scala 2.11)","packageLabel":"spark-image-267c4490a3ab8a39acdbbd9f1d36f6decdecebf013e30dd677faff50f1d9cf8b","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"2.1.x-gpu-scala2.11","displayName":"Spark 2.1 (Auto-updating, GPU, Scala 2.11 experimental)","packageLabel":"spark-image-d613235f93e0f29838beb2079a958c02a192ed67a502192bc67a8a5f2fb37f35","upgradable":true,"deprecated":false,"customerVisible":true},{"key":"2.0.0-ubuntu15.10-scala2.10","displayName":"Spark 2.0.0 (Scala 
2.10)","packageLabel":"spark-image-073c1b52ace74f251fae2680624a0d8d184a8b57096d1c21c5ce56c29be6a37a","upgradable":true,"deprecated":true,"customerVisible":false},{"key":"2.0.2-db3-scala2.10","displayName":"Spark 2.0.2-db3 (Scala 2.10)","packageLabel":"spark-image-584091dedb690de20e8cf22d9e02fdcce1281edda99eedb441a418d50e28088f","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"3.2.x-scala2.10","displayName":"3.2 (includes Apache Spark 2.2.0, Scala 2.10)","packageLabel":"spark-image-41f897ece30937788762beadd02cfc0e713bed96dfb4720caa05b6a36054c97a","upgradable":true,"deprecated":false,"customerVisible":true},{"key":"latest-experimental-scala2.10","displayName":"Latest experimental (3.3 snapshot, Scala 2.10)","packageLabel":"spark-image-9ff419cbb340316368aaeeec4fbad17f913484e64f22d1aa64b92ed8ebeed811","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"2.1.0-db1-scala2.11","displayName":"Spark 2.1.0-db1 (Scala 2.11)","packageLabel":"spark-image-e8ad5b72cf0f899dcf2b4720c1f572ab0e87a311d6113b943b4e1d4a7edb77eb","upgradable":true,"deprecated":true,"customerVisible":false},{"key":"2.1.1-db4-scala2.11","displayName":"Spark 2.1.1-db4 (Scala 2.11)","packageLabel":"spark-image-52bca0ca866e3f4243d3820a783abf3b9b3b553edf234abef14b892657ceaca9","upgradable":true,"deprecated":false,"customerVisible":true},{"key":"latest-rc-scala2.11","displayName":"Latest RC (3.3 snapshot, Scala 2.11)","packageLabel":"spark-image-5d7ad1a3593e894f15be9edf93ab9475a0b69044fcfbe22c8dcf256aa6bc7476","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"latest-stable-scala2.11","displayName":"Latest stable (3.2, Scala 2.11)","packageLabel":"spark-image-43a5607abde06ae819f147c7619fb7ad5fee521724009ab575b347cea006d7ef","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"2.1.0-db2-scala2.10","displayName":"Spark 2.1.0-db2 (Scala 
2.10)","packageLabel":"spark-image-a2ca4f6b58c95f78dca91b1340305ab3fe32673bd894da2fa8e1dc8a9f8d0478","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"1.6.x-ubuntu15.10-hadoop1","displayName":"Spark 1.6.x (Hadoop 1)","packageLabel":"spark-image-8cea23fb9094e174bf5815d79009f4a8e383eb86cf2909cf6c6434ed8da2a16a","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"2.0.2-db4-scala2.11","displayName":"Spark 2.0.2-db4 (Scala 2.11)","packageLabel":"spark-image-7dbc7583e8271765b8a1508cb9e832768e35489bbde2c4c790bc6766aee2fd7f","upgradable":true,"deprecated":false,"customerVisible":true},{"key":"1.6.1-ubuntu15.10-hadoop1","displayName":"Spark 1.6.1 (Hadoop 1)","packageLabel":"spark-image-21d1cac181b7b8856dd1b4214a3a734f95b5289089349db9d9c926cb87d843db","upgradable":true,"deprecated":true,"customerVisible":false},{"key":"2.0.x-gpu-scala2.11","displayName":"Spark 2.0 (Auto-updating, GPU, Scala 2.11 experimental)","packageLabel":"spark-image-968b89f1d0ec32e1ee4dacd04838cae25ef44370a441224177a37980d539d83a","upgradable":true,"deprecated":false,"customerVisible":true},{"key":"1.6.2-ubuntu15.10-hadoop1","displayName":"Spark 1.6.2 (Hadoop 1)","packageLabel":"spark-image-8cea23fb9094e174bf5815d79009f4a8e383eb86cf2909cf6c6434ed8da2a16a","upgradable":true,"deprecated":true,"customerVisible":false},{"key":"1.6.3-db1-hadoop2-scala2.10","displayName":"Spark 1.6.3-db1 (Hadoop 2, Scala 2.10)","packageLabel":"spark-image-eaa8d9b990015a14e032fb2e2e15be0b8d5af9627cd01d855df728b67969d5d9","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"1.6.3-db2-hadoop1-scala2.10","displayName":"Spark 1.6.3-db2 (Hadoop 1, Scala 2.10)","packageLabel":"spark-image-14112ea0645bea94333a571a150819ce85573cf5541167d905b7e6588645cf3b","upgradable":true,"deprecated":false,"customerVisible":true},{"key":"1.6.2-ubuntu15.10-hadoop2","displayName":"Spark 1.6.2 (Hadoop 
2)","packageLabel":"spark-image-161245e66d887cd775e23286a54bab0b146143e1289f25bd1732beac454a1561","upgradable":true,"deprecated":true,"customerVisible":false},{"key":"1.6.1-ubuntu15.10-hadoop2","displayName":"Spark 1.6.1 (Hadoop 2)","packageLabel":"spark-image-4cafdf8bc6cba8edad12f441e3b3f0a8ea27da35c896bc8290e16b41fd15496a","upgradable":true,"deprecated":true,"customerVisible":false},{"key":"2.0.2-db2-scala2.10","displayName":"Spark 2.0.2-db2 (Scala 2.10)","packageLabel":"spark-image-36d48f22cca7a907538e07df71847dd22aaf84a852c2eeea2dcefe24c681602f","upgradable":true,"deprecated":true,"customerVisible":false},{"key":"2.0.x-ubuntu15.10-scala2.11","displayName":"Spark 2.0 (Ubuntu 15.10, Scala 2.11, deprecated)","packageLabel":"spark-image-8e1c50d626a52eac5a6c8129e09ae206ba9890f4523775f77af4ad6d99a64c44","upgradable":true,"deprecated":true,"customerVisible":false},{"key":"2.0.x-scala2.10","displayName":"Spark 2.0 (Auto-updating, Scala 2.10)","packageLabel":"spark-image-859e88079f97f58d50e25163b39a1943d1eeac0b6939c5a65faba986477e311a","upgradable":true,"deprecated":false,"customerVisible":true},{"key":"2.1.1-db4-scala2.10","displayName":"Spark 2.1.1-db4 (Scala 2.10)","packageLabel":"spark-image-c7c0224de396cd1563addc1ae4bca6ba823780b6babe6c3729ddf73008f29ba4","upgradable":true,"deprecated":false,"customerVisible":true},{"key":"latest-rc-scala2.10","displayName":"Latest RC (3.3 snapshot, Scala 2.10)","packageLabel":"spark-image-9ff419cbb340316368aaeeec4fbad17f913484e64f22d1aa64b92ed8ebeed811","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"latest-stable-scala2.10","displayName":"Latest stable (3.2, Scala 2.10)","packageLabel":"spark-image-41f897ece30937788762beadd02cfc0e713bed96dfb4720caa05b6a36054c97a","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"2.0.2-db1-scala2.11","displayName":"Spark 2.0.2-db1 (Scala 
2.11)","packageLabel":"spark-image-c2d623f03dd44097493c01aa54a941fc31978ebe6d759b36c75b716b2ff6ab9c","upgradable":true,"deprecated":true,"customerVisible":false},{"key":"2.0.2-db4-scala2.10","displayName":"Spark 2.0.2-db4 (Scala 2.10)","packageLabel":"spark-image-859e88079f97f58d50e25163b39a1943d1eeac0b6939c5a65faba986477e311a","upgradable":true,"deprecated":false,"customerVisible":true},{"key":"2.1.1-db5-scala2.10","displayName":"Spark 2.1.1-db5 (Scala 2.10)","packageLabel":"spark-image-74133df2c13950431298d1cab3e865c191d83ac33648a8590495c52fc644c654","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"1.5.x-ubuntu15.10","displayName":"Spark 1.5.2 (Hadoop 1, deprecated)","packageLabel":"spark-image-c9d2a8abf41f157a4acc6d52bc721090346f6fea2de356f3a66e388f54481698","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"2.2.x-scala2.10","displayName":"3.0 (includes Apache Spark 2.2.0, Scala 2.10)","packageLabel":"spark-image-d549f2d4a523994ecdf37e531b51d5ec7d8be51534bb0ca5322eaad28ba8f557","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"3.0.x-scala2.11","displayName":"3.0 (includes Apache Spark 2.2.0, Scala 2.11)","packageLabel":"spark-image-67ab3a06d1e83d5b60df7063245eb419a2e9fe329aeeb7e7d9713332c669bb17","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"2.0.x-scala2.11","displayName":"Spark 2.0 (Auto-updating, Scala 2.11)","packageLabel":"spark-image-7dbc7583e8271765b8a1508cb9e832768e35489bbde2c4c790bc6766aee2fd7f","upgradable":true,"deprecated":false,"customerVisible":true},{"key":"2.1.x-scala2.10","displayName":"Spark 2.1 (Auto-updating, Scala 2.10)","packageLabel":"spark-image-177f3f02a6a3432d30068332dc857b9161345bdd2ee8a2d2de05bb05cb4b0f4c","upgradable":true,"deprecated":false,"customerVisible":true},{"key":"3.1.x-scala2.11","displayName":"3.1 (includes Apache Spark 2.2.0, Scala 
2.11)","packageLabel":"spark-image-0e50da7a73d3bc76fcca8ac8a67c1e036c5a30deb5663adf6d0f332cc8ec2c90","upgradable":true,"deprecated":false,"customerVisible":true},{"key":"2.1.0-db3-scala2.10","displayName":"Spark 2.1.0-db3 (Scala 2.10)","packageLabel":"spark-image-25a17d070af155f10c4232dcc6248e36a2eb48c24f8d4fc00f34041b86bd1626","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"2.0.2-db2-scala2.11","displayName":"Spark 2.0.2-db2 (Scala 2.11)","packageLabel":"spark-image-4fa852ba378e97815083b96c9cada7b962a513ec23554a5fc849f7f1dd8c065a","upgradable":true,"deprecated":true,"customerVisible":false},{"key":"3.1.x-scala2.10","displayName":"3.1 (includes Apache Spark 2.2.0, Scala 2.10)","packageLabel":"spark-image-0c03d5f78139022c37032b5f3ffac5b5a4445d9cab8733e333f16a67a47262f9","upgradable":true,"deprecated":false,"customerVisible":true},{"key":"1.3.x-ubuntu15.10","displayName":"Spark 1.3.0 (Hadoop 1, deprecated)","packageLabel":"spark-image-40d2842670bc3dc178b14042501847d76171437ccf70613fa397a7a24c48b912","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"2.0.1-db1-scala2.11","displayName":"Spark 2.0.1-db1 (Scala 2.11)","packageLabel":"spark-image-10ab19f634bbfdb860446c326a9f76dc25bfa87de6403b980566279142a289ea","upgradable":true,"deprecated":true,"customerVisible":false},{"key":"2.0.2-db3-scala2.11","displayName":"Spark 2.0.2-db3 (Scala 2.11)","packageLabel":"spark-image-7fd7aaa89d55692e429115ae7eac3b1a1dc4de705d50510995f34306b39c2397","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"2.1.1-db6-scala2.11","displayName":"Spark 2.1.1-db6 (Scala 2.11)","packageLabel":"spark-image-fdad9ef557700d7a8b6bde86feccbcc3c71d1acdc838b0fd299bd19956b1076e","upgradable":true,"deprecated":false,"customerVisible":true},{"key":"1.6.3-db1-hadoop1-scala2.10","displayName":"Spark 1.6.3-db1 (Hadoop 1, Scala 
2.10)","packageLabel":"spark-image-d50af1032799546b8ccbeeb76889a20c819ebc2a0e68ea20920cb30d3895d3ae","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"2.0.2-db1-scala2.10","displayName":"Spark 2.0.2-db1 (Scala 2.10)","packageLabel":"spark-image-654bdd6e9bad70079491987d853b4b7abf3b736fff099701501acaabe0e75c41","upgradable":true,"deprecated":true,"customerVisible":false},{"key":"2.0.x-ubuntu15.10","displayName":"Spark 2.0 (Ubuntu 15.10, Scala 2.10, deprecated)","packageLabel":"spark-image-a659f3909d51b38d297b20532fc807ecf708cfb7440ce9b090c406ab0c1e4b7e","upgradable":true,"deprecated":true,"customerVisible":false},{"key":"latest-experimental-scala2.11","displayName":"Latest experimental (3.3 snapshot, Scala 2.11)","packageLabel":"spark-image-5d7ad1a3593e894f15be9edf93ab9475a0b69044fcfbe22c8dcf256aa6bc7476","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"3.2.x-scala2.11","displayName":"3.2 (includes Apache Spark 2.2.0, Scala 2.11)","packageLabel":"spark-image-43a5607abde06ae819f147c7619fb7ad5fee521724009ab575b347cea006d7ef","upgradable":true,"deprecated":false,"customerVisible":true},{"key":"2.0.1-db1-scala2.10","displayName":"Spark 2.0.1-db1 (Scala 2.10)","packageLabel":"spark-image-5a13c2db3091986a4e7363006cc185c5b1108c7761ef5d0218506cf2e6643840","upgradable":true,"deprecated":true,"customerVisible":false},{"key":"2.1.x-scala2.11","displayName":"Spark 2.1 (Auto-updating, Scala 2.11)","packageLabel":"spark-image-fdad9ef557700d7a8b6bde86feccbcc3c71d1acdc838b0fd299bd19956b1076e","upgradable":true,"deprecated":false,"customerVisible":true},{"key":"2.1.0-db1-scala2.10","displayName":"Spark 2.1.0-db1 (Scala 2.10)","packageLabel":"spark-image-f0ab82a5deb7908e0d159e9af066ba05fb56e1edb35bdad41b7ad2fd62a9b546","upgradable":true,"deprecated":true,"customerVisible":false},{"key":"3.0.x-scala2.10","displayName":"3.0 (includes Apache Spark 2.2.0, Scala 
2.10)","packageLabel":"spark-image-d549f2d4a523994ecdf37e531b51d5ec7d8be51534bb0ca5322eaad28ba8f557","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"1.6.0-ubuntu15.10","displayName":"Spark 1.6.0 (Hadoop 1)","packageLabel":"spark-image-10ef758029b8c7e19cd7f4fb52fff9180d75db92ca071bd94c47f3c1171a7cb5","upgradable":true,"deprecated":true,"customerVisible":false},{"key":"1.6.x-ubuntu15.10-hadoop2","displayName":"Spark 1.6.x (Hadoop 2)","packageLabel":"spark-image-161245e66d887cd775e23286a54bab0b146143e1289f25bd1732beac454a1561","upgradable":true,"deprecated":false,"customerVisible":false},{"key":"2.0.0-ubuntu15.10-scala2.11","displayName":"Spark 2.0.0 (Scala 2.11)","packageLabel":"spark-image-b4ec141e751f201399f8358a82efee202560f7ed05e1a04a2ae8778f6324b909","upgradable":true,"deprecated":true,"customerVisible":false},{"key":"2.1.0-db3-scala2.11","displayName":"Spark 2.1.0-db3 (Scala 2.11)","packageLabel":"spark-image-ccbc6b73f158e2001fc1fb8c827bfdde425d8bd6d65cb7b3269784c28bb72c16","upgradable":true,"deprecated":false,"customerVisible":false}],"enablePresentationMode":false,"enableClearStateAndRunAll":true,"enableRestrictedClusterCreation":false,"enableFeedback":true,"enableClusterAutoScaling":true,"enableUserVisibleDefaultTags":true,"defaultNumWorkers":8,"serverContinuationTimeoutMillis":10000,"driverStderrFilePrefix":"stderr","enableNotebookRefresh":true,"accountsOwnerUrl":"https://accounts.cloud.databricks.com/registration.html#login","driverStdoutFilePrefix":"stdout","showDbuPricing":true,"defaultNodeTypeToPricingUnitsMap":{"r3.2xlarge":2,"i3.4xlarge":4,"class-node":1,"m4.2xlarge":1.5,"r4.xlarge":1,"m4.4xlarge":3,"r4.16xlarge":16,"Standard_DS11":0.5,"p2.8xlarge":16,"m4.10xlarge":8,"r3.8xlarge":8,"r4.4xlarge":4,"dev-tier-node":1,"c3.8xlarge":4,"r3.4xlarge":4,"i2.4xlarge":6,"m4.xlarge":0.75,"r4.8xlarge":8,"r4.large":0.5,"Standard_DS12":1,"development-node":1,"i2.2xlarge":3,"g2.8xlarge":6,"i3.large":0.75,"memory-optimized":1,"m4.large":0.375,"p2.1
6xlarge":24,"i3.8xlarge":8,"i3.16xlarge":16,"Standard_DS12_v2":1,"Standard_DS13":2,"Standard_DS11_v2":0.5,"Standard_DS13_v2":2,"c3.2xlarge":1,"Standard_L4s":1.5,"c4.2xlarge":1,"i2.xlarge":1.5,"compute-optimized":1,"c4.4xlarge":2,"i3.2xlarge":2,"c3.4xlarge":2,"g2.2xlarge":1.5,"p2.xlarge":2,"m4.16xlarge":12,"c4.8xlarge":4,"i3.xlarge":1,"r3.xlarge":1,"r4.2xlarge":2,"i2.8xlarge":12},"enableSparkDocsSearch":true,"sparkHistoryServerEnabled":true,"enableEBSVolumesUI":true,"metastoreServiceRowLimit":1000000,"enableIPythonImportExport":true,"enableClusterTagsUIForJobs":true,"enableClusterTagsUI":true,"enableNotebookHistoryDiffing":true,"branch":"2.53.632","accountsLimit":-1,"enableSparkEnvironmentVariables":true,"enableX509Authentication":false,"enableStructuredStreamingNbOptimizations":true,"enableNotebookGitBranching":true,"local":false,"enableNotebookLazyRenderWrapper":false,"enableClusterAutoScalingForJobs":false,"enableStrongPassword":false,"displayDefaultContainerMemoryGB":30,"enableNotebookCommandMode":true,"disableS3TableImport":false,"deploymentMode":"production","useSpotForWorkers":true,"removePasswordInAccountSettings":false,"preferStartTerminatedCluster":false,"enableUserInviteWorkflow":true,"enableStaticNotebooks":true,"enableCssTransitions":true,"serverlessEnableElasticDisk":true,"minClusterTagKeyLength":1,"showHomepageFeaturedLinks":true,"pricingURL":"https://databricks.com/product/pricing","enableClusterAclsConfig":false,"useTempS3UrlForTableUpload":false,"notifyLastLogin":false,"enableSshKeyUIByTier":true,"enableCreateClusterOnAttach":true,"defaultAutomatedPricePerDBU":0.2,"enableNotebookGitVersioning":true,"defaultMinWorkers":2,"files":"files/","feedbackEmail":"support@databricks.com","enableDriverLogsUI":true,"defaultMaxWorkers":8,"enableWorkspaceAclsConfig":true,"dropzoneMaxFileSize":2047,"enableNewClustersList":true,"enableNewDashboardViews":true,"driverLog4jFilePrefix":"log4j","enableSingleSignOn":true,"enableMavenLibraries":true,"displayRowLimit":1000,
"deltaProcessingAsyncEnabled":true,"enableSparkEnvironmentVariablesUI":false,"defaultSparkVersion":{"key":"3.1.x-scala2.11","displayName":"3.1 (includes Apache Spark 2.2.0, Scala 2.11)","packageLabel":"spark-image-0e50da7a73d3bc76fcca8ac8a67c1e036c5a30deb5663adf6d0f332cc8ec2c90","upgradable":true,"deprecated":false,"customerVisible":true},"enableCustomSpotPricing":true,"enableMountAclsConfig":false,"defaultAutoterminationMin":180,"useDevTierHomePage":false,"enableClusterClone":true,"enableNotebookLineNumbers":true,"enablePublishHub":false,"notebookHubUrl":"http://hub.dev.databricks.com/","showSqlEndpoints":true,"enableNotebookDatasetInfoView":true,"enableClusterAclsByTier":true,"databricksDocsBaseUrl":"https://docs.databricks.com/","azurePortalLink":"https://portal.azure.com","cloud":"AWS","disallowAddingAdmins":false,"enableSparkConfUI":true,"featureTier":"UNKNOWN_TIER","mavenCentralSearchEndpoint":"http://search.maven.org/solrsearch/select","enableOrgSwitcherUI":false,"bitbucketCloudBaseApiV2Url":"https://api.bitbucket.org/2.0","clustersLimit":-1,"enableJdbcImport":true,"enableElasticDisk":true,"logfiles":"logfiles/","enableRelativeNotebookLinks":true,"enableMultiSelect":true,"enableWebappSharding":false,"enableClusterDeltaUpdates":true,"enableSingleSignOnLogin":false,"separateTableForJobClusters":true,"ebsVolumeSizeLimitGB":{"GENERAL_PURPOSE_SSD":[100,4096],"THROUGHPUT_OPTIMIZED_HDD":[500,4096]},"enableMountAcls":false,"requireEmailUserName":true,"dbcFeedbackURL":"http://feedback.databricks.com/forums/263785-product-feedback","enableMountAclService":true,"enableStructuredDataAcls":false,"serverlessClustersByDefault":false,"enableWorkspaceAcls":true,"maxClusterTagKeyLength":127,"gitHash":"3238d16bbc9a3a863776a0b907c2d56d80b5b8f6-dirty","showWorkspaceFeaturedLinks":true,"signupUrl":"https://databricks.com/try-databricks","serverlessAttachEbsVolumesByDefault":false,"enableTokensConfig":false,"allowFeedbackForumAccess":true,"enableImportFromUrl":true,"enableTokens":t
rue,"enableMiniClusters":false,"enableNewJobList":true,"enableDebugUI":false,"enableStreamingMetricsDashboard":true,"allowNonAdminUsers":true,"enableSingleSignOnByTier":true,"enableJobsRetryOnTimeout":true,"useStandardTierUpgradeTooltips":true,"staticNotebookResourceUrl":"https://databricks-prod-cloudfront.cloud.databricks.com/static/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855/","enableSpotClusterType":true,"enableSparkPackages":true,"dynamicSparkVersions":true,"enableClusterTagsUIByTier":true,"enableNotebookHistoryUI":true,"enableClusterLoggingUI":true,"enableDatabaseDropdownInTableUI":true,"showDebugCounters":false,"enableInstanceProfilesUI":true,"enableFolderHtmlExport":true,"homepageFeaturedLinks":[{"linkURI":"https://docs.databricks.com/_static/notebooks/gentle-introduction-to-apache-spark.html","displayName":"Introduction to Apache Spark on Databricks","icon":"img/home/Python_icon.svg"},{"linkURI":"https://docs.databricks.com/_static/notebooks/databricks-for-data-scientists.html","displayName":"Databricks for Data Scientists","icon":"img/home/Scala_icon.svg"},{"linkURI":"https://docs.databricks.com/_static/notebooks/structured-streaming-python.html","displayName":"Introduction to Structured 
Streaming","icon":"img/home/Python_icon.svg"}],"enableClusterStart":true,"enableEBSVolumesUIByTier":true,"singleSignOnComingSoon":false,"removeSubCommandCodeWhenExport":true,"upgradeURL":"","maxAutoterminationMinutes":10000,"autoterminateClustersByDefault":true,"notebookLoadingBackground":"#fff","sshContainerForwardedPort":2200,"enableServerAutoComplete":true,"enableStaticHtmlImport":true,"enableInstanceProfilesByTier":true,"showForgotPasswordLink":true,"defaultMemoryPerContainerMB":28000,"enablePresenceUI":true,"minAutoterminationMinutes":10,"accounts":true,"useOnDemandClustersByDefault":false,"useFramedStaticNotebooks":false,"enableNewProgressReportUI":true,"enableAutoCreateUserUI":true,"defaultCoresPerContainer":4,"showTerminationReason":true,"enableNewClustersGet":true,"showPricePerDBU":false,"showSqlProxyUI":true,"enableNotebookErrorHighlighting":true};</script>
<script>var __DATABRICKS_NOTEBOOK_MODEL = {"version":"NotebookV1","origId":1102511,"name":"Using SQL with a DataFrame","language":"scala","commands":[{"version":"CommandV1","origId":1561698,"guid":"7fd17285-8080-4cbd-b09c-40fdd7d49127","subtype":"command","commandType":"auto","position":1.078125,"command":"%md\n<div style=\"text-align: center; line-height: 0; padding-top: 9px;\">\n <img src=\"https://cdn2.hubspot.net/hubfs/438089/docs/training/dblearning-banner.png\" alt=\"Databricks Learning\" width=\"555\" height=\"64\">\n</div>\n","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":0,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"b16ddf2b-bbe9-44b5-94f3-73a67f48c014"},{"version":"CommandV1","origId":1561699,"guid":"485719ff-9f48-43aa-8bea-417de0847cfa","subtype":"command","commandType":"auto","position":1.15625,"command":"%md\n# Notes to Brian\nMark \"Read:\" once read. Or \"Agreed:\" or \"Let's Discuss:\" or any other tag.\n\n## Comments on notebook\n* We are going to have a major problem hyperlinking between notebooks in a portable way.\n * **(Doug)** This is a real challenge we'll have to address.\n * See https://databricks.atlassian.net/browse/FEATURE-1590\n * The workaround, for now, is to simply tell people where to find the next notebook or challenge exercise.\n* Updated the format for Prerequisites\n* (Rejected) Search for `val ssaNamesDF = 0 // <<FILL_IN>>` and decide if it's acceptable. 
It makes it possible to run_all.\n * (Brian) Why is \"run all\" a priority? You can always run the instructor notebook, which has only the answers. I don't see the value to \"run all\" in the student notebook. Is it because we want \"run all\" in the master notebook? Personally, I don't see that as a high priority.\n * **(Doug)** I agree. If we make it easy to compile and upload notebooks (that script we discussed) then I'm game for not making run-all a priority. It is nice though.\n* (Rejected) This notebook can now be detached and reattached and run_all to make sure all tests pass. But at the cost listed in the bullet above.\n * (Brian) Again, that can just be done with the instructor or answers notebook, too.\n * **(Doug)** Agreed. As long as we have nice tooling to ensure everything runs, I'm happy.\n* (Agreed) Notice the added learning objective `%sql`. Does it make sense to include in this notebook?\n* (Agreed) Notice the new Review section at the end that recaps the main take aways from the learning objectives.\n* If you can think of any further reading to add at the end (perhaps links to API doc and Databricks guide) let's prototype how that section should look.\n * **(Doug)** Can you think of any further reading for your lesson?\n * **(Brian)** Yes. Working on it.\n* For videos, let's use `// VIDEO \"URL\"`, and then we have flexability for how to mark it up later when they add `%html`. (Actually instead of `%html` we'll get `%md-unsantized`.)\n * (Brian) That requires tooling updates to remove the `//VIDEO \"URL\"` comment for now.\n * **(Doug)** Fair point. Which will take longer: editing a bunch of notebooks when we have a style change or editing the tool?\n * **(Doug)** You've expressed desire to redo the tool in Scala. Would that take a week? 
We can probably do it after our HP delivery.\n* (Agreed) At some point we'll need to decide how to do \"verbose\" version for eLearning and \"concise\" version for the teacher delivering instructor lead training.\n * (Brian) This comment needs to be recorded some place more permanent.\n * **(Doug)** Agreed. Want to create a doc somewhere in drive under /curriculum? If you create the doc then you'll know where to find it and can own it.\n* (Agreed) For challenge exercises, we'll include them as links at the end. Either in a \"Challenges\" section or in the \"Further Reading\" section. Challenges in such a case should actually be as real world related as possible.\n * (Brian) Maybe a better approach is to include it in the \"Further reading\" section, as a link to a separate \"Challenge\" notebook.\n * **(Doug)** I like that solution.\n*(Rejected) It would be nice if there was a way to add a table of contents to the top. This doesn't seem possible.\n * (Brian) `%html` would be nice, too. As I told Jacob today, though, I think content is a higher priority than slick presentation-related additions. It's theoretically possible to modify the build tool to generate a TOC, _but_ URL fragments (\\#header, for instance) don't seem to work very well in the notebook environment. (I tried it).\n * **(Doug)** Agreed, we're going to have to let go of this request, at least for now.\n* (Rejected) Notice at the end of the challenge Doug prototyped a different method of doing tests, using `str(...)=\"...\"` thinking it might be easier for less knowlegable people to understand. It may not actually be any better. Doug is curious what is Brian's opinion.\n * (Brian) My view of the asserts is that they're opaque. Students don't need to understand them. They just need to trust that they accurately test the solution. 
And using a `toString` approach worries me, as the string representation (of a Row, or any other data structure) can change, from release to release, without invalidating any working code. Testing a `toString` representation makes the notebooks more fragile.\n * **(Doug)** Agreed.\n* Made the tag at the top \"STANDARD_HEADER\" since that name is more generic. However it breaks the tool. So we need to decide if it really is better and worth updating the tool.\n * (Brian) I don't like `STANDARD_HEADER`. That sounds, to me, too much like, \"Oh, Databricks has a standard header?\" I prefer having the word \"training\" in there somewhere, hence `TRAINING_HEADING`\n * **(Doug)** What about when a MOOC or 3rd party university wants to include a header?\n * **(Doug)** I don't feel strongly either way. Just asking the question. You're right `STANDARD` isn't the right word. Maybe `MODULE_HEADER` or `NOTEBOOK_HEADER`?\n * **(Brian)** I prefer \"heading\" to \"header\", but a more generic term is probably better. `NOTEBOOK_HEADER` works for me. I can change the build tooling easily enough, once we agree on the term we want to use.\n* (Agreed) It would be nice to add a copyright notice to the NOTEBOOK_HEADER too. (C) Databricks, Inc 2017, all rights reserved.\n * (Brian) I'm fine with the copyright.\n \n## Comments on the video\n* We can ask Val to provide a standard \"title sequence\" at the start that labels it Databricks. I think we do that for Summit videos and Blog videos.\n * (Brian) That's fine.\n* Do we want in the video to show people where to find the notebook? Probably doesn't make sense because they're already in the notebook if watching the video. Although they could in theory be watching the video from a webpage too.\n * (Brian) If we host the embed the video _outside_ the notebook, the URL should be in that page. That's how the MOOCs worked.\n * **(Doug)** Agreed.\n* When recording video, set the computer screen resolution to 1024x640. 
Not everyone has a high res monitor (india, etc) and it can be hard to see even if full-screen. Full screen is possible once they add `%md-unsantized`. We could alternatively include a hyperlink for people to watch the video in another tab instead of embedding it. I also noticed that even when I make the window bigger, it doesn't actually make the video bigger.\n * (Brian) This should be recorded somewhere.\n * **(Doug)** Agreed. As style owner, can you create a style guide?\n* Apparently Vimeo has features for tracking users that Youtube doesn't have. Let's plan to check that out during phase 2 of the project.\n* **(Doug)** In the video and lesson we make reference to the DataFrames api before the sql API. We should discuss which we want to teach first... SQL or Dataframe query API.\n* **(Doug)** And this raises a really important point... we need to agree and document really clearly what can be assumed going into a lesson if we're going to create these lessons in parallel.\n* **(Doug)** In the video we also show how to createTable and why it is bad. Let's discuss as a style issue how much tangental stuff we want to include. I'm on the fence.\n\n## Comments on notebook style\n* Style guide (Brian owns style, Doug is just making suggestions that Brian can reject.)\n * For markdown cells use `--` for comments. [Looks better than `##` and `//` (`##` is a markdown macro).] **(Brian) This is inconsistent with most of our other notebooks. Personally, I find them _both_ ugly, and I have no real preference.**\n * For `// TODO` include a space between // and TODO\n * Have `%scala` at the top of scala only cells, even if it's the notebook type.\n * Put `<<FILL_IN>>` in comments to keep the notebook runnnable. **(Brian) I'm not sure how practical this is. 
`x = # <<FILL_IN>>` is still not runnable.**\n * Alternate `%scala` and `%python` cells, so that each cell has 2 versions side by side.\n * Format code in `%md` with `code`\n * For asserts in Scala, prefer toList to toSeq as it's easier for lay people to understand. Performance difference is minor. Arrays exist under the hood either way once you do a collect() and an explicitly defined Seq.\n* Later, for contractors, we can provide a sample lesson notebook that teaches the style best practices. For now let's just keep a doc somewhere that is our style guide.","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106947478,"finishTime":0,"collapsed":true,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"32a21682-da64-4700-a825-f1b00882170a"},{"version":"CommandV1","origId":1561700,"guid":"57a41419-0045-4aee-a4be-e203e1170e25","subtype":"command","commandType":"auto","position":1.3125,"command":"%md\n# Using SQL with a DataFrame\n\nApache Spark allows you to use SQL or the programmatic DataFrame API to construct your queries.\n\n## In this lesson you:\n* Query an existing DataFrame using Spark SQL.\n* Use the Databricks %sql feature.\n\n## Audience\n* General audiences, including Analysts, Engineers, and Data Scientists\n\n## Prerequisites\n* Lesson: Introduction to Scala or Python [link] (Let's point an online tutorial instead.)\n* Lesson: Introduction to Dataframes [link]\n* Concept: The Databricks built-in `display(...)` function [link]\n* Concept: The Spark `dataframe.printSchema()` 
function [link]","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106947557,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"81464194-e1e2-4f66-b1e9-1f821e762790"},{"version":"CommandV1","origId":1561701,"guid":"5143ddaa-6e53-4072-9993-aa09dc4957fb","subtype":"command","commandType":"auto","position":1.375,"command":"%md Please run the following cell by placing your cursor inside the cell and typing Ctrl-Enter or Shift-Enter. Doing so will display an embedded video. Watch the video and follow along in this notebook.","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106947621,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"1af790bd-ec43-4b6c-bd46-2316ffa52c54"},{"version":"CommandV1","origId":1561702,"guid":"11abcd6a-9833-43dc-8bbc-be1c20b4fb7b","subtype":"command","commandType":"auto","position":1.390625,"command":"%md\n## Lesson\nWatch the following video.<br/></br>\n\n<a 
href=\"https://www.youtube-nocookie.com/embed/836_VC_14rQ?rel=0\" target=\"video\" style=\"display: flex; flex-flow: column; justify-content: center; align-items: center; width: 432px; height: 243px; background: black; color: white; text-decoration: none; font-weight: bold; border-radius: 2px;\">\n <img src=\"https://cdn2.hubspot.net/hubfs/438089/docs/training/notebook-dblearning-thumbnail.png\" alt=\"Play Video\" width=\"290\" height=\"92\" style=\"margin-bottom: 8px;\">\n Using SQL with Dataframes\n</a>\n","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106947689,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"705189a2-af39-4e31-9421-f870b7848eee"},{"version":"CommandV1","origId":1561703,"guid":"8c0bf59f-6520-44f0-98cc-716ff9bb0ce5","subtype":"command","commandType":"auto","position":3.0,"command":"%scala\nval peopleDF = spark.read.parquet(\"dbfs:/mnt/training/dataframes/people-10m.parquet\")","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106947789,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a 
user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"25e506f3-ee95-4192-bac0-57c2905fdff8"},{"version":"CommandV1","origId":1561704,"guid":"181b3b8b-9361-460e-a6e5-15462217ab10","subtype":"command","commandType":"auto","position":3.5,"command":"","commandVersion":0,"state":"error","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":0,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"274f394d-c12b-444e-84bc-544200f55f00"},{"version":"CommandV1","origId":1561705,"guid":"a87bf1f8-1ee1-41d4-96cb-1d777d9ebf6b","subtype":"command","commandType":"auto","position":4.0,"command":"%python\npeople_df = spark.read.parquet(\"dbfs:/mnt/training/dataframes/people-10m.parquet\")","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106947832,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a 
user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"c522f7d5-82f5-4d49-9fb7-a9e0a17f6261"},{"version":"CommandV1","origId":1561706,"guid":"97087333-5921-4628-84d6-ce1f9804f2fe","subtype":"command","commandType":"auto","position":5.0,"command":"%md\nWe can easily query this DataFrame using the programmatic DataFrame API. For instance, the following query selects people who were born in 1990 and whose first names start with \"A\".","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106947884,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"73141d85-9716-4af8-8ad0-8da0883c4f6a"},{"version":"CommandV1","origId":1561707,"guid":"b9bb9b92-7bbd-4601-8c48-c27d1868a3b8","subtype":"command","commandType":"auto","position":6.0,"command":"%scala\nimport org.apache.spark.sql.functions._\n\ndisplay(\n peopleDF\n .select($\"firstName\", $\"lastName\", $\"birthDate\")\n .filter(year($\"birthDate\") === 1990)\n .filter($\"firstName\" like \"A%\")\n .orderBy(\"firstName\", 
\"lastName\")\n)","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498511482438,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"b362e03d-2a2a-4143-b35d-f49ba5ad45a8"},{"version":"CommandV1","origId":1561708,"guid":"4a175c43-ca59-40a0-88f0-b4583f73be28","subtype":"command","commandType":"auto","position":6.125,"command":"%python\nfrom pyspark.sql.functions import *\n\ndisplay(\n people_df\n .select(col(\"firstName\"), col(\"lastName\"), col(\"birthDate\"))\n .filter(year(col(\"birthDate\")) == 1990)\n .filter(col(\"firstName\").like(\"A%\"))\n .orderBy(\"firstName\", \"lastName\")\n)","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106947993,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"5d7468d1-786f-4565-a9ee-8dfd912fa19a"},{"version":"CommandV1","origId":1561709,"guid":"14ed1273-0e10-4b63-af68-805e3694fb93","subtype":"command","commandType":"auto","position":6.25,"command":"%md\nBut, what if we want to issue the same query in 
SQL?\n\nFirst, we need to give the DataFrame a table name. We _could_ simply save the DataFrame as a Hive table, using\n`df.write.saveAsTable(\"people\")`, but doing so incurs an output penalty: It writes the _entire_ DataFrame to the\ndistributed file system.\n\nWouldn't it be easier if we could just query the _in-memory_ DataFrame as if it were a table?\n\nWe can do exactly that, using `createOrReplaceTempView()`.","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106948031,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"0d6e0e10-5388-42cc-9cf6-53d7191b1dfd"},{"version":"CommandV1","origId":1561710,"guid":"19216457-bd5d-486e-b023-19f543c7331c","subtype":"command","commandType":"auto","position":6.5,"command":"%scala\npeopleDF.createOrReplaceTempView(\"people\")","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106948104,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a 
user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"378ab06d-648c-45d6-bcb7-9e3002551a1b"},{"version":"CommandV1","origId":1561711,"guid":"ebc57724-2c3b-4894-b979-a00c3417e01f","subtype":"command","commandType":"auto","position":6.75,"command":"%python\npeople_df.createOrReplaceTempView(\"people\")","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106948137,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"919e7230-bcbd-4403-b766-210fa36e071d"},{"version":"CommandV1","origId":1561712,"guid":"672b6808-913d-49ee-a9e1-b1c943bb9721","subtype":"command","commandType":"auto","position":6.875,"command":"%md\nIf we just want to display the output of our query, we can use the special Databricks `%sql` cell:","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106948171,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a 
user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"8940a799-2b9e-434d-a4e4-24f036e21465"},{"version":"CommandV1","origId":1561713,"guid":"4985372b-770a-49b4-aa7b-07763fc725a8","subtype":"command","commandType":"auto","position":7.0,"command":"%sql SELECT firstName, lastName, birthDate FROM people WHERE year(birthDate) = 1990 AND firstName LIKE 'A%' ORDER BY firstName, lastName\n-- ALL_NOTEBOOKS","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106948235,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"db2a001f-7b95-46b8-96bc-4218ba7660d3"},{"version":"CommandV1","origId":1561714,"guid":"52d398c7-1de1-4568-aac0-eb8285d49ec8","subtype":"command","commandType":"auto","position":8.0,"command":"%md\nThe temporary view is just a table alias, a _name_ that refers to the in-memory DataFrame. It is not backed by a real, on-disk table. 
The name lasts until the SparkSession goes away.\n\nWe can also save the results of our query in a _new_ DataFrame, in case we want to do additional queries against it.\n","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106948298,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"96fd5393-f4de-4a78-a427-18f763b55d72"},{"version":"CommandV1","origId":1561715,"guid":"8bac6c97-384f-4ed9-bb25-62c41c7c5043","subtype":"command","commandType":"auto","position":9.0,"command":"%scala\nval peopleDF2 = spark.sql(\"SELECT firstName, lastName, birthDate FROM people WHERE year(birthDate) = 1990 AND firstName LIKE 'A%' ORDER BY firstName, lastName\")","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106948361,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a 
user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"5ec09d4d-36ba-419c-90e7-dcdd6d70230b"},{"version":"CommandV1","origId":1561716,"guid":"d043f2f2-611d-49af-ad30-eb33d2b1be2a","subtype":"command","commandType":"auto","position":10.0,"command":"%scala\ndisplay(\n peopleDF2.filter($\"lastName\" like \"Z%\")\n)","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106948411,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"1cc2c1a7-30cc-495b-abd2-79e3178b3a51"},{"version":"CommandV1","origId":1561717,"guid":"a7cbf3f9-ac12-4822-854b-49b36027eff0","subtype":"command","commandType":"auto","position":11.0,"command":"%python\npeople_df2 = spark.sql(\"SELECT firstName, lastName, birthDate FROM people WHERE year(birthDate) = 1990 AND firstName LIKE 'A%' ORDER BY firstName, lastName\")","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106948444,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a 
user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"f3615ce2-3820-4aae-a99f-b2ac738b4a28"},{"version":"CommandV1","origId":1561718,"guid":"2531808d-a266-4584-bcb8-d772a706c836","subtype":"command","commandType":"auto","position":12.0,"command":"%python\ndisplay(\n people_df2.filter(col(\"lastName\").like(\"Z%\"))\n)","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106948485,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"ef3904cb-f846-4835-ac16-4a0daa468908"},{"version":"CommandV1","origId":1561719,"guid":"75316432-57c9-43fb-a675-f0bf26083761","subtype":"command","commandType":"auto","position":14.5,"command":"%md\n<a href=\"foo\"></a>\n## Exercise 1\n\nGiven the following DataFrame, use SQL to find all zip codes for Santa Monica, California. Store the result in a DataFrame called `smzips`. 
The DataFrame should contain only the `zipCode` column.","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106948517,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"d092bc23-ad1f-4047-be71-716b4e8f129b"},{"version":"CommandV1","origId":1561720,"guid":"ad615478-cbb6-4cb1-a3a7-851ac64baa81","subtype":"command","commandType":"auto","position":15.0,"command":"%scala\nval zips = spark.read.json(\"dbfs:/mnt/training/zips.json\").withColumnRenamed(\"_id\", \"zipCode\")","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106948575,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"63dbc66e-cbf7-4107-a187-8afcd1b90b58"},{"version":"CommandV1","origId":1561721,"guid":"08a2fc1b-b7bd-4634-8326-55a41790b943","subtype":"command","commandType":"auto","position":15.25,"command":"%python\nzips = spark.read.json(\"dbfs:/mnt/training/zips.json\").withColumnRenamed(\"_id\", 
\"zipCode\")","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106948651,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"944328c6-2d9c-458d-a944-993ef566bda4"},{"version":"CommandV1","origId":1561722,"guid":"283b7c1e-0fe5-4823-95fd-45545a64b906","subtype":"command","commandType":"auto","position":15.5,"command":"%md\n-- SCALA_ONLY\n\n**Hints**\n\n* You'll need to examine the data first.\n* You'll need to use some of the functions in [org.apache.spark.sql.functions](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$)","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106948685,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"e944c92d-833f-4c38-8ad8-a25e170c8afb"},{"version":"CommandV1","origId":1561723,"guid":"bc8d9318-7d20-4905-b1f6-6a2862dbcf7b","subtype":"command","commandType":"auto","position":15.75,"command":"%md\n-- PYTHON_ONLY\n\n\n**Hints**\n\n* You'll need to examine the data 
first.\n* You'll need to use some of the functions in [pyspark.sql.functions](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions)","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106948756,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"a0dbe873-6cc2-4a92-8506-b33bec13ee75"},{"version":"CommandV1","origId":1561724,"guid":"ebd0c959-90ae-49a1-94a3-bd18af673840","subtype":"command","commandType":"auto","position":16.375,"command":"%scala\n// TODO\n//\n// Replace <<FILL_IN>> with your code.\nzips.<<FILL_IN>>\nval smzips = <<FILL_IN>>","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106948831,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"79993e1d-af8e-482e-aa3c-dc58beaa4cc5"},{"version":"CommandV1","origId":1561725,"guid":"907c37c7-e21e-404c-9537-e054212fd467","subtype":"command","commandType":"auto","position":16.6875,"command":"%python\n# TODO\n#\n# Replace <<FILL_IN>> with your 
code.\nzips.<<FILL_IN>>\nsmzips = <<FILL_IN>>","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106948878,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"d26870d3-1e47-4ab8-864c-085cc1195859"},{"version":"CommandV1","origId":1561726,"guid":"cd484144-9aea-4cc5-8c11-184c6f74f983","subtype":"command","commandType":"auto","position":17.0,"command":"%scala\n// ANSWER\nzips.createOrReplaceTempView(\"zips\")\nval smzips = spark.sql(\"\"\"SELECT zipCode FROM zips WHERE state = 'CA' and city = 'SANTA MONICA'\"\"\")","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106948922,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"c61c697d-52bd-49a3-8839-b6d5be249f25"},{"version":"CommandV1","origId":1561727,"guid":"3fcca3be-fe40-4b81-be6e-889aeeedb553","subtype":"command","commandType":"auto","position":17.25,"command":"%python\n# ANSWER\nzips.createOrReplaceTempView(\"zips\")\nsmzips = spark.sql(\"\"\"SELECT zipCode FROM zips WHERE state = 
'CA' and city = 'SANTA MONICA'\"\"\")","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106948955,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"7efb8dbe-94e0-4d2b-ae06-9c8f464da81c"},{"version":"CommandV1","origId":1561728,"guid":"2b3f0381-1fee-4fab-a00c-e13bff3d793c","subtype":"command","commandType":"auto","position":17.5,"command":"%md\nRun the following cell to test your answer.","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106948987,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"22b17fab-19c6-4147-8902-6b498dbba91b"},{"version":"CommandV1","origId":1561729,"guid":"6bfc054b-f7aa-42a6-a704-e808eb3365f3","subtype":"command","commandType":"auto","position":18.0,"command":"%scala\n// TEST\n// Run this test to verify you are on the right track.\nassert(smzips.as[String].collect().toSet == Set(\"90401\", \"90402\", \"90403\", \"90404\", 
\"90405\"))","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106949048,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"d34fc742-e0c8-41d5-bced-096ddf8ed9b4"},{"version":"CommandV1","origId":1561730,"guid":"d997202e-1579-4b19-8d96-68efd9fec019","subtype":"command","commandType":"auto","position":19.0,"command":"%python\n# TEST\n# Run this test to verify you are on the right track.\n\nassert(set([row.zipCode for row in smzips.collect()]) == {u\"90401\", u\"90402\", u\"90403\", u\"90404\", u\"90405\"})","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106949094,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"354be1fc-7df2-4e72-8fc6-6ac72d15b8d6"},{"version":"CommandV1","origId":1561731,"guid":"ea933d2e-cc4f-4fdc-b44a-88cee0d74a52","subtype":"command","commandType":"auto","position":19.5,"command":"%md\n## Exercise 2\n\nLoad the following file into a DataFrame. The file consists of data from the U.S. 
Social Security Administration. Each row contains:\n\n* a first name\n* a birth year\n* a count of the number of babies born in that year with that first name\n\nOnce you've loaded it into a DataFrame, use SQL (not the DataFrame API) to find the 5 most popular girls' first names in 1885.","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106949135,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"72374a69-7c0c-46c3-ab72-ef4e9d369662"},{"version":"CommandV1","origId":1561732,"guid":"b999bff2-f19b-4da4-9e73-3b7f504e0034","subtype":"command","commandType":"auto","position":19.5625,"command":"%scala\n// TODO\nval ssaNamesFile = \"dbfs:/mnt/training/ssn/names.parquet\"\nval ssaNamesDF = 0 // <<FILL_IN>>","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106949272,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a 
user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"aa50cc17-498c-4cda-897f-176f7180e69e"},{"version":"CommandV1","origId":1561733,"guid":"e91ea234-2eac-4977-86a1-ae135e2fb544","subtype":"command","commandType":"auto","position":19.5634765625,"command":"%python\n# TODO\nssaNamesFile = \"dbfs:/mnt/training/ssn/names.parquet\"\nssaNamesDF = 0 # <<FILL_IN>>","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106949318,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"a5300039-193a-4dcb-b2fb-efb970fbeaf2"},{"version":"CommandV1","origId":1561734,"guid":"b8f0bc93-2e81-44a7-934a-11187996a690","subtype":"command","commandType":"auto","position":19.56640625,"command":"%scala\n// ANSWER\nval ssaNamesFile = \"dbfs:/mnt/training/ssn/names.parquet\"\nval ssaNamesDF = spark.read.parquet(ssaNamesFile)","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106949362,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a 
user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"f9e3cc19-1ef5-418f-ace8-d4a726398f0c"},{"version":"CommandV1","origId":1561735,"guid":"bfe84bd8-3514-4140-9ac9-15cc5b0abc04","subtype":"command","commandType":"auto","position":19.56787109375,"command":"%python\n# ANSWER\nssaNamesFile = \"dbfs:/mnt/training/ssn/names.parquet\"\nssaNamesDF = spark.read.parquet(ssaNamesFile)","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106949394,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"33dc6a87-7093-4c01-a596-9290ee23f067"},{"version":"CommandV1","origId":1561736,"guid":"454aa3a6-05e9-498d-ab6f-6bb4a93b5fd6","subtype":"command","commandType":"auto","position":19.572265625,"command":"%scala\n// TEST\n// Run this test to verify you are on the right track.\nassert(\n ssaNamesDF.as[(String, String, Int, Int)].limit(5).collect.toList ==\n List((\"Jennifer\",\"F\",54336,1983), (\"Jessica\",\"F\",45278,1983), (\"Amanda\",\"F\",33752,1983), (\"Ashley\",\"F\",33292,1983), 
(\"Sarah\",\"F\",27228,1983))\n)","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106949425,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"75371cb8-2759-4b83-a11e-131bb92e0e07"},{"version":"CommandV1","origId":1561737,"guid":"bdf9cf42-67fd-4051-a057-0355b370e2c5","subtype":"command","commandType":"auto","position":19.59375,"command":"%python\n# TEST\n# Run this test to verify you are on the right track.\nassert(\n [tuple(row) for row in ssaNamesDF.limit(5).collect()] == \n [(u'Jennifer', u'F', 54336, 1983), (u'Jessica', u'F', 45278, 1983), (u'Amanda', u'F', 33752, 1983), (u'Ashley', u'F', 33292, 1983), (u'Sarah', u'F', 27228, 1983)]\n)","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106949471,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a 
user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"b18f49f4-70c9-4185-966d-73f03bc60b1b"},{"version":"CommandV1","origId":1561738,"guid":"cbf9e9e3-25c7-4e31-8eb6-1b274f23cf61","subtype":"command","commandType":"auto","position":19.6015625,"command":"%md\n**Step 2a:** Use the built-in Databricks `display(...)` function to view and explore `ssaNamesDF`. ","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106949513,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"e8d761b2-0afe-403a-a16f-2ddc1af56498"},{"version":"CommandV1","origId":1561739,"guid":"ec589939-0d40-44c7-b709-1e86471f6654","subtype":"command","commandType":"auto","position":19.603515625,"command":"%scala\n// TODO","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106949587,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a 
user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"e7e45f85-6cca-494f-a8b3-4464a2f918d2"},{"version":"CommandV1","origId":1561740,"guid":"52d53e21-a51f-4d9e-a9a9-3b9c16822f72","subtype":"command","commandType":"auto","position":19.60400390625,"command":"%python\n# TODO","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106949629,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"754e1a9c-6c26-4588-8c4c-4c95d2fc9ea1"},{"version":"CommandV1","origId":1561741,"guid":"1600ac95-f224-40e2-9898-6b9f8888197f","subtype":"command","commandType":"auto","position":19.60546875,"command":"%scala\n// ANSWER\ndisplay(ssaNamesDF)","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106949680,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a 
user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"4e835e70-e015-447a-88ee-e874c8b5a21f"},{"version":"CommandV1","origId":1561742,"guid":"0ebed235-9a81-4d0c-b5b2-1945be3f5818","subtype":"command","commandType":"auto","position":19.607421875,"command":"%python\n# ANSWER\ndisplay(ssaNamesDF)","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106949736,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"291f0b44-de5d-4313-9920-a24f6b8fc9c7"},{"version":"CommandV1","origId":1561743,"guid":"07995cd8-f72c-420f-823a-2e9b8e09b820","subtype":"command","commandType":"auto","position":19.60888671875,"command":"%md \n**Step 2b:** Print the schema of `ssaNamesDF`. 
\n*Hint:* Since this is our first time viewing a schema, try: `ssaNamesDF.printSchema()`","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106949791,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"62c8fad8-a09f-41ea-8655-9e773439ce8e"},{"version":"CommandV1","origId":1561744,"guid":"af531590-3c02-4b19-851e-c3bca63c69e3","subtype":"command","commandType":"auto","position":19.60894775390625,"command":"%scala\n// TODO","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106949921,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"931a868f-1d3f-49f2-b66c-86004a4a1828"},{"version":"CommandV1","origId":1561745,"guid":"87862682-2f4a-4e2a-a9ad-f5bd54fb576b","subtype":"command","commandType":"auto","position":19.608978271484375,"command":"%python\n# 
TODO","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106949969,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"fd0bffd6-54b4-4795-b705-9b6288c1f43a"},{"version":"CommandV1","origId":1561746,"guid":"ba15e192-313f-421d-b15a-e2cd175cf8da","subtype":"command","commandType":"auto","position":19.6090087890625,"command":"%scala\n// ANSWER\nssaNamesDF.printSchema","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106950008,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"5756ba0f-99db-4767-b3fc-ca3e4057f968"},{"version":"CommandV1","origId":1561747,"guid":"1668c9a0-abb3-4968-802c-4f6c5e08a73d","subtype":"command","commandType":"auto","position":19.609130859375,"command":"%python\n# 
ANSWER\nssaNamesDF.printSchema()","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106950045,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"72e14529-366d-49de-95ad-87ce91ad6d98"},{"version":"CommandV1","origId":1561748,"guid":"8c76fd82-85c6-4728-9557-2f1b32fb451d","subtype":"command","commandType":"auto","position":19.609375,"command":"%md \n**Step 3:** Create a temp view named `ssa_names` so you can query the DataFrame using SQL.","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106950077,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"f8db21a3-56a4-4354-8636-c2bff0d65274"},{"version":"CommandV1","origId":1561749,"guid":"278cd0cd-1a92-4127-9ef4-a1bd63575994","subtype":"command","commandType":"auto","position":19.61328125,"command":"%scala\n// 
TODO","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106950137,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"a5edbcc8-bdae-4295-95c0-9dd2bb6b4b97"},{"version":"CommandV1","origId":1561750,"guid":"36ef7194-0621-480a-b64d-f54776664120","subtype":"command","commandType":"auto","position":19.61376953125,"command":"%python\n# TODO","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106950314,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"20c3bf46-4cac-4c01-b797-c73e77c9a3c8"},{"version":"CommandV1","origId":1561751,"guid":"1e4d9056-e146-4597-82f6-7676822a2d55","subtype":"command","commandType":"auto","position":19.615234375,"command":"%scala\n// 
ANSWER\nssaNamesDF.createOrReplaceTempView(\"ssa_names\")","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106950348,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"d48ebcbf-6152-4fab-bd59-6e9828d38597"},{"version":"CommandV1","origId":1561752,"guid":"e93cd0e5-a653-425f-9f5d-43a8219250e9","subtype":"command","commandType":"auto","position":19.61572265625,"command":"%python\n# ANSWER\nssaNamesDF.createOrReplaceTempView(\"ssa_names\")","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106950389,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"a2435168-fbe0-440a-9c61-0a553ed615f7"},{"version":"CommandV1","origId":1561753,"guid":"5cb30c62-49dc-48ba-88f3-a308ecbf33f1","subtype":"command","commandType":"auto","position":19.6171875,"command":"%scala\n// TEST\n// Run this test to verify you are on the right track.\nassert(\n spark.sql(\"SELECT * FROM ssa_names LIMIT 5\").as[(String, String, Int, Int)].collect.toList ==\n 
List((\"Jennifer\",\"F\",54336,1983), (\"Jessica\",\"F\",45278,1983), (\"Amanda\",\"F\",33752,1983), (\"Ashley\",\"F\",33292,1983), (\"Sarah\",\"F\",27228,1983))\n)","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106950426,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"58dd7c9c-7287-4453-a6ce-f6100e86236f"},{"version":"CommandV1","origId":1561754,"guid":"a65b6b82-dace-4b97-aed9-0d0d564a04a4","subtype":"command","commandType":"auto","position":19.618408203125,"command":"%python\n# TEST\n# Run this test to verify you are on the right track.\nassert(\n [tuple(row) for row in spark.sql(\"SELECT * FROM ssa_names LIMIT 5\").collect()] ==\n [(u'Jennifer', u'F', 54336, 1983), (u'Jessica', u'F', 45278, 1983), (u'Amanda', u'F', 33752, 1983), (u'Ashley', u'F', 33292, 1983), (u'Sarah', u'F', 27228, 1983)]\n)","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106950460,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a 
user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"972f4f44-2a4c-4e00-a178-da2df1fb2920"},{"version":"CommandV1","origId":1561755,"guid":"78f79964-b8ec-48c2-b2d9-aaa65762b733","subtype":"command","commandType":"auto","position":19.6220703125,"command":"%md\n**Step 4a:** Use SQL to find the 5 most popular girl's first names in 1990, and save the results as a DataFrame named `popularDF`.","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106950501,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"52de124e-e5f7-4e6c-9774-19a765970ff3"},{"version":"CommandV1","origId":1561756,"guid":"356f5aaa-5537-41f6-a5f8-af0a96e333e0","subtype":"command","commandType":"auto","position":19.75,"command":"%scala\n// TODO\nval popularDF = 0 // <<FILL_IN>>","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106950561,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a 
user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"024a7767-8869-4d42-9901-aecd330afec4"},{"version":"CommandV1","origId":1561757,"guid":"3836f143-1434-44dc-8d6d-d7dc197b2375","subtype":"command","commandType":"auto","position":19.875,"command":"%python\n# TODO\npopularDF = 0 # <<FILL_IN>>","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106950607,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"6765195e-1b33-4a16-b90b-334cba01d055"},{"version":"CommandV1","origId":1561758,"guid":"07cd2538-40bc-4ae5-8b19-449887ffdfc6","subtype":"command","commandType":"auto","position":23.5,"command":"%scala\n// ANSWER\nval sql = \"SELECT firstName, total FROM ssa_names WHERE year = 1990 AND gender = 'F' ORDER BY total DESC LIMIT 5\"\nval popularDF = spark.sql(sql)\n","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106950638,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a 
user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"df1eaf2e-63c6-4809-86ba-4609abfa194d"},{"version":"CommandV1","origId":1561759,"guid":"d4fa4e67-b772-459c-b5cb-227a76bd73bc","subtype":"command","commandType":"auto","position":24.5,"command":"%python\n# ANSWER\nsql = \"SELECT firstName, total FROM ssa_names WHERE year = 1990 AND gender = 'F' ORDER BY total DESC LIMIT 5\"\npopularDF = spark.sql(sql)","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106950685,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"7207d623-af62-4cc8-bd52-65a522405afa"},{"version":"CommandV1","origId":1561760,"guid":"2e3eb47c-3406-4bb3-8258-5a8a95315ba6","subtype":"command","commandType":"auto","position":25.5,"command":"%scala\n// Run this test case to verify your answer to the exercise above is correct.\nassert(\n popularDF.as[(String, Int)].collect().toSet ==\n Set((\"Jessica\", 46466), (\"Samantha\", 25864), (\"Ashley\", 45549), (\"Brittany\", 36535), (\"Amanda\", 
34406))\n)","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106950742,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"3551e439-dc8c-48fa-b6e0-5a77810aeddc"},{"version":"CommandV1","origId":1561761,"guid":"3e67c2e2-3f6e-4886-bc7d-62fc8a0723c1","subtype":"command","commandType":"auto","position":26.5,"command":"%python\n# Run this test case to verify your answer to the exercise above is correct.\nassert(\n set([(row.firstName, row.total) for row in popularDF.collect()]) == \n set([(\"Jessica\", 46466), (\"Samantha\", 25864), (\"Ashley\", 45549), (\"Brittany\", 36535), (\"Amanda\", 34406)])\n)","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106950786,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"ca16dc7e-20f3-4649-8d93-380b6801914b"},{"version":"CommandV1","origId":1561762,"guid":"f23ac366-06f6-41b0-948e-7d860534bce7","subtype":"command","commandType":"auto","position":27.0,"command":"%md\n**Step 4b:** Use Databricks' 
`display(...)` function to view the contents of `popularDF`.","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106950830,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"28a0a7c6-22b7-49d2-9d8d-8aa7167583f6"},{"version":"CommandV1","origId":1561763,"guid":"2bfeec6e-b1d7-4639-b770-f7abaaaf8f0b","subtype":"command","commandType":"auto","position":27.0625,"command":"%scala\n// TODO","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106950898,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"9186425b-d98a-4c55-a34c-9eb8e8eca730"},{"version":"CommandV1","origId":1561764,"guid":"80749463-077d-4eee-86bd-22cc7e98dd6a","subtype":"command","commandType":"auto","position":27.125,"command":"%python\n# 
TODO","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106950937,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"0612fa67-55cf-4766-af4e-6107372e2acb"},{"version":"CommandV1","origId":1561765,"guid":"ef9ea595-35c3-4492-aeef-81e8e94e463e","subtype":"command","commandType":"auto","position":27.25,"command":"%scala\n// ANSWER\ndisplay(popularDF)","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106950981,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"d1a631f0-9fbc-4c07-a533-820c9013fe18"},{"version":"CommandV1","origId":1561766,"guid":"207044aa-91ad-487e-9959-019129dae246","subtype":"command","commandType":"auto","position":27.375,"command":"%python\n# 
ANSWER\ndisplay(popularDF)","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106951041,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"b103a43c-55dd-4640-9852-95c997ba92fe"},{"version":"CommandV1","origId":1561767,"guid":"a16d697b-a261-4b8d-8366-f518d608d528","subtype":"command","commandType":"auto","position":27.40625,"command":"%md\n<a name=\"Exercise2.5\"></a>\n**Step 5:** Run the same SQL statement as above, but this time use the `%sql` cell type in Databricks.","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106951077,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"acc0b80b-ca9a-48c8-b475-62cceaca3694"},{"version":"CommandV1","origId":1561768,"guid":"6bf49c80-7744-4cf8-8ac7-f9291efd38ec","subtype":"command","commandType":"auto","position":27.421875,"command":"%sql\n-- ALL_NOTEBOOKS\n-- 
TODO","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106951146,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"37c356ac-75ff-4ef5-a9b2-f795c93570a5"},{"version":"CommandV1","origId":1561769,"guid":"2937e10d-f574-45d2-b7b5-6560c563094b","subtype":"command","commandType":"auto","position":27.4296875,"command":"%sql\n-- ALL_NOTEBOOKS\n-- ANSWER\nSELECT firstName, total FROM ssa_names WHERE year = 1990 AND gender = 'F' ORDER BY total DESC LIMIT 5","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106951183,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"afedd1eb-3211-485f-a973-d9030fa9492b"},{"version":"CommandV1","origId":1561770,"guid":"810a84f3-d261-4467-9bd6-feffa183f6f0","subtype":"command","commandType":"auto","position":27.4375,"command":"%md\n## Review\n**Q:** How do you make a DataFrame accessible via SQL? \n**A:** `myDataFrame.createOrReplaceTempView(\"my_table\")`\n\n**Q:** How do you execute SQL within Python or Scala code? 
\n**A:** `spark.sql(\"SELECT * FROM my_table\")`\n\n**Q:** What's an easy way to execute SQL from within the Databricks environment? \n**A:** `%sql SELECT * FROM my_table`","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106951263,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"61188a7d-8817-4dd9-a54c-f13c5708e204"},{"version":"CommandV1","origId":1561771,"guid":"d160a58b-6851-4d4d-a2c9-dba9315ea90d","subtype":"command","commandType":"auto","position":27.5,"command":"%md\n## Next Steps\n* Link to Next Lesson (if multiple courses share a notebook, we'd have multiple choices here)\n* Link to Challenge Exercise","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106951323,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"54439d41-5c7b-4cc3-9ae6-147e19f47906"},{"version":"CommandV1","origId":1561772,"guid":"276e2fa7-df4c-4dc3-ac37-d19bf8828e6b","subtype":"command","commandType":"auto","position":28.5,"command":"%md \n## Additional 
Resources\n* Q: Where can I read more about temp views... \n A: Link...","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106951382,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"b9badfc7-fb94-43f3-bbfa-c19216efcf7c"},{"version":"CommandV1","origId":1561773,"guid":"e1f36def-1a88-4d87-963f-4e2ce89d7cbf","subtype":"command","commandType":"auto","position":29.5,"command":"%md\n## Challenge 1\n\nRepeat exercise 2, but this time find the most popular girl's first name in 1885, 1915, 1945, 1975, and 2005 using a single query, sorted by ascending year. Store the result in `historicNamesDF` and also display it. You may need to introduce additional variables along the way. No hints will be provided. 
If you recently completed exercise 2 above, you are free to assume `ssa_names` is already registered as a SQL view.","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106951456,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"010822d9-b7e5-43ab-b0ca-55294fda52a3"},{"version":"CommandV1","origId":1561774,"guid":"b61c452a-703c-4ee4-a1f7-eb65cad562fe","subtype":"command","commandType":"auto","position":30.0,"command":"%scala\n// TODO\nval ssaNamesFile = \"dbfs:/mnt/training/ssn/names.parquet\"\nval historicNamesDF = 0 // <<FILL_IN>>","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106951509,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"66feb5bd-76c9-45e3-acd2-27392b1c1ace"},{"version":"CommandV1","origId":1561775,"guid":"8bde0933-ed50-46a8-9145-deb8dcaa1bf3","subtype":"command","commandType":"auto","position":30.5,"command":"%python\n# TODO\nssaNamesFile = \"dbfs:/mnt/training/ssn/names.parquet\"\nhistoricNamesDF = 0 # 
<<FILL_IN>>","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106951544,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"336d4415-e70f-4da8-9f2f-3346ae5028cd"},{"version":"CommandV1","origId":1561776,"guid":"9b0cd01c-3f2a-4402-9dc8-a6729654fe9d","subtype":"command","commandType":"auto","position":31.0,"command":"%scala\n// ANSWER\nval historicNamesSql = \"\"\"\n SELECT firstName, year, total\n FROM ssa_names\n NATURAL INNER JOIN (\n SELECT year, gender, MAX(total) AS total\n FROM ssa_names\n GROUP BY year, gender\n ) as max_names\n WHERE gender='F' AND year IN (1885, 1915, 1945, 1975, 2005)\n ORDER BY year\n\"\"\"\nval historicNamesDF = spark.sql(historicNamesSql)\ndisplay(historicNamesDF)","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106951594,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a 
user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"c40c1beb-fa4d-469a-bdde-f7f52a64d853"},{"version":"CommandV1","origId":1561777,"guid":"e0d421d5-a9f5-40a6-956a-cfa365789ae1","subtype":"command","commandType":"auto","position":31.5,"command":"%python\n# ANSWER\nhistoricNamesSql = \"\"\"\n SELECT firstName, year, total\n FROM ssa_names\n NATURAL INNER JOIN (\n SELECT year, gender, MAX(total) AS total\n FROM ssa_names\n GROUP BY year, gender\n ) as max_names\n WHERE gender=\"F\" AND year IN (1885, 1915, 1945, 1975, 2005)\n ORDER BY year\n\"\"\"\nhistoricNamesDF = spark.sql(historicNamesSql)\ndisplay(historicNamesDF)","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106951640,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"bf1c1f6d-987b-41cd-8164-847d925c6766"},{"version":"CommandV1","origId":1561778,"guid":"84be2e78-9a58-469a-8ffa-d7083125c682","subtype":"command","commandType":"auto","position":32.0,"command":"%scala\n// TEST\n// Run this test case to verify your answer to the exercise above is correct.\nassert(\n historicNamesDF.as[(String, Int, Int)].collect().toList ==\n List((\"Mary\",1885,9128), (\"Mary\",1915,58187), (\"Mary\",1945,59284), (\"Jennifer\",1975,58185), (\"Emily\",2005,23928))\n)\n\n// Alternate approach to doing tests. 
It may be easier for a lay person to understand, but also less elegant.\nassert(\n historicNamesDF.collect().toList.toString == \n \"List([Mary,1885,9128], [Mary,1915,58187], [Mary,1945,59284], [Jennifer,1975,58185], [Emily,2005,23928])\"\n)","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106951691,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"8202fd4e-19d7-4aa6-ac38-2768809fce63"},{"version":"CommandV1","origId":1561779,"guid":"fbe10f11-1440-4f9f-9e00-51e1e7a47cbf","subtype":"command","commandType":"auto","position":32.5,"command":"%python\n# TEST\n# Run this test case to verify your answer to the exercise above is correct.\nassert(\n [(row.firstName, row.year, row.total) for row in historicNamesDF.collect()] == \n [(u'Mary', 1885, 9128), (u'Mary', 1915, 58187), (u'Mary', 1945, 59284), (u'Jennifer', 1975, 58185), (u'Emily', 2005, 23928)]\n)\n\n# Alternate approach to doing tests. 
It may be easier for a lay person to understand, but also less elegant.\nassert(\n str(historicNamesDF.collect()) == \n \"[Row(firstName=u'Mary', year=1885, total=9128), Row(firstName=u'Mary', year=1915, total=58187), Row(firstName=u'Mary', year=1945, total=59284), Row(firstName=u'Jennifer', year=1975, total=58185), Row(firstName=u'Emily', year=2005, total=23928)]\"\n)","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"workflows":[],"startTime":0,"submitTime":1498106951741,"finishTime":0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"a user","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"streamStates":{},"nuid":"8ad15b44-faeb-4c5a-83c1-2055f5e8bd30"}],"dashboards":[],"guid":"b94cc7c9-7fc6-40e3-b8f9-3828a2e54bcc","globalVars":{},"iPythonMetadata":null,"inputWidgets":{}};</script>
<script
src="https://databricks-prod-cloudfront.cloud.databricks.com/static/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855/js/notebook-main.js"
onerror="window.mainJsLoadError = true;"></script>
</head>
<body>
<script>
if (window.mainJsLoadError) {
var u = 'https://databricks-prod-cloudfront.cloud.databricks.com/static/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855/js/notebook-main.js';
var b = document.getElementsByTagName('body')[0];
var c = document.createElement('div');
c.innerHTML = ('<h1>Network Error</h1>' +
'<p><b>Please check your network connection and try again.</b></p>' +
'<p>Could not load a required resource: ' + u + '</p>');
c.style.margin = '30px';
c.style.padding = '20px 50px';
c.style.backgroundColor = '#f5f5f5';
c.style.borderRadius = '5px';
b.appendChild(c);
}
</script>
</body>
</html>