|
32 | 32 | "\n", |
33 | 33 | "- Taking notes in the lecture notebooks\n", |
34 | 34 | "- Using [another Python/pandas learning resource](https://python-public-policy.afeld.me/en/{{school_slug}}/resources.html)\n", |
35 | | - " - Hear things explained another way\n", |
36 | | - " - Ask in [Ed Discussion]({{discussions_url}}) if others have recommendations\n", |
| 35 | + " - Hear things explained another way\n", |
| 36 | + " - Ask in [Ed Discussion]({{discussions_url}}) if others have recommendations\n", |
37 | 37 | "- [Comment-driven development](https://www.sitepoint.com/comment-driven-development/)\n", |
38 | | - " - Otherwise, trying to do two steps in your head:\n", |
39 | | - " 1. Figuring out the logic\n", |
40 | | - " 1. Figuring out the syntax" |
| 38 | + " - Otherwise, trying to do two steps in your head:\n", |
| 39 | + " 1. Figuring out the logic\n", |
| 40 | + " 1. Figuring out the syntax\n" |
41 | 41 | ] |
42 | 42 | }, |
43 | 43 | { |
|
57 | 57 | "```python\n", |
58 | 58 | "# find valid ZIP codes\n", |
59 | 59 | "# filter the DataFrame to only invalid ZIP codes\n", |
60 | | - "```" |
| 60 | + "```\n" |
61 | 61 | ] |
62 | 62 | }, |
63 | 63 | { |
|
70 | 70 | "tags": [] |
71 | 71 | }, |
72 | 72 | "source": [ |
73 | | - "## [Boolean indexing](https://pandas.pydata.org/docs/user_guide/10min.html#boolean-indexing)" |
| 73 | + "## [Boolean indexing](https://pandas.pydata.org/docs/user_guide/10min.html#boolean-indexing)\n" |
74 | 74 | ] |
75 | 75 | }, |
76 | 76 | { |
|
225 | 225 | "tags": [] |
226 | 226 | }, |
227 | 227 | "source": [ |
228 | | - "When we compare single values (like `x > 6`), we get a single boolean back. Here, we are checking a _bunch_ of values, so we're going to get multiple booleans, returned as a Series." |
| 228 | + "When we compare single values (like `x > 6`), we get a single boolean back. Here, we are checking a _bunch_ of values, so we're going to get multiple booleans, returned as a Series.\n" |
229 | 229 | ] |
230 | 230 | }, |
231 | 231 | { |
|
365 | 365 | "\n", |
366 | 366 | "```python\n", |
367 | 367 | "people[people[\"age\"] > 40]\n", |
368 | | - "```" |
| 368 | + "```\n" |
369 | 369 | ] |
370 | 370 | }, |
371 | 371 | { |
|
382 | 382 | "\n", |
383 | 383 | "> Data Cleansing is a process of removing or fixing incorrect, malformed, incomplete, duplicate, or corrupted data\n", |
384 | 384 | "\n", |
385 | | - "https://hevodata.com/learn/data-cleansing-a-simplified-guide/" |
| 385 | + "https://hevodata.com/learn/data-cleansing-a-simplified-guide/\n" |
386 | 386 | ] |
387 | 387 | }, |
388 | 388 | { |
|
395 | 395 | "tags": [] |
396 | 396 | }, |
397 | 397 | "source": [ |
398 | | - "When have you needed to clean data?" |
| 398 | + "When have you needed to clean data?\n" |
399 | 399 | ] |
400 | 400 | }, |
401 | 401 | { |
|
408 | 408 | "tags": [] |
409 | 409 | }, |
410 | 410 | "source": [ |
411 | | - "What are continuous values?" |
| 411 | + "What are continuous values?\n" |
412 | 412 | ] |
413 | 413 | }, |
414 | 414 | { |
|
421 | 421 | "tags": [] |
422 | 422 | }, |
423 | 423 | "source": [ |
424 | | - "What are categorical values?" |
| 424 | + "What are categorical values?\n" |
425 | 425 | ] |
426 | 426 | }, |
427 | 427 | { |
|
439 | 439 | "From [my workshop on data cleaning](https://github.com/afeld/data-cleaning):\n", |
440 | 440 | "\n", |
441 | 441 | "- Missing data\n", |
442 | | - " - Empty values\n", |
| 442 | + " - Empty values\n", |
443 | 443 | "- Bad (junk) values\n", |
444 | | - " - Duplicates\n", |
445 | | - " - Mismatched types/formatting\n", |
| 444 | + " - Duplicates\n", |
| 445 | + " - Mismatched types/formatting\n", |
446 | 446 | "- Categorical values\n", |
447 | | - " - Uniqueness (cardinality)\n", |
448 | | - " - Value counts\n", |
| 447 | + " - Uniqueness (cardinality)\n", |
| 448 | + " - Value counts\n", |
449 | 449 | "- Continuous values\n", |
450 | | - " - Ranges\n", |
451 | | - " - Spread (distribution)" |
| 450 | + " - Ranges\n", |
| 451 | + " - Spread (distribution)\n" |
452 | 452 | ] |
453 | 453 | }, |
454 | 454 | { |
|
464 | 464 | "Notes:\n", |
465 | 465 | "\n", |
466 | 466 | "- \"Values\" in this case can be a single cell (in the spreadsheet sense) or a whole row\n", |
467 | | - "- \"Missing\" or \"duplicates\" can be columns (Series), tables (DataFrames), rows, or cells" |
| 467 | + "- \"Missing\" or \"duplicates\" can be columns (Series), tables (DataFrames), rows, or cells\n" |
468 | 468 | ] |
469 | 469 | }, |
470 | 470 | { |
|
482 | 482 | "- Empty\n", |
483 | 483 | "- Bad\n", |
484 | 484 | "- Unique\n", |
485 | | - "- Spread" |
| 485 | + "- Spread\n" |
486 | 486 | ] |
487 | 487 | }, |
488 | 488 | { |
|
496 | 496 | "tags": [] |
497 | 497 | }, |
498 | 498 | "source": [ |
499 | | - "## Setup" |
| 499 | + "## Setup\n" |
500 | 500 | ] |
501 | 501 | }, |
502 | 502 | { |
|
528 | 528 | "tags": [] |
529 | 529 | }, |
530 | 530 | "source": [ |
531 | | - "### Read our cleaned 311 Service Requests dataset" |
| 531 | + "### Read our cleaned 311 Service Requests dataset\n" |
532 | 532 | ] |
533 | 533 | }, |
534 | 534 | { |
|
571 | 571 | "\n", |
572 | 572 | "More data cleaning!\n", |
573 | 573 | "\n", |
574 | | - "" |
| 574 | + "\n" |
575 | 575 | ] |
576 | 576 | }, |
577 | 577 | { |
|
586 | 586 | "source": [ |
587 | 587 | "```\n", |
588 | 588 | "DtypeWarning: Columns (8,20,31,34) have mixed types.\n", |
589 | | - "```" |
| 589 | + "```\n" |
590 | 590 | ] |
591 | 591 | }, |
592 | 592 | { |
|
1273 | 1273 | "tags": [] |
1274 | 1274 | }, |
1275 | 1275 | "source": [ |
1276 | | - "ZIP codes _look_ numeric, but aren't really." |
| 1276 | + "ZIP codes _look_ numeric, but aren't really.\n" |
1277 | 1277 | ] |
1278 | 1278 | }, |
1279 | 1279 | { |
|
1286 | 1286 | "tags": [] |
1287 | 1287 | }, |
1288 | 1288 | "source": [ |
1289 | | - "[Read the ZIP codes in as strings.](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#text-data-types)" |
| 1289 | + "[Read the ZIP codes in as strings.](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#text-data-types)\n" |
1290 | 1290 | ] |
1291 | 1291 | }, |
1292 | 1292 | { |
|
1323 | 1323 | "tags": [] |
1324 | 1324 | }, |
1325 | 1325 | "source": [ |
1326 | | - "We fixed the dtype warning for column 8 (`Incident Zip`)." |
| 1326 | + "We fixed the dtype warning for column 8 (`Incident Zip`).\n" |
1327 | 1327 | ] |
1328 | 1328 | }, |
1329 | 1329 | { |
|
1728 | 1728 | "└─ start of string\n", |
1729 | 1729 | "```\n", |
1730 | 1730 | "\n", |
1731 | | - "[regex101](https://regex101.com/) is useful for testing them." |
| 1731 | + "Helpful tools:\n", |
| 1732 | + "\n", |
| 1733 | + "- [Regexper](https://regexper.com/#%5E%5Cd%7B5%7D%28%3F%3A-%5Cd%7B4%7D%29%3F%24)\n", |
| 1734 | + "- [regex101](https://regex101.com/)\n" |
1732 | 1735 | ] |
1733 | 1736 | }, |
1734 | 1737 | { |
|
1911 | 1914 | "tags": [] |
1912 | 1915 | }, |
1913 | 1916 | "source": [ |
1914 | | - "[Clear](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#inserting-missing-data) any invalid ZIP codes:" |
| 1917 | + "[Clear](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#inserting-missing-data) any invalid ZIP codes:\n" |
1915 | 1918 | ] |
1916 | 1919 | }, |
1917 | 1920 | { |
|
1939 | 1942 | "tags": [] |
1940 | 1943 | }, |
1941 | 1944 | "source": [ |
1942 | | - "[`.loc[]`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) is used for overwriting a subset of values." |
| 1945 | + "[`.loc[]`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) is used for overwriting a subset of values.\n" |
1943 | 1946 | ] |
1944 | 1947 | }, |
1945 | 1948 | { |
|
1956 | 1959 | "\n", |
1957 | 1960 | "- Hard part is finding what needs to be done\n", |
1958 | 1961 | "- Will be specific to your use case\n", |
1959 | | - "- Document what you did, since it will affect your results" |
| 1962 | + "- Document what you did, since it will affect your results\n" |
1960 | 1963 | ] |
1961 | 1964 | }, |
1962 | 1965 | { |
|
1969 | 1972 | "tags": [] |
1970 | 1973 | }, |
1971 | 1974 | "source": [ |
1972 | | - "## [In-class exercise](https://python-public-policy.afeld.me/en/{{school_slug}}/lecture_2_exercise.html)" |
| 1975 | + "## [In-class exercise](https://python-public-policy.afeld.me/en/{{school_slug}}/lecture_2_exercise.html)\n" |
1973 | 1976 | ] |
1974 | 1977 | }, |
1975 | 1978 | { |
|
1984 | 1987 | ] |
1985 | 1988 | }, |
1986 | 1989 | "source": [ |
1987 | | - "## [Concatenation](https://pandas.pydata.org/docs/user_guide/merging.html#concat)" |
| 1990 | + "## [Concatenation](https://pandas.pydata.org/docs/user_guide/merging.html#concat)\n" |
1988 | 1991 | ] |
1989 | 1992 | }, |
1990 | 1993 | { |
|
2250 | 2253 | "tags": [] |
2251 | 2254 | }, |
2252 | 2255 | "source": [ |
2253 | | - "## Simple [merge](https://pandas.pydata.org/docs/user_guide/merging.html#merge)" |
| 2256 | + "## Simple [merge](https://pandas.pydata.org/docs/user_guide/merging.html#merge)\n" |
2254 | 2257 | ] |
2255 | 2258 | }, |
2256 | 2259 | { |
|
2263 | 2266 | "tags": [] |
2264 | 2267 | }, |
2265 | 2268 | "source": [ |
2266 | | - "_I had [Copilot](https://code.visualstudio.com/docs/copilot/overview) generate the DataFrames, so no idea if the numbers are real._" |
| 2269 | + "_I had [Copilot](https://code.visualstudio.com/docs/copilot/overview) generate the DataFrames, so no idea if the numbers are real._\n" |
2267 | 2270 | ] |
2268 | 2271 | }, |
2269 | 2272 | { |
|
2445 | 2448 | "tags": [] |
2446 | 2449 | }, |
2447 | 2450 | "source": [ |
2448 | | - "How should we combine them?" |
| 2451 | + "How should we combine them?\n" |
2449 | 2452 | ] |
2450 | 2453 | }, |
2451 | 2454 | { |
|
2617 | 2620 | "source": [ |
2618 | 2621 | "To join DataFrames together, we will use the [pandas `.merge()` function](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/08_combine_dataframes.html#join-tables-using-a-common-identifier).\n", |
2619 | 2622 | "\n", |
2620 | | - "" |
| 2623 | + "\n" |
2621 | 2624 | ] |
2622 | 2625 | }, |
2623 | 2626 | { |
|
2635 | 2638 | "- [SQL `JOIN`](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html#join)\n", |
2636 | 2639 | "- [Spreadsheet `VLOOKUP`](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_spreadsheets.html#merging)\n", |
2637 | 2640 | "\n", |
2638 | | - "In general, called [\"record linkage\" or \"entity resolution\"](https://en.wikipedia.org/wiki/Record_linkage)." |
| 2641 | + "In general, called [\"record linkage\" or \"entity resolution\"](https://en.wikipedia.org/wiki/Record_linkage).\n" |
2639 | 2642 | ] |
2640 | 2643 | }, |
2641 | 2644 | { |
|
2815 | 2818 | "tags": [] |
2816 | 2819 | }, |
2817 | 2820 | "source": [ |
2818 | | - "[Different types of merges](https://www.geeksforgeeks.org/different-types-of-joins-in-pandas/)" |
| 2821 | + "[Different types of merges](https://www.geeksforgeeks.org/different-types-of-joins-in-pandas/)\n" |
2819 | 2822 | ] |
2820 | 2823 | }, |
2821 | 2824 | { |
|
2832 | 2835 | "source": [ |
2833 | 2836 | "## In-class exercise 2\n", |
2834 | 2837 | "\n", |
2835 | | - "Compute the migrant population as a percent of total by country using [UN data](https://data.un.org/). You're welcome to talk with your neighbors." |
| 2838 | + "Compute the migrant population as a percent of total by country using [UN data](https://data.un.org/). You're welcome to talk with your neighbors.\n" |
2836 | 2839 | ] |
2837 | 2840 | }, |
2838 | 2841 | { |
|
2845 | 2848 | "tags": [] |
2846 | 2849 | }, |
2847 | 2850 | "source": [ |
2848 | | - "## [Homework 2](https://python-public-policy.afeld.me/en/{{school_slug}}/hw_2.html)" |
| 2851 | + "## [Homework 2](https://python-public-policy.afeld.me/en/{{school_slug}}/hw_2.html)\n" |
2849 | 2852 | ] |
2850 | 2853 | } |
2851 | 2854 | ], |
|
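The notebook cells touched by this diff describe a ZIP code cleaning flow: read `Incident Zip` as strings, check each value against a regular expression, then clear the invalid ones with `.loc[]`. A minimal sketch of that flow follows. It assumes a small made-up DataFrame named `requests` in place of the real 311 dataset, and it uses the pattern decoded from the Regexper link in the diff; the notebook's actual code cells are not shown here and may differ.

```python
import pandas as pd

# Hypothetical sample rows (not from the 311 dataset), stored as strings
# so that ZIP codes are treated as text rather than numbers.
requests = pd.DataFrame(
    {"Incident Zip": ["10001", "11201-1234", "N/A", "1000", None]}
).astype("string")

# Pattern decoded from the Regexper URL in the diff:
# five digits, optionally followed by a hyphen and four more digits.
ZIP_PATTERN = r"^\d{5}(?:-\d{4})?$"

# Boolean indexing: comparing a whole column returns a Series of booleans.
# na=False treats already-missing values as "not valid".
is_valid = requests["Incident Zip"].str.contains(ZIP_PATTERN, na=False)

# .loc[] overwrites only the rows selected by the boolean mask.
requests.loc[~is_valid, "Incident Zip"] = pd.NA

print(requests)
```

Keeping the boolean mask in its own variable (`is_valid`) makes it easy to inspect or count the invalid rows before overwriting them, which helps when documenting what the cleaning step changed.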