- Since it may take some time to get the allocation, we request it now already!
- Follow the best procedure for your cluster, e.g. from the **command line** or **OnDemand**.

.. admonition:: How?
   :class: dropdown

   The following Slurm options need to be set:
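As a generic sketch only (the account name, time limit, and core count below are placeholders, not this course's actual settings), an interactive allocation request could look like:

```shell
# Placeholder values -- replace with your own project account and resource needs
salloc --account=my_project --time=01:00:00 --nodes=1 --cpus-per-task=4
```

Your cluster's documentation has the authoritative option names and defaults.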
File formats
------------

.. admonition:: Bits and Bytes
   :class: dropdown

   - The smallest building block of storage and memory (RAM) in a computer is a bit, which stores either a 0 or a 1.
   - Normally, 8 bits are combined in a group to make a byte.
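A quick way to see bits and bytes from Python itself (values chosen only for illustration):

```python
# One byte is a group of 8 bits; each bit stores a 0 or a 1.
n = 0b01000001                 # 8 bits written out explicitly
print(n)                       # 65 -- the same value in decimal
print(n.to_bytes(1, "big"))    # b'A' -- that byte interpreted as ASCII
print((255).bit_length())      # 8  -- the largest value one byte can hold
print((256).bit_length())      # 9  -- one byte is no longer enough
```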
Adapted from Aalto University's `Python for scientific computing <https://aaltoscicomp.github.io/python-for-scicomp/work-with-data/#what-is-a-data-format>`__

.. seealso::

   - ENCCS course "HPDA-Python": `Scientific data <https://enccs.github.io/hpda-python/scientific-data/>`_
.. challenge:: View file formats

   - Go over the file formats and see if some are more relevant for your work.
   - Would you look at other file formats, and why?

.. challenge:: (optional)

   - Start Jupyter or just a Python shell.
   - Go through and test the lines at the page https://docs.scipy.org/doc/scipy-1.13.1/reference/generated/scipy.io.netcdf_file.html
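The ``scipy.io.netcdf_file`` page linked above walks through a small write/read round trip; a condensed version of that pattern (the file name is just an example) can serve as a starting point:

```python
import numpy as np
from scipy.io import netcdf_file

# Write a small NetCDF file with one dimension and one variable
f = netcdf_file("simple.nc", "w")
f.history = "Created for a test"
f.createDimension("time", 10)
time = f.createVariable("time", "i", ("time",))
time[:] = np.arange(10)
time.units = "days since 2008-01-01"
f.close()

# Read it back; attributes come back as bytes
f = netcdf_file("simple.nc", "r")
print(f.history)                  # b'Created for a test'
print(f.variables["time"][-1])    # 9
f.close()
```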
Computing efficiency with Python
--------------------------------
Xarray package
..............

- ``xarray`` is a Python package that builds on NumPy but adds labels to **multi-dimensional arrays**.
- It introduces **labels in the form of dimensions, coordinates and attributes** on top of raw NumPy-like multidimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer experience.
- It also **borrows heavily from the Pandas package for labelled tabular data** and integrates tightly with Dask for parallel computing.
- Xarray is particularly tailored to working with NetCDF files, but it works with other file formats as well.
- Explore it a bit in the (optional) exercise below!
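A tiny illustration of label-based selection (the city names and temperature values here are made up):

```python
import numpy as np
import xarray as xr

# Label the axes of a plain NumPy array with dimensions and coordinates
temps = xr.DataArray(
    np.array([[1.0, 4.0], [-2.0, 3.0], [0.0, 5.0]]),
    dims=("day", "city"),
    coords={"day": [1, 2, 3], "city": ["Oslo", "Bergen"]},
    name="temperature",
)
# Select by label instead of by integer position
print(temps.sel(city="Oslo").values)                    # [ 1. -2.  0.]
print(float(temps.mean(dim="day").sel(city="Bergen")))  # 4.0
```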
Dask
----
Big file → split into chunks → parallel workers → results combined.

- Briefly explain what happens when a Dask job runs on multiple cores.
Polars package
..............

- ``polars`` is a Python package that presents itself as a **Blazingly Fast DataFrame Library** that:

  - Utilizes all available cores on your machine.
  - Optimizes queries to reduce unneeded work/memory allocations.
  - Handles datasets much larger than your available RAM.
  - Offers a consistent and predictable API.
  - Adheres to a strict schema (data types should be known before running the query).

.. admonition:: Key features
   :class: dropdown

   - Fast: written from scratch in **Rust**
   - I/O: first-class **support for all common data storage layers**
   - **Intuitive API**: write your queries the way they were intended; internally, there is a query optimizer.
   - Out of core: **streaming** without requiring all your data to be in memory at the same time, i.e. **chunking**
   - **Parallel**: divides the workload among the available CPU cores without any additional configuration.
   - GPU support: optionally run queries on **NVIDIA GPUs**
   - `Apache Arrow <https://arrow.apache.org/overview/>`_ support

See https://pola.rs/ for more.