Commit ed93692

Merge pull request #256 from UPPMAX/bclaremar-patch-2
big_data.rst format polars
2 parents 208cd34 + d6248a4

1 file changed: 33 additions & 45 deletions

File: docs/day3/big_data.rst
@@ -220,8 +220,9 @@ Exercise: Memory allocation (10 min)
 - Since it may take some time to get the allocation we do it now already!
 - Follow the best procedure for your cluster, e.g. from **command-line** or **OnDemand**.
 
-.. challenge:: How?
-   :class: drop-down
+.. admonition:: How?
+   :class: dropdown
+
 
    The following Slurm options need to be set
@@ -330,6 +331,7 @@ File formats
 ------------
 
 .. admonition:: Bits and Bytes
+   :class: dropdown
 
    - The smallest building block of storage and memory (RAM) in the computer is a bit, which stores either a 0 or a 1.
    - Normally, 8 bits are combined in a group to make a byte.
@@ -584,7 +586,7 @@ An overview of common data formats
 
 Adapted from Aalto University's `Python for scientific computing <https://aaltoscicomp.github.io/python-for-scicomp/work-with-data/#what-is-a-data-format>`__
 
-... seealso::
+.. seealso::
 
    - ENCCS course "HPDA-Python": `Scientific data <https://enccs.github.io/hpda-python/scientific-data/>`_
    - Aalto Scientific Computing course "Python for Scientific Computing": `Xarray <https://aaltoscicomp.github.io/python-for-scicomp/xarray/>`_
@@ -597,16 +599,16 @@ Exercise file formats (10 minutes)
 - Read: https://stackoverflow.com/questions/49854065/python-netcdf4-library-ram-usage
 - What about using NETCDF files and memory?
 
-.. challenge::
-
-   - Start Jupyter or just a Python shell and
-   - Go though and test the lines at the page at https://docs.scipy.org/doc/scipy-1.13.1/reference/generated/scipy.io.netcdf_file.html
-
-.. challenge::
+.. challenge:: View file formats
 
    - Go over file formats and see if some are more relevant for your work.
    - Would you look at other file formats and why?
 
+.. challenge:: (optional)
+
+   - Start Jupyter or just a Python shell, and
+   - Go through and test the lines on the page https://docs.scipy.org/doc/scipy-1.13.1/reference/generated/scipy.io.netcdf_file.html
 
 Computing efficiency with Python
 --------------------------------
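For the optional challenge above, a condensed sketch of the ``scipy.io.netcdf_file`` round trip shown on that SciPy page (the file name ``simple.nc`` is arbitrary; ``mmap=False`` loads the data eagerly rather than memory-mapping it):

```python
import numpy as np
from scipy.io import netcdf_file

# Write a small NetCDF file with one dimension and one variable.
with netcdf_file("simple.nc", "w") as f:
    f.createDimension("time", 10)
    time = f.createVariable("time", "i", ("time",))
    time[:] = np.arange(10)
    time.units = "days since 2008-01-01"

# Read it back. With mmap=False the data is copied into memory,
# so it stays valid after the file is closed.
with netcdf_file("simple.nc", "r", mmap=False) as f:
    time = f.variables["time"]
    print(time.units.decode())  # days since 2008-01-01
    print(time[:].sum())        # 45
```

With the default ``mmap=True`` the variable data is memory-mapped instead, which is exactly the memory question the Stack Overflow thread above discusses.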
@@ -629,42 +631,16 @@ Xarray package
 ..............
 
 - ``xarray`` is a Python package that builds on NumPy but adds labels to **multi-dimensional arrays**.
-- introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like multidimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer experience.
 
-- It also borrows heavily from the Pandas package for labelled tabular data and integrates tightly with dask for parallel computing.
+- introduces **labels in the form of dimensions, coordinates and attributes** on top of raw NumPy-like multidimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer experience.
+- It also **borrows heavily from the Pandas package for labelled tabular data** and integrates tightly with dask for parallel computing.
 
-- Xarray is particularly tailored to working with NetCDF files.
-- It reads and writes to NetCDF file using
+- Xarray is particularly tailored to working with NetCDF files.
+- But works for other files as well.
 
 - Explore it a bit in the (optional) exercise below!
 
-Polars package
-..............
-
-**Blazingly Fast DataFrame Library**
-
-.. admonition:: Goals
-
-   The goal of Polars is to provide a lightning fast DataFrame library that:
-
-   - Utilizes all available cores on your machine.
-   - Optimizes queries to reduce unneeded work/memory allocations.
-   - Handles datasets much larger than your available RAM.
-   - A consistent and predictable API.
-   - Adheres to a strict schema (data-types should be known before running the query).
-
-.. admonition:: Key features
-   :class: drop-down
-
-   - Fast: Written from scratch in Rust
-   - I/O: First class support for all common data storage layers:
-   - Intuitive API: Write your queries the way they were intended. Internally, there is a query optimizer.
-   - Out of Core: streaming without requiring all your data to be in memory at the same time.
-   - Parallel: dividing the workload among the available CPU cores without any additional configuration.
-   - GPU Support: Optionally run queries on NVIDIA GPUs
-   - Apache Arrow support
-
-https://pola.rs/
 
 Dask
 ----
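As a taste of the labelled-array style the Xarray bullets describe, a minimal sketch (the city names and temperatures are made up for illustration):

```python
import numpy as np
import xarray as xr

# A 2 x 3 array with named dimensions and labelled coordinates.
temperature = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("year", "city"),
    coords={"year": [2023, 2024], "city": ["Uppsala", "Lund", "Kiruna"]},
    attrs={"units": "degC"},
)

# Select by label instead of integer position.
print(temperature.sel(city="Lund").values)  # [1. 4.]

# Reductions can name the dimension instead of an axis number.
print(temperature.mean(dim="year").values)  # [1.5 2.5 3.5]
```

The same object can be written to and read back from NetCDF with ``DataArray.to_netcdf`` and ``xarray.open_dataarray``, which is the tie-in to the file-format section above.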
@@ -751,16 +727,28 @@ Big file → split into chunks → parallel workers → results combined.
 
 - Briefly explain what happens when a Dask job runs on multiple cores.
 
+Polars package
+..............
 
+- ``polars`` is a Python package that presents itself as a **Blazingly Fast DataFrame Library**
+- Utilizes all available cores on your machine.
+- Optimizes queries to reduce unneeded work/memory allocations.
+- Handles datasets much larger than your available RAM.
+- A consistent and predictable API.
+- Adheres to a strict schema (data-types should be known before running the query).
 
-Exercise DASK
--------------
-
-
-
+.. admonition:: Key features
+   :class: dropdown
 
+   - Fast: Written from scratch in **Rust**
+   - I/O: First class **support for all common data storage** layers
+   - **Intuitive API**: Write your queries the way they were intended. Internally, there is a query optimizer.
+   - Out of Core: **streaming** without requiring all your data to be in memory at the same time, i.e. **chunking**
+   - **Parallel**: dividing the workload among the available CPU cores without any additional configuration.
+   - GPU Support: Optionally run queries on **NVIDIA GPUs**
+   - `Apache Arrow <https://arrow.apache.org/overview/>`_ support
 
+https://pola.rs/
 
 Workflow
 --------
