DeepOLA: Python Interface Thoughts

py-wake
=======
- Things to figure out
    - Drop in performance when having Python with Rust backend.

- Interface similar to pandas:
    - `df1 = edf.read_csv(file_names)` - Only creates and doesn't execute.
      - To obtain actual result, you need to call another function on the dataframe.
      - WAY 1: Getting Result 1 at a time - Extrinsic Result of this DF.
         ```
          while(df1.has_result())
            result_df = df1.result()
         ``` 

      - WAY 2: Getting Result Progressively. Stopping execution when good enough.
        `df1.run_online(callback_function)`
        - The good part is - stopping execution stops execution of all nodes.
        - We are still able to have some benefit of pipelining since multiple steps.
        - How to provide a mechanism to stop execution while preserving progress?
          - Can have an atomic variable which is checked in the Node execution.
          - If the variable is true, then that node's execution is paused.

    - `df2 = df1.filter(lambda x: x["quantity"] > 5.0)`
        Now, df2 is an execution graph - a custom class which has (1) execution service, (2) the list of nodes, and (3) their execution output channels (which will remain unconsumed, so that the variable can be later re-used in another query):

        ```
                    [read_csv]    [appender] => [RESULT-DF2]
                        ||       //
                    [RESULT-DF1]
        ```
      - When you call result() on this, it starts execution on the read_csv() and appender() node.
      - read_csv() when executing writes to RESULT-DF1, appender reads from RESULT-DF1 and writes to RESULT-DF2. Thus, instead of subscribing to a node, need another public interface to subscribe to a channel.

    - `df3 = df2.map(lambda x: col("revenue", (1 - x["discount"]) * x["extendedprice"])))`
        ```
            [read_csv]      [appender]   [appender] => [RESULT-DF3]
                ||       //     ||      //
            [RESULT-DF1]    [RESULT-DF2]
        ```
        - Now, each variable df1, df2 => Their corresponding nodes have output results saved.
        - Thus, by default we can make it to be inplace update. So, that copies are not created.
    
    - Alternate way - by default inplace. 
    ```
    df1 = edf.read_csv([List of files])
    df1.filter(lambda x: x["quantity"] > 5.0)
    df1.map(lambda x: col("revenue", (1 - x["discount"]) * x["extendedprice"])))
    
    [read_csv] => [appender] => [appender] => [RESULT-DF1]
    ```

- Challenges
  - Overhead of stopping execution in-between.
  - If supporting re-using variables across different queries, then that will lead to keeping the output reader available. Higher memory usage.
  - How to be able to transform python functions or expressions to Rust functions/lambdas?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DeepOLA: Python Interface Thoughts #159

py-wake

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

DeepOLA: Python Interface Thoughts #159

Description

py-wake

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions