Performance drops when Python is paired with a Rust backend.
Interface similar to pandas:
df1 = edf.read_csv(file_names) - Only creates the dataframe; nothing is executed yet.
To obtain the actual result, you need to call another function on the dataframe.
WAY 1: Getting results one at a time - extrinsic result of this DF.
while df1.has_result():
    result_df = df1.result()
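A minimal sketch of the pull-style loop above, to make the "create now, execute later" behaviour concrete. `LazyFrame` and `fake_csv_batches` are hypothetical stand-ins (not the real edf classes); only `has_result()` and `result()` come from the notes, and their exact semantics here are an assumption.

```python
from typing import Iterable, Iterator, List, Optional


class LazyFrame:
    """Hypothetical stand-in for the dataframe returned by edf.read_csv()."""

    def __init__(self, batches: Iterable[List[dict]]) -> None:
        # Creation only records the source; nothing is read yet.
        self._source = batches
        self._iter: Optional[Iterator[List[dict]]] = None
        self._next: Optional[List[dict]] = None

    def _advance(self) -> None:
        # Execution starts on first use; one batch is prefetched at a time.
        if self._iter is None:
            self._iter = iter(self._source)
        self._next = next(self._iter, None)

    def has_result(self) -> bool:
        if self._iter is None:
            self._advance()
        return self._next is not None

    def result(self) -> List[dict]:
        if not self.has_result():
            raise RuntimeError("no more results")
        batch = self._next
        self._advance()
        return batch


executed: List[int] = []  # records which batches were actually produced


def fake_csv_batches() -> Iterator[List[dict]]:
    # Assumed data source: three one-row batches.
    for i in range(3):
        executed.append(i)
        yield [{"quantity": float(i)}]


df1 = LazyFrame(fake_csv_batches())
executed_after_create = list(executed)  # still empty: nothing has run

results = []
while df1.has_result():
    results.append(df1.result())
```

In this sketch the source is only touched once `has_result()` is first called, which mirrors the intent that `read_csv` by itself does no work.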
WAY 2: Getting results progressively. Stopping execution when good enough.
df1.run_online(callback_function)
The good part is that stopping execution stops the execution of all nodes.
We still get some benefit from pipelining, since there are multiple steps.
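One possible shape for the progressive mode, as a hedged sketch. It assumes the callback returns False to signal "good enough"; the notes do not specify how the callback requests a stop, and the free function `run_online` below is a hypothetical stand-in for the method on the dataframe.

```python
from typing import Callable, Iterable, List


def run_online(partials: Iterable[float], callback: Callable[[float], bool]) -> int:
    """Feed each partial result to callback; stop once it returns False.

    Returns the number of partial results that were delivered.
    """
    delivered = 0
    for partial in partials:
        delivered += 1
        if not callback(partial):
            break  # assumed convention: False means "stop execution"
    return delivered


seen: List[float] = []


def good_enough(estimate: float) -> bool:
    # Stop when the estimate moved by less than 0.2 since the previous one.
    converged = bool(seen) and abs(estimate - seen[-1]) < 0.2
    seen.append(estimate)
    return not converged


# Assumed stream of progressively refined estimates.
delivered = run_online(iter([10.0, 12.0, 12.4, 12.5, 12.5]), good_enough)
```

Here the last estimate is never computed, because the callback stops the run after the fourth one.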
How do we provide a mechanism to stop execution while preserving progress?
We can have an atomic variable that is checked during node execution.
If the variable is true, then that node's execution is paused.
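The pause flag could be sketched as follows. This uses Python's `threading.Event` as the atomic variable; the `Node` class, its `step()`/`run()` methods, and the doubling "work" are hypothetical and only illustrate that a paused node keeps its position and output. The real backend is Rust, where an atomic boolean would presumably play the same role.

```python
import threading
from typing import List


class Node:
    """Hypothetical execution node that checks a shared pause flag."""

    def __init__(self, items: List[int], pause_flag: threading.Event) -> None:
        self._items = items
        self._pos = 0  # progress survives a pause
        self._pause = pause_flag
        self.output: List[int] = []

    def step(self) -> bool:
        """Process one item unless paused or finished; True if progress was made."""
        if self._pause.is_set() or self._pos >= len(self._items):
            return False
        self.output.append(self._items[self._pos] * 2)  # placeholder work
        self._pos += 1
        return True

    def run(self) -> None:
        # Keep going until the flag is set or the input is exhausted.
        while self.step():
            pass


pause = threading.Event()
node = Node([1, 2, 3, 4], pause)

node.step()
node.step()
pause.set()  # in a real system another thread would set this
node.run()  # makes no progress while paused
output_while_paused = list(node.output)

pause.clear()
node.run()  # resumes from where it stopped
```

Sharing one flag across all nodes would give the "stopping execution stops all nodes" behaviour described earlier.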
df2 = df1.filter(lambda x: x["quantity"] > 5.0)
Now, df2 is an execution graph - a custom class which has (1) an execution service, (2) the list of nodes, and (3) their execution output channels (which will remain unconsumed, so that the variable can later be re-used in another query).
When you call result() on this, it starts execution on the read_csv() and appender() nodes.
When executing, read_csv() writes to RESULT-DF1; appender reads from RESULT-DF1 and writes to RESULT-DF2. Thus, instead of subscribing to a node, we need another public interface to subscribe to a channel.
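A rough, single-threaded sketch of subscribing to channels rather than nodes. `Channel`, `read_csv_node`, and `appender_node` are hypothetical names; the sketch assumes an append-only channel where each subscriber has its own cursor, so reading never consumes the data. A real version would presumably block until more data arrives or the channel is closed.

```python
from typing import Callable, Iterator, List


class Channel:
    """Hypothetical append-only output channel; readers do not consume data."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._batches: List[List[dict]] = []

    def write(self, batch: List[dict]) -> None:
        self._batches.append(batch)

    def subscribe(self) -> Iterator[List[dict]]:
        # Each subscriber walks the channel with an independent cursor.
        pos = 0
        while pos < len(self._batches):
            yield self._batches[pos]
            pos += 1


def read_csv_node(rows: List[dict], out: Channel, batch_size: int = 2) -> None:
    # Stand-in for reading files: emit the rows in fixed-size batches.
    for i in range(0, len(rows), batch_size):
        out.write(rows[i:i + batch_size])


def appender_node(src: Channel, out: Channel,
                  predicate: Callable[[dict], bool]) -> None:
    # Reads from one channel and appends the matching rows to another.
    for batch in src.subscribe():
        out.write([row for row in batch if predicate(row)])


rows = [{"quantity": 3.0}, {"quantity": 7.0}, {"quantity": 9.0}]
result_df1 = Channel("RESULT-DF1")
result_df2 = Channel("RESULT-DF2")

read_csv_node(rows, result_df1)
appender_node(result_df1, result_df2, lambda x: x["quantity"] > 5.0)

df2_rows = [row for batch in result_df2.subscribe() for row in batch]
# RESULT-DF1 is still fully readable, so df1 can feed another query.
df1_rows = [row for batch in result_df1.subscribe() for row in batch]
```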
py-wake
Things to figure out
df3 = df2.map(lambda x: col("revenue", (1 - x["discount"]) * x["extendedprice"]))
Alternate way - by default inplace.
Challenges