Skip to content

Latest commit

 

History

History
56 lines (35 loc) · 1.32 KB

File metadata and controls

56 lines (35 loc) · 1.32 KB

Spark SQL ETL Pipeline

Objective: Build an ETL pipeline using Spark SQL

Step 1: Created a table

spark.sql(""" CREATE TABLE IF NOT EXISTS nyc_taxi USING DELTA LOCATION '/mnt/delta/nyc-taxi' """)

Image


Step 2: Cleaned data (e.g., remove nulls)

spark.sql(""" CREATE OR REPLACE VIEW cleaned_taxi AS SELECT * FROM nyc_taxi WHERE passenger_count IS NOT NULL AND trip_distance > 0 """)

Image


Step 3: Created a UDF (function)

spark.sql(""" CREATE FUNCTION calculate_fare_per_mile(fare FLOAT, distance FLOAT) RETURNS FLOAT RETURN fare / distance """)

Image

Tip: Use spark.sql (""" DROP FUNCTION <function_name> """) to delete the function if error saying there is one that still exists, afterwards continue with 'CREATE FUNCTION' command .


Step 4: Transformed data

spark.sql(""" SELECT *, calculate_fare_per_mile(total_amount, trip_distance) AS fare_per_mile FROM cleaned_taxi """).write.mode("overwrite").format("delta").save("/mnt/delta/transformed-taxi")

Image