Databricks_Projects/Databricks-Spark-SQL-ETL.md at main · DarrenDavy12/Databricks_Projects

Spark SQL ETL Pipeline

Objective: Build an ETL pipeline using Spark SQL

Step 1: Created a table

spark.sql(""" CREATE TABLE IF NOT EXISTS nyc_taxi USING DELTA LOCATION '/mnt/delta/nyc-taxi' """)
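Steps 2 and 4 reference the columns `passenger_count`, `trip_distance`, and `total_amount`, so it can help to confirm that the registered table exposes them before going further. The following is a minimal sketch, not part of the original pipeline: `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names, and `spark` is assumed to be the session that Databricks notebooks provide.

```python
# Columns that the later steps of this pipeline read from nyc_taxi.
REQUIRED_COLUMNS = {"passenger_count", "trip_distance", "total_amount"}


def missing_columns(spark, table_name="nyc_taxi", required=REQUIRED_COLUMNS):
    """Return a sorted list of required columns absent from the table.

    `spark` is a SparkSession (predefined in Databricks notebooks).
    `missing_columns` is a hypothetical helper for illustration only.
    """
    # spark.table() returns a DataFrame; .columns lists its column names.
    present = set(spark.table(table_name).columns)
    return sorted(set(required) - present)
```

In a notebook, `missing_columns(spark)` returning an empty list suggests the table has the columns the later steps expect.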

Step 2: Cleaned data (removed rows with a null passenger count or a non-positive trip distance)

spark.sql(""" CREATE OR REPLACE VIEW cleaned_taxi AS SELECT * FROM nyc_taxi WHERE passenger_count IS NOT NULL AND trip_distance > 0 """)
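To see how many rows the cleaning filter removes, the row counts of the table and the view can be compared. This is a sketch under the assumption that both `nyc_taxi` and `cleaned_taxi` exist as created above; `cleaning_summary` is a hypothetical helper name.

```python
def cleaning_summary(spark, raw="nyc_taxi", cleaned="cleaned_taxi"):
    """Compare row counts before and after the cleaning view.

    `spark` is a SparkSession (predefined in Databricks notebooks).
    Returns a dict with the raw count, cleaned count, and rows dropped.
    """
    # Each count() triggers a Spark job, so this scans the data twice.
    raw_count = spark.table(raw).count()
    cleaned_count = spark.table(cleaned).count()
    return {
        "raw": raw_count,
        "cleaned": cleaned_count,
        "dropped": raw_count - cleaned_count,
    }
```

A large `dropped` value relative to `raw` may indicate that the filter is stricter than intended for the data at hand.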

Step 3: Created a SQL UDF (user-defined function)

spark.sql(""" CREATE FUNCTION calculate_fare_per_mile(fare FLOAT, distance FLOAT) RETURNS FLOAT RETURN fare / distance """)

Tip: If CREATE FUNCTION fails with an error saying the function already exists, run spark.sql(""" DROP FUNCTION <function_name> """) to delete it, then rerun the CREATE FUNCTION command.
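One possible variation on Step 3 is to use `CREATE OR REPLACE FUNCTION`, which overwrites an existing definition instead of failing, and to return NULL when the distance is not positive. The sketch below is an assumption about how this could look, not the author's original code: `register_fare_udf` and `fare_per_mile` are hypothetical names, and the Python function only mirrors the SQL logic so the formula can be checked without a cluster.

```python
# SQL UDF definition. CASE without ELSE yields NULL, so a zero, negative,
# or NULL distance produces NULL instead of a division error or infinity.
CREATE_UDF_SQL = """
CREATE OR REPLACE FUNCTION calculate_fare_per_mile(fare FLOAT, distance FLOAT)
RETURNS FLOAT
RETURN CASE WHEN distance > 0 THEN fare / distance END
"""


def register_fare_udf(spark):
    """Create or replace the SQL UDF (hypothetical helper).

    `spark` is a SparkSession (predefined in Databricks notebooks).
    """
    spark.sql(CREATE_UDF_SQL)


def fare_per_mile(fare, distance):
    """Plain-Python mirror of the SQL expression, for local checks only."""
    if fare is None or distance is None or distance <= 0:
        return None
    return fare / distance
```

Because the `cleaned_taxi` view already keeps only rows with `trip_distance > 0`, the guard is defensive; it matters mainly if the function is reused on unfiltered data.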

Step 4: Transformed data

spark.sql(""" SELECT *, calculate_fare_per_mile(total_amount, trip_distance) AS fare_per_mile FROM cleaned_taxi """).write.mode("overwrite").format("delta").save("/mnt/delta/transformed-taxi")
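After the write, the output can be read back to confirm that the Delta files exist and include the new column. This is a minimal sketch assuming the output path used in Step 4; `output_row_count` is a hypothetical helper name.

```python
def output_row_count(spark, path="/mnt/delta/transformed-taxi"):
    """Load the transformed Delta output and return its row count.

    `spark` is a SparkSession (predefined in Databricks notebooks).
    Raises ValueError if the fare_per_mile column is missing.
    """
    # Read the Delta files written by Step 4 directly from their path.
    df = spark.read.format("delta").load(path)
    if "fare_per_mile" not in df.columns:
        raise ValueError(f"fare_per_mile column not found at {path}")
    return df.count()
```

Since Step 4 writes every row of `cleaned_taxi`, the returned count would be expected to match the row count of that view.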
