PySpark and Glue together

Michael, 28 August 2018

I have been playing around with Spark (on EMR) and the Glue Data Catalog a bit, and I really like using them together. Being able to use EMR to transform the data and then query it from Spark, Glue or Athena - and, through Athena, via a JDBC data source - is a real winner.

That said, it isn't really that clear how you access and update the Glue Data Catalog from within EMR. This post will hopefully point you in the right direction.

EMR Setup

To work with the Glue Data Catalog, the EMR cluster needs to be configured to use it. There are details for this in the AWS documentation, but when automating cluster creation the following configuration classifications need to be applied to the cluster:

- Classification: "hive-site"
  Properties:
    "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
- Classification: "spark-hive-site"
  Properties:
    "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
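
For example, here is a minimal sketch of supplying those classifications when creating the cluster with boto3 - the release label, instance types, counts and IAM role names below are placeholders and will depend on your environment:

import boto3

emr = boto3.client("emr")

# both classifications point Hive and Spark at the Glue Data Catalog
glue_catalog_properties = {
    "hive.metastore.client.factory.class":
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
}

emr.run_job_flow(
    Name="glue-catalog-cluster",              # placeholder cluster name
    ReleaseLabel="emr-5.16.0",                # placeholder; use a release that supports the Glue Catalog
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Configurations=[
        {"Classification": "hive-site", "Properties": glue_catalog_properties},
        {"Classification": "spark-hive-site", "Properties": glue_catalog_properties},
    ],
    Instances={
        "MasterInstanceType": "m4.xlarge",    # placeholder instance types and count
        "SlaveInstanceType": "m4.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",        # default EMR roles, adjust as needed
    ServiceRole="EMR_DefaultRole",
)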

Accessing Glue Tables from EMR

To access the tables from within a Spark step, you need to instantiate the Spark session with the Glue catalog as the Hive metastore:

from pyspark.sql import SparkSession

# job_name is whatever you want to appear as the Spark application name
spark = SparkSession.builder \
        .appName(job_name) \
        .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
        .enableHiveSupport() \
        .getOrCreate()
spark.catalog.setCurrentDatabase("mydatabase")

The database name above is the name of the database within the Glue Data Catalog.

Now that the Spark session is set up correctly, you can query the Glue catalog through Spark SQL:

spark.sql("select * from mytable")

Note that you don't need to specify the location of the table; all of this information is stored in the Glue Data Catalog.
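
The result of spark.sql is an ordinary DataFrame, so you can keep transforming it like any other data you have read in - the column name below is just a hypothetical example:

# the query result is a normal DataFrame
df = spark.sql("select * from mytable")

# standard DataFrame operations apply; 'some_column' is a made-up column name
df.filter(df["some_column"].isNotNull()) \
  .groupBy("some_column") \
  .count() \
  .show()

# spark.table() is an equivalent way to read the same catalogued table
other = spark.table("mytable")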

Creating Glue Data Catalog Tables from Spark on EMR

Now, the prevailing wisdom is that you use the Glue crawlers to update the Data Catalog - my feeling is that, where possible, the catalog should be updated by the process that is actually landing (or modifying) the data. The advantage this gives is that subsequent steps can execute against the updated catalog without needing to run a crawler.

To create the table in Glue and save the data as Parquet, run the following:

dataframe.write.mode("overwrite").format("parquet").option("path", parquet_path).saveAsTable(glue_table)
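
If the data is partitioned, the same call can register the partition columns in the catalog as well - a sketch, assuming the DataFrame has year and month columns to partition by:

# 'year' and 'month' are hypothetical partition columns; saveAsTable records the table,
# its location and its partition columns in the Glue Data Catalog
dataframe.write \
    .mode("overwrite") \
    .format("parquet") \
    .partitionBy("year", "month") \
    .option("path", parquet_path) \
    .saveAsTable(glue_table)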

Creating Glue Data Catalog Tables from Glue Jobs

Now, you would think that this is easy… but unfortunately it isn’t. This will be the subject of another post once I do more research.