Using AWS Glue to discover data

Now that you have all of the data you want to analyze in your S3 data lake, it is time to discover that data and make it available to be queried.

  • In the AWS Management Console, go to Services and select AWS Glue.

  • On the left-side navigation bar, select Crawlers. You are creating a Glue crawler, which will crawl through the data in your S3 bucket. The crawler connects to your S3 data store and classifies the data to determine its schema and metadata.

  • Select Add crawler.

  • Enter a crawler name. For this lab, peculiar-crawler will be used as the name.

  • Select Next. Leave the crawler source type as default, which should be Data stores. Select Next.

  • Now, choose the data store you want to crawl. It should default to S3, but you still need to specify the bucket and path of the data you want to discover. Select the folder icon to navigate to the path where the data is stored.

  • Find your bucket and select it at the root of the bucket. The path should look like this:

  • Hit Select and then Next. Do not add another data store - continue hitting Next.

  • You need to create an IAM role that grants your Glue crawler permission to access the resources it needs, such as your S3 bucket. Create a role, give it a name, and select Next.

  • Keep the Frequency set to Run on demand and select Next.

  • On the page where you define a database, select Add database and give it a name. This lab will use the name peculiar-data-catalog. Hit Create and then Next.

You just created a database in the Glue Data Catalog. The Data Catalog contains references to your data in S3: it is an index to the location, schema, and runtime metrics of your data, and it is populated by the Glue crawler.
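For readers who prefer scripting to the console, the choices made so far (crawler name, S3 target, IAM role, on-demand frequency, and database) can be sketched with the AWS SDK for Python. This is a minimal sketch, not the lab's official method. The role name and S3 path are placeholders you would replace with your own values, and the glue argument is assumed to be a boto3 Glue client.

```python
def build_crawler_request(crawler_name, role_name, database_name, s3_path):
    """Build the keyword arguments for the Glue CreateCrawler API call."""
    return {
        "Name": crawler_name,
        "Role": role_name,  # IAM role name or ARN the crawler will assume
        "DatabaseName": database_name,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # No "Schedule" key is set, so the crawler only runs on demand.
    }


def create_database_and_crawler(glue, request):
    """Create the Data Catalog database, then the crawler that populates it.

    `glue` is assumed to be a boto3 Glue client, e.g. boto3.client("glue").
    """
    glue.create_database(DatabaseInput={"Name": request["DatabaseName"]})
    glue.create_crawler(**request)


# Placeholder values: the role name and bucket path below are hypothetical.
request = build_crawler_request(
    crawler_name="peculiar-crawler",
    role_name="your-glue-crawler-role",
    database_name="peculiar-data-catalog",
    s3_path="s3://your-bucket-name/",
)
```

To apply it, you would pass a real client, for example create_database_and_crawler(boto3.client("glue"), request), using credentials that are allowed to call Glue. Keeping the request in a plain dictionary makes it easy to review before anything is created.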

  • Review your configurations and select Finish to create the crawler.

  • You should be redirected to the AWS Glue dashboard. Find the crawler you just created, select it, and hit Run crawler. The crawler might take a few minutes to run. When it is done, it should report that a table has been added. Wait for your crawler to finish running.

  • On the left-side navigation bar, select Databases. You should see the database that you created in the Glue Data Catalog. Select it and then click the link to view the tables it contains.

  • You should see a table named after your S3 bucket. This lab uses the name peculiar_wizards_data_lake, but because S3 bucket names must be globally unique, your table name will be different.

  • Click into the table to view details about the table, like the data classification, the input and output format, object count, compression type, and other table properties.

You will also see the schema automatically inferred by the Glue crawler:
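The run-and-inspect steps above can also be sketched in Python. This is a hedged illustration rather than part of the lab: it assumes glue is a boto3 Glue client, the helper names are made up for this example, and it reads only the first page of tables, which should be enough for a small lab database.

```python
import time


def run_crawler_and_wait(glue, crawler_name, poll_seconds=15, sleep=time.sleep):
    """Start the crawler and block until it returns to the READY state."""
    glue.start_crawler(Name=crawler_name)
    while True:
        state = glue.get_crawler(Name=crawler_name)["Crawler"]["State"]
        if state == "READY":  # the other states are RUNNING and STOPPING
            return
        sleep(poll_seconds)


def table_schemas(glue, database_name):
    """Return {table_name: [(column_name, column_type), ...]} for a database.

    Assumption: only the first page of GetTables results is read.
    """
    response = glue.get_tables(DatabaseName=database_name)
    return {
        table["Name"]: [
            (column["Name"], column["Type"])
            for column in table["StorageDescriptor"]["Columns"]
        ]
        for table in response["TableList"]
    }
```

With a real client, you might call run_crawler_and_wait(glue, "peculiar-crawler") and then print the result of table_schemas(glue, "peculiar-data-catalog") to compare the column names and types with what the console shows.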

Congratulations! You have successfully used AWS Glue to create a crawler and populate a Glue Data Catalog to discover the data in your S3 data lake.