Datalake - Google Cloud Platform

Let's learn how to set up a Datalake using Google Cloud Platform (GCP). This includes setting up the service and submitting/retrieving data via a web interface.

The goals we have for this project are the following.

  1. Set Up Your Google Cloud Project
    1. Create a new project: Go to the Google Cloud Console and create a new project.
    2. Enable necessary APIs: Enable the App Engine Admin API and any other APIs you might need for your project.
  2. Create a Hello World Application
    1. Set up your development environment: Install the Google Cloud SDK and initialize it with your project.
    2. Create a simple Hello World application: You can use any programming language supported by GCP, such as Python, Node.js, or Java.
  3. Deploy the Application to App Engine
    1. Create an App Engine application: Use the gcloud app create command to create an App Engine application.
    2. Deploy your application: Use the gcloud app deploy command to deploy your Hello World application.
    3. Test your application: Once deployed, you'll get a URL to access your application. You can visit this URL to see your "Hello, World!" message.
  4. Submit and Retrieve Data via Web Interface
    1. Create a web interface: You can create a simple HTML form to submit data and a page to display retrieved data.
    2. Handle data submission: Use a POST endpoint in your application to handle data submission.
    3. Store and retrieve data: You can use Google Cloud Storage or Firestore to store and retrieve data.

Project Architecture

  1. Google Cloud Console:
    • Purpose: Central management interface where you create and manage projects, enable APIs, and deploy services.
    • Components:
      • Project Management
      • API & Services Dashboard
  2. App Engine:
    • Purpose: Hosts your Java-based "Hello Data" web application. Handles HTTP requests and serves the web interface.
    • Components:
      • Environment Configuration (app.yaml)
      • Deployment Services
  3. Java 21 Application:
    • Purpose: Main application logic handling data submission and retrieval. Deployed on App Engine.
    • Components:
      • Spring Boot Framework: Used for creating the RESTful API for data submission and retrieval.
      • Google Cloud Storage Client: Interacts with Google Cloud Storage for data storage and retrieval.
  4. Google Cloud Storage:
    • Purpose: Acts as the data lake, storing raw data files of various types.
    • Components:
      • Storage Buckets: Containers for your data.
      • Object Storage: Stores the actual data files (blobs).
  5. Data Visualization (Optional):
    • Purpose: Visualize the data stored in your data lake.
    • Components:
      • Google Data Studio: Tool for creating interactive dashboards.
      • Grafana: Tool for advanced data visualization (requires setup).

Developing the Project

  1. Accessing Google Cloud
    1. https://console.cloud.google.com/
    2. Create a New Project
      1. I named mine "Hello Data"
    3. Enable Required APIs
      1. Go to the Hamburger Menu
      2. Select "API's and Services"
        1. Enables APIs and services in Google Cloud and authenticate so your application to use them.
      3. Click "Enable API's and Services"
      4. Search for "App Engine Admin API" then click and enable this API
        1. Necessary for certain operations related to managing and deploying applications on Google App Engine.
      5. Search for "Cloud Storage API" then click and enable this API
        1. This allows you to store and retrieve large amounts of data in the cloud.
      6. Search for "Dataflow API" then click and enable this API
        1. Necessary for using Google Cloud Dataflow, which is a service for executing Apache Beam pipelines.
    4. Create Service Account Keys
      1. Go to URL: https://console.cloud.google.com/projectselector2/iam-admin/serviceaccounts?supportedpurview=project
      2. Select "Hello Data" project
      3. Find the service account "hello-data-441915@appspot.gserviceaccount.com"
      4. Click the three dots under "Action" and select "Manage Keys"
      5. Click button "Add Key" and select "Create New Key"
      6. Select Key Type "JSON" and click the "Create" button
      7. A dialog will come up to save the key. Save it to the Desktop for now. This is the "Private" key
      8. Create a .private folder in the root of your project and move this .json file into that folder
      9. Update your .gitignore file to include the following. This ensures the private key will not be committed to GitHub
        # Ignore .private folder
        .private
      10. Copy the private key from your local computer to the Google Cloud Storage Bucket
        gsutil cp .private/hello-data-441915-d2e498612496.json gs://hello_data_441915_bucket/.private/
    5. Create Storage Bucket
      1. Go to the Hamburger Menu
      2. Select "Cloud Storage"
      3. Click "Buckets"
      4. Click "Create" to make a new bucket.
        1. Name: hello-data
        2. Storage Class: Standard
        3. Location: us-central1
        4. Public Access: Subject to object ACLs
        5. Protection: Soft Delete
    6. Install and Configure Python (Required for Google Cloud SDK)
      1. Homebrew should already be installed on your Mac
      2. In a terminal type "brew install pyenv"
      3. In a terminal type "pyenv install 3.11.6" which is the current required version of Python for Google Cloud SDK
      4. In a terminal type "pyenv local 3.11.6" to ensure you are using the correct version fo Python
      5. In a terminal type "python --version" and versify you get "Python 3.11.6"
    7. Configure .zshrc to support the newly installed Python
      1. Modify .zshrc in your user root directory.
      2. Open .zshrc in a text editor
      3. Add the following in this file.
        # Python Configuration - # Pyenv initialization
        export PYENV_ROOT="$HOME/.pyenv"
        export PATH="$PYENV_ROOT/bin:$PATH"
        eval "$(pyenv init --path)"
        eval "$(pyenv init -)"
      4. Save the updated .zshrc
    8. Download and Set Up Google Cloud SDK
      1. In a browser, go to https://cloud.google.com/sdk/docs/install
      2. Choose Mac for your operating system
      3. If you are prompted to run a curl command, do that
    9. Verify the Google Cloud SDK installed properly
      1. In a terminal, type the following
        gcloud version
      2. You should see something like this
        Google Cloud SDK 502.0.0
        bq 2.1.9
        core 2024.11.15
        gcloud-crc32c 1.0.0
        gsutil 5.31
      3. Modify .zshrc in your user root directory.
      4. Open .zshrc in a text editor
      5. Add the following in this file.
        # Google Cloud Configuration
        # The next line updates PATH for the Google Cloud SDK.
        if [ -f '/Users/gregpaskal/google-cloud-sdk/path.zsh.inc' ]; then . '/Users/gregpaskal/google-cloud-sdk/path.zsh.inc'; fi
        
        # The next line enables shell command completion for gcloud.
        if [ -f '/Users/gregpaskal/google-cloud-sdk/completion.zsh.inc' ]; then . '/Users/gregpaskal/google-cloud-sdk/completion.zsh.inc'; fi
      6. Save the updates
      7. Reboot your computer.
    10. Setup the Development Environment on your local machine
      1. In a terminal type the following to Initialize the Google Cloud SDK (if not already done):
        gcloud init
    11. Authenticate with Google Cloud: Ensure you are authenticated:
      1. In a terminal type the following
        gcloud auth login
      2. You will be prompted to authenticate into your Google account
    12. Ensure you have the right project set
      1. In a terminal type the following.
      2. Ensure you have the right project ID (e.g. hello-data-441915)
        gcloud config set project hello-data-441915
      3. This should result in something like this in your terminal
        Updated property [core/project]
    13. Ensure the correct project ID is set
      1. In a terminal type the following
        gcloud config list
      2. You should see something like this in your terminal
        [core]
        account = gregpaskal@gmail.com
        disable_usage_reporting = True
        project = hello-data-441915
        Your active configuration is: [default]
    14. Grant the service account the necessary permissions to access the staging bucket
      1. In a terminal type the following command
        gcloud iam service-accounts list
      2. Look for something like the following in terminal
        DISPLAY NAME                        EMAIL                                                DISABLED
        Default compute service account     1010204813344-compute@developer.gserviceaccount.com  False
        App Engine default service account  hello-data-441915@appspot.gserviceaccount.com        False
      3. In the terminal, run the following command to grant the Storage Admin role to the service account
        gcloud projects add-iam-policy-binding hello-data-441915 --role roles/storage.admin --member serviceAccount:hello-data-441915@appspot.gserviceaccount.com
    15. Create a Hello Data Application (in VSCode)
      1. This is the structure of the project we are about to create
        Datalake-JAV-HelloData/
        ├── src/
        │   ├── main/
        │   │   ├── java/
        │   │   │   └── com/
        │   │   │       └── example/
        │   │   │           └── HelloDataApplication.java
        │   │   └── resources/
        │   │       └── application.properties
        └── pom.xml
        └── app.yaml
      2. Create a pom.xml file in the project root that targets Java 21, declares the Spring Boot web starter and the Google Cloud Storage client as dependencies, and configures the Spring Boot Maven plugin so that mvn clean package produces the target/hello-data-1.0-SNAPSHOT.jar referenced in app.yaml
      3. Create a file HelloDataApplication.java inside src/main/java/com/example with the following content
        package com.example;
        
        import com.google.cloud.storage.Blob;
        import com.google.cloud.storage.Bucket;
        import com.google.cloud.storage.Storage;
        import com.google.cloud.storage.StorageOptions;
        import org.springframework.boot.SpringApplication;
        import org.springframework.boot.autoconfigure.SpringBootApplication;
        import org.springframework.web.bind.annotation.GetMapping;
        import org.springframework.web.bind.annotation.PostMapping;
        import org.springframework.web.bind.annotation.RequestBody;
        import org.springframework.web.bind.annotation.RequestParam;
        import org.springframework.web.bind.annotation.RestController;
        
        import java.nio.charset.StandardCharsets;
        import java.util.ArrayList;
        import java.util.List;
        
        @SpringBootApplication
        public class HelloDataApplication {
            public static void main(String[] args) {
                SpringApplication.run(HelloDataApplication.class, args);
            }
        }
        
        @RestController
        class HelloDataController {
        
            private final Storage storage = StorageOptions.getDefaultInstance().getService();
            private final String bucketName = "YOUR_BUCKET_NAME";
        
            @PostMapping("/submit")
            public String submitData(@RequestBody String data) {
                Bucket bucket = storage.get(bucketName);
                Blob blob = bucket.create("data/" + System.currentTimeMillis() + ".txt", data.getBytes(StandardCharsets.UTF_8));
                return "Data submitted: " + data;
            }
        
            @GetMapping("/data")
            public List<String> getData() {
                Bucket bucket = storage.get(bucketName);
                List<String> dataList = new ArrayList<>();
                for (Blob blob : bucket.list(Storage.BlobListOption.prefix("data/")).iterateAll()) {
                    dataList.add(new String(blob.getContent(), StandardCharsets.UTF_8));
                }
                return dataList;
            }
        }
      4. Create a file application.properties inside src/main/resources with the following content
        spring.main.web-application-type=servlet
      5. In the root directory of your project, create a file named app.yaml with the following content
        runtime: java21
        entrypoint: 'java -jar target/hello-data-1.0-SNAPSHOT.jar'
        
        handlers:
        - url: /.*
          script: auto
    16. Build your project
      mvn clean package
    17. Deploy Your Application to App Engine (which resides on the Google Cloud)
      1. We will use the Google Cloud SDK which you installed earlier to perform this.
      2. In a terminal type the following
        gcloud app deploy
      3. Visit your deployed application: Once the deployment is complete, you'll get a URL to access your application (https://hello-data-441915.uc.r.appspot.com/). Visit this URL to see your "Hello Data" form, or exercise the endpoints from a script as shown in the sketch after these steps.
    18. View and monitor your logs
      1. In a terminal, type the following
        gcloud app logs tail -s default
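
Once the app is deployed, you can also exercise the two endpoints from a script rather than the browser, as referenced in the deployment step above. Below is a minimal Python sketch using only the standard library; it assumes the App Engine URL shown above and the /submit and /data routes defined in HelloDataApplication.java.

  # Minimal smoke test for the deployed Hello Data endpoints.
  # Assumes the App Engine URL below and the /submit (POST) and /data (GET)
  # routes defined in HelloDataApplication.java.
  import json
  import urllib.request

  BASE_URL = "https://hello-data-441915.uc.r.appspot.com"

  def submit(data: str) -> str:
      """POST a raw string to /submit; the app writes it into the bucket."""
      req = urllib.request.Request(
          f"{BASE_URL}/submit",
          data=data.encode("utf-8"),
          headers={"Content-Type": "text/plain"},
          method="POST",
      )
      with urllib.request.urlopen(req) as resp:
          return resp.read().decode("utf-8")

  def fetch_all() -> list:
      """GET /data, which returns a JSON array of the stored strings."""
      with urllib.request.urlopen(f"{BASE_URL}/data") as resp:
          return json.loads(resp.read().decode("utf-8"))

  if __name__ == "__main__":
      print(submit("Hello from the smoke test"))
      print(fetch_all())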

Setting up Google Cloud Buckets and Folders

A Google Cloud bucket is essentially a storage container for content. Buckets can contain folders to organize the items in them. For the Hello Data project, I created hello_data_441915_bucket. At this point, I've learned it's a good idea to use underscores in bucket names and to keep things lowercase as well. The console steps are below; a scripted alternative is sketched after them.

  1. In Google Cloud, ensure you have the right project selected (Hello Data)
  2. Click on the Hamburger menu and select "Cloud Storage"
  3. From the Cloud Storage screen, click "Buckets"
  4. Click the "Create" button
  5. Give your bucket a unique name like "hello_data_441915_bucket_v2" and click the "Continue" button
  6. Select the "us-central1" option and click "Continue" button
  7. Select "Set a default class" and "Standard" option and click "Continue" button
  8. Check "Enforce public access prevention on this bucket"
  9. Select "Uniform" option and click the "Continue" button
  10. Check "Soft delete policy" and "Use default retention duration" option
  11. Click the "Create" button.
  12. If you get a dialog regarding public access being prevented, ensure "Enforce public access prevention" is checked and click the "Confirm" button.
  13. With your new bucket created, let's now add some folders.
  14. Click "Create Folder" and name it "javascript_transformer" and click the "Create" button
  15. Click "Create Folder" and name it "pre_processed" and click the "Create" button
  16. Click "Create Folder" and name it "schemas_bigquery" and click the "Create" button
  17. Click "Create Folder" and name it "temp" and click the "Create" button

Uploading data to Google Bucket - Manually

To begin testing your Data Lake solution, consider uploading some data manually. You can use this approach to ensure some basic moving parts are working.

  1. In Google Cloud, ensure you have the right project selected (Hello Data)
  2. Click on the Hamburger menu and select "Cloud Storage"
  3. From the Cloud Storage screen, click "Buckets"
  4. Click the bucket you want to upload data into (e.g. hello_data_441915_bucket_v2)
  5. Click the "Upload" button
  6. Select a sample data csv file (e.g. test_results_01.csv)
  7. Verify csv file was uploaded

Uploading data to Google Bucket - Google Cloud SDK

We will now upload sample data to the Google Bucket using the Google Cloud SDK. This ensures your computer is configured correctly for this task. (A Python client alternative is sketched after these steps.)

  1. In a terminal type the following
    gcloud info | grep "Python"
  2. You should see some text like the following
    Python Version: [3.11.6 (main, Nov 29 2024, 05:43:21) [Clang 16.0.0 (clang-1600.0.26.4)]]
    Python Location: [/Users/gregpaskal/.pyenv/versions/3.11.6/bin/python3]
    Python PATH: [/Users/gregpaskal/google-cloud-sdk/lib/third_party:/Users/gregpaskal/google-cloud-sdk/lib:/Users/gregpaskal/.pyenv/versions/3.11.6/lib/python311.zip:/Users/gregpaskal/.pyenv/versions/3.11.6/lib/python3.11:/Users/gregpaskal/.pyenv/versions/3.11.6/lib/python3.11/lib-dynload]
  3. Use Google Cloud SDK to determine what buckets are available by typing the following in a terminal
    gsutil ls
  4. You should see something like the following
    gs://hello-data-441915.appspot.com/
    gs://staging.hello-data-441915.appspot.com/
  5. Now that you know the buckets, try to upload a sample file
    gsutil cp sample_data/test_results_02.csv gs://hello-data-441915.appspot.com/
  6. You should see something like the following
    Copying file://sample_data/test_results_02.csv [Content-Type=text/csv]...
    / [1 files][  676.0 B/  676.0 B]                                                
    Operation completed over 1 objects/676.0 B.
  7. Check your bucket on Google Cloud storage to confirm file uploaded.
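
The same upload can be done from Python with the google-cloud-storage client, the alternative referenced above, which becomes useful once you start automating ingestion. A minimal sketch, assuming the default App Engine bucket and sample file from the gsutil commands above:

  # Sketch: upload a sample CSV and list the bucket contents with the Python
  # client (equivalent to the gsutil cp / gsutil ls commands above).
  from google.cloud import storage

  BUCKET_NAME = "hello-data-441915.appspot.com"   # bucket from the gsutil example
  LOCAL_FILE = "sample_data/test_results_02.csv"  # sample file from the gsutil example

  client = storage.Client()
  bucket = client.bucket(BUCKET_NAME)

  # Upload the local CSV; the blob name becomes the object path in the bucket.
  blob = bucket.blob("test_results_02.csv")
  blob.upload_from_filename(LOCAL_FILE, content_type="text/csv")
  print(f"Uploaded {LOCAL_FILE} to gs://{BUCKET_NAME}/{blob.name}")

  # List the bucket to confirm the upload, like gsutil ls.
  for item in client.list_blobs(BUCKET_NAME):
      print(f"gs://{BUCKET_NAME}/{item.name} ({item.size} bytes)")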

Create a BigQuery Dataset and Table

  1. In the Google Cloud Console, go to the Hamburger menu
  2. Look for BigQuery
  3. Look for your Project ID (e.g. hello-data-441915)
  4. Click the three dots next to it and select "Create dataset"
  5. Dataset ID: hello_data_dataset (Or your preferred dataset ID)
  6. Location type: Multi-region
  7. Multi-region: US (Multiple regions in United States)
  8. Leave everything else as is.
  9. Click "Create Dataset" button
  10. You should now see your dataset.
  11. Click three dots next to newly created dataset and select "Create Table"
  12. In "Source" section, select "Create table from" and choose "Empty table"
  13. In Destination section, Project should already be set (e.g. hello-data-441915)
  14. In Destination section, Dataset should already be set (e.g. hello_data_dataset)
  15. In Destination section, Table set this to a unique name (e.g. hello_data_json)
  16. In Destination section, Table Type should already be set to "Native table"
  17. Leave the rest as is
  18. Click "Create Table" button.

Data Processing Pipeline

We will now create a Data Processing Pipelines Using Google Cloud Dataflow. The purpose of this is to automate data ingestion, transformation, and loading of the data.

  1. In the Google Cloud Console, go to the Hamburger menu
  2. Look for Dataflow (you may need to search for it if it's not under the hamburger menu)
  3. Click on the three dots and select "Create Job from Template"
  4. Job Name: "hello_data_lc64"
  5. Regional endpoint: "us-central1 (Iowa)"
  6. Dataflow template: "Text Files on Cloud Storage to BigQuery"
  7. Source: "hello-data-441915.appspot.com/pre_processed/*.csv" or "hello-data-441915.appspot.com/pre_processed/*.txt" or "hello-data-441915.appspot.com/raw/test_text_01.json"
  8. Target: "hello-data-441915.appspot.com/BigQuery_Table_Schema/hello-data.json"
  9. BigQuery output table: "hello-data-441915:hello_data_dataset.hello_data_table"
  10. Temporary directory for BigQuery loading process: "hello-data-441915.appspot.com/temp/"
  11. Required Parameters, Temporary location: "hello-data-441915.appspot.com/temp/"
  12. Encryption: Google-managed encryption key (Selected)
  13. Dataflow Prime: Enable Dataflow Prime (unchecked)
  14. Click "Run Job"
  15. You should see a job graph
  • Performing more from the command line

    Once I got to this stage of working with the Google Cloud tools, I realized it was much easier to do much of this work from the command line. I am going to include a number of the files and commands I ran here that enabled me to work with both JSON-based and CSV-based files.

    When working with JSON-based data, these were common commands I used.

    # The following are commands you will use when interacting with Google Cloud Storage and BigQuery.
    
    # Copy JSON schema to Google Cloud Storage schemas_bigquery bucket
    gsutil cp sample_data/json_based_data_schema.json gs://hello_data_441915_bucket_v2/schemas_bigquery/
    
    # Copy JSON javascript transform to Google Cloud Storage javascript_transform bucket
    gsutil cp sample_data/json_based_data_javascript_transformer.js gs://hello_data_441915_bucket_v2/javascript_transformer/
    
    # Copy JSON test data 01 to Google Cloud Storage pre_processed bucket
    gsutil cp sample_data/json_based_data_sample_01.json gs://hello_data_441915_bucket_v2/pre_processed/
    
    # Copy JSON test data 02 to Google Cloud Storage pre_processed bucket
    gsutil cp sample_data/json_based_data_sample_02.json gs://hello_data_441915_bucket_v2/pre_processed/
    
    # Remove all files from Google Cloud Storage pre_processed bucket
    gsutil rm gs://hello_data_441915_bucket_v2/pre_processed/\*.\*
    
    # Kick off a Google Cloud Dataflow job to process the JSON data and load it into BigQuery
    gcloud dataflow jobs run hello_data_lc64 \
        --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
        --region us-central1 \
        --parameters \
    inputFilePattern=gs://hello_data_441915_bucket_v2/pre_processed/\*.json,\
    JSONPath=gs://hello_data_441915_bucket_v2/schemas_bigquery/json_based_data_schema.json,\
    bigQueryLoadingTemporaryDirectory=gs://hello_data_441915_bucket_v2/temp/,\
    javascriptTextTransformGcsPath=gs://hello_data_441915_bucket_v2/javascript_transformer/json_based_data_javascript_transformer.js,\
    outputTable=hello-data-441915:hello_data_dataset.hello_data_json,\
    javascriptTextTransformFunctionName=process
    • Data Schema for this data "json_based_data_schema.json"
      {
        "BigQuery Schema": [
          {
            "name": "line",
            "type": "STRING",
            "mode": "REQUIRED"
          }
        ]
      }
    • Javascript transformer for this data "json_based_data_javascript_transformer.js"
      function process(inJson) {
          return JSON.stringify({ "line": inJson });
      }
    • JSON formatted data for ingestion
      {"line": "Hello, world!"}
      {"line": "Dataflow is great."}
      {"line": "Transform this text."}

    When working with CSV-based data, these were common commands I used.

    # The following are commands you will use when interacting with Google Cloud Storage and BigQuery.
    
    # Copy CSV schema to Google Cloud Storage schemas_bigquery bucket
    gsutil cp sample_data/csv_based_data_schema.json gs://hello_data_441915_bucket_v2/schemas_bigquery/
    
    # Copy CSV javascript transformer to Google Cloud Storage javascript_transform bucket
    gsutil cp sample_data/csv_based_data_javascript_transformer.js gs://hello_data_441915_bucket_v2/javascript_transformer/
    
    # Copy CSV test data 01 to Google Cloud Storage pre_processed bucket
    gsutil cp sample_data/csv_based_data_sample_01.csv gs://hello_data_441915_bucket_v2/pre_processed/
    
    # Copy CSV test data 02 to Google Cloud Storage pre_processed bucket
    gsutil cp sample_data/csv_based_data_sample_02.csv gs://hello_data_441915_bucket_v2/pre_processed/
    
    # Remove all files from Google Cloud Storage pre_processed bucket
    gsutil rm gs://hello_data_441915_bucket_v2/pre_processed/\*.\*
    
    # Kick off a Google Cloud Dataflow job to process the CSV data and load it into BigQuery
    gcloud dataflow jobs run hello_data_lc64 \
        --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
        --region us-central1 \
        --parameters \
    inputFilePattern=gs://hello_data_441915_bucket_v2/pre_processed/\*.csv,\
    JSONPath=gs://hello_data_441915_bucket_v2/schemas_bigquery/csv_based_data_schema.json,\
    bigQueryLoadingTemporaryDirectory=gs://hello_data_441915_bucket_v2/temp/,\
    javascriptTextTransformGcsPath=gs://hello_data_441915_bucket_v2/javascript_transformer/csv_based_data_javascript_transformer.js,\
    outputTable=hello-data-441915:hello_data_dataset.hello_data_csv,\
    javascriptTextTransformFunctionName=process
    • Data Schema for this data "csv_based_data_schema.json"
      {
        "BigQuery Schema": [
          {
            "name": "action",
            "type": "STRING",
            "mode": "REQUIRED"
          },
          {
            "name": "id",
            "type": "INTEGER",
            "mode": "REQUIRED"
          },
          {
            "name": "test_run_id",
            "type": "INTEGER",
            "mode": "REQUIRED"
          },
          {
            "name": "test_suite_name",
            "type": "STRING",
            "mode": "REQUIRED"
          },
          {
            "name": "test_case_name",
            "type": "STRING",
            "mode": "REQUIRED"
          },
          {
            "name": "test_case_result",
            "type": "STRING",
            "mode": "REQUIRED"
          },
          {
            "name": "test_case_duration",
            "type": "FLOAT",
            "mode": "REQUIRED"
          }
        ]
      }
    • Javascript transformer for this data "csv_based_data_javascript_transformer.js"
      function process(inJson) {
        // Split the input string by comma (assuming CSV data)
        const parts = inJson.split(',');
      
        // Check if the first column is "action" or "skip" (case insensitive)
        const action = parts[0].toLowerCase();
        if (action === 'action' || action === 'skip') {
          return null; // Skip this row
        }
      
        // Create a JSON object with the correct keys and values
        return JSON.stringify({
          action: parts[0],
          id: parseInt(parts[1]),
          test_run_id: parseInt(parts[2]),
          test_suite_name: parts[3],
          test_case_name: parts[4],
          test_case_result: parts[5],
          test_case_duration: parseFloat(parts[6])
        });
      }
    • CSV formatted data for ingestion
      action,id,test_run_id,test_suite_name,test_case_name,test_case_result,test_case_duration
      execute,1,101,User Login,Verify Login with Valid Credentials,pass,5.1
      execute,2,101,User Login,Verify Login with Invalid Credentials,fail,2.3
      execute,3,102,Checkout Process,Verify Cart Addition,pass,3.8
      execute,4,102,Checkout Process,Verify Payment,pass,4.6
      execute,5,103,Search Functionality,Verify Search Results,fail,1.7
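
    Once a Dataflow job reports success, a quick way to confirm rows actually landed is to count them in the output tables from the two gcloud commands above. A minimal sketch with the google-cloud-bigquery client; it assumes both tables already exist, so drop whichever one you have not loaded yet.

      # Sketch: confirm the Dataflow jobs loaded rows into the output tables.
      from google.cloud import bigquery

      PROJECT_ID = "hello-data-441915"
      client = bigquery.Client(project=PROJECT_ID)

      for table in ("hello_data_json", "hello_data_csv"):
          query = f"SELECT COUNT(*) AS row_count FROM `{PROJECT_ID}.hello_data_dataset.{table}`"
          result = list(client.query(query).result())
          print(f"{table}: {result[0].row_count} rows")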

    Triggering Dataflow jobs automatically

    Eventually you are going to want to automate the process of ingesting the data uploaded to Google Cloud buckets and loading that data into BigQuery. The following steps will walk you through that process.

    1. Create a Pub/Sub Topic
      1. Go to Pub/Sub at https://console.cloud.google.com/cloudpubsub/
      2. Click "Create Topic" button
      3. Topic ID: hello_data_topic
      4. Add a default subscription: Checked
      5. Use a schema: Unchecked
      6. Enable ingestion: Unchecked
      7. Enable message retention: Unchecked
      8. Export message data to BigQuery: Unchecked
      9. Backup message data to Cloud Storage: Unchecked
      10. Encryption
        1. Select "Google-managed encryption key"
      11. Click "Create" button
    2. Configure Cloud Storage to Publish Notifications to Pub/Sub
      1. It's important to know that notifications within Google Cloud can only (at this time) be added via the command line; there is no GUI support for notifications. You can use a number of commands to work with notifications.
        1. List all notifications for a specific bucket
          gsutil notification list gs://hello_data_441915_bucket_v2
        2. Delete a notification from a specific bucket
          gsutil notification delete projects/_/buckets/hello_data_441915_bucket_v2/notificationConfigs/1
      2. In terminal, perform the following to create the notification.
        gsutil notification create -t hello_data_topic -f json gs://hello_data_441915_bucket_v2
      3. Verify the notification was created
        gsutil notification list gs://hello_data_441915_bucket_v2
      4. In a folder named "python-functions" create a file named "main.py" with the following contents
        import os
        from google.cloud import bigquery
        from google.cloud import storage
        
        PROJECT_ID = os.getenv('GCP_PROJECT')
        DATASET_ID = 'hello_data_dataset'
        TABLE_ID = 'hello_data_csv'
        TEMP_BUCKET = 'hello_data_441915_bucket_v2'
        TEMP_LOCATION = f'gs://{TEMP_BUCKET}/temp/'
        
        def trigger_dataflow(event, context):
            file_name = event['name']
            bucket_name = event['bucket']
        
            if file_name.startswith('pre_processed/') and file_name.endswith('.csv'):
                load_csv_to_bigquery(bucket_name, file_name)
        
        def load_csv_to_bigquery(bucket_name, file_name):
            client = bigquery.Client()
            table_ref = client.dataset(DATASET_ID).table(TABLE_ID)
        
            job_config = bigquery.LoadJobConfig(
                source_format=bigquery.SourceFormat.CSV,
                skip_leading_rows=1,
                autodetect=True,
                write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
            )
        
            uri = f'gs://{bucket_name}/{file_name}'
            load_job = client.load_table_from_uri(
                uri,
                table_ref,
                location='US',
                job_config=job_config,
            )
        
            load_job.result()  # Waits for the job to complete.
        
            print(f'Loaded {load_job.output_rows} rows into {DATASET_ID}:{TABLE_ID}.')
        
            delete_file(bucket_name, file_name)
        
        def delete_file(bucket_name, file_name):
            storage_client = storage.Client()
            bucket = storage_client.bucket(bucket_name)
            blob = bucket.blob(file_name)
            blob.delete()
            print(f'Deleted file: gs://{bucket_name}/{file_name}')
      5. In a folder named "python-functions" create a file named "requirements.txt" with the following contents
        google-cloud-bigquery
        google-cloud-storage
      6. Assign the Eventarc Service Agent Role
        gcloud projects add-iam-policy-binding hello-data-441915 \
            --member="serviceAccount:service-1010204813344@gcp-sa-eventarc.iam.gserviceaccount.com" \
            --role="roles/eventarc.serviceAgent"
      7. Grant the Pub/Sub Publisher Role
        gcloud projects add-iam-policy-binding hello-data-441915 \
            --member="serviceAccount:service-1010204813344@gs-project-accounts.iam.gserviceaccount.com" \
            --role="roles/pubsub.publisher"
      8. Run the Deployment Command
        gcloud functions deploy trigger_dataflow \
            --runtime python310 \
            --trigger-resource hello_data_441915_bucket_v2 \
            --trigger-event google.storage.object.finalize \
            --set-env-vars GCP_PROJECT=hello-data-441915 \
            --entry-point trigger_dataflow \
            --region us-central1
      9. Pick it up here - I am getting a schema mismatch between the uploaded data and what BigQuery expects. I don't get this when I run the job manually, so I suspect it might have to do with the Python file. - GP 12/8/2024
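
    One way to narrow down a mismatch like the one noted above is to run the handler locally with a synthetic storage event, outside Cloud Functions and Pub/Sub, so any BigQuery error surfaces directly in your terminal. A minimal sketch, assuming the main.py above, the bucket and sample file names used earlier, and working Application Default Credentials:

      # Sketch: exercise trigger_dataflow locally with a synthetic storage event.
      # Run from the python-functions folder with Application Default Credentials.
      # Warning: this performs a real BigQuery load and then deletes the object,
      # exactly like the deployed function.
      import os

      os.environ.setdefault("GCP_PROJECT", "hello-data-441915")

      from main import trigger_dataflow

      fake_event = {
          "bucket": "hello_data_441915_bucket_v2",
          "name": "pre_processed/csv_based_data_sample_01.csv",
      }

      # The handler never touches the context argument, so None is fine here.
      trigger_dataflow(fake_event, None)

    One difference worth comparing while debugging: the function loads with autodetect=True, while the manual Dataflow job supplies the explicit JSON schema file, so the two paths may not infer the same column types.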

    Visualizing the data using Google Data Studio

    Now that data is in BigQuery, we will do a few visualizations based on this data in Google Data Studio.

    1. Go to Looker Studio - https://lookerstudio.google.com/
    2. Click "Blank Report" button
    3. In Add Data to Report select "BigQuery"
    4. Click "My Projects" and select the following
      1. Project: "Hello Data"
      2. Dataset: "hello_data_dataset"
      3. Table: "hello_data_csv"
    5. Click "Add" button.
    6. When you see "You are about to add data to this report" click "Add to Report" button

    Now let's add a bar chart

    1. Click on "Add a chart" and select "Bar Chart"
    2. Size and locate the bar chart where you want it.
    3. Under "Chart" look for "Dimensions" and select "test_suite_name"
    4. Under "Chart" look for "Metrics" and click "Add Metric" and select "test_case_result"

    Querying a Data Lake

    Querying a data lake involves several steps to retrieve and process data stored in its raw, unstructured, or semi-structured format. Here’s a high-level overview of the process:

    1. Data Ingestion

    • Data Collection: Data is collected from various sources such as databases, IoT devices, social media, logs, etc.
    • Storage: The raw data is ingested into the data lake, where it is stored in its original format without any transformation.

    2. Data Cataloging

    • Metadata Tagging: Metadata is added to the stored data to make it searchable and manageable.
    • Indexing: The data lake indexes the data to improve query performance and data retrieval speed.

    3. Data Querying

    • Query Execution: Users write queries to request specific data from the data lake.
    • Processing: The data lake processes the query, retrieving the relevant data based on the query parameters.
    • Transformation: The retrieved data may be transformed or processed further to fit the needs of the analysis or application.

    4. Data Analysis

    • Analytics Tools: Users can use various analytics tools and frameworks (e.g., Apache Spark, Hadoop) to analyze the data.
    • Visualization: The results of the analysis can be visualized using dashboards, reports, or other visualization tools.

    5. Data Consumption

    • Access: The processed and analyzed data is made available to end-users, applications, or other systems for further use.

    Metadata Tagging

    A data lake derives its metadata through a process known as metadata management, which involves capturing, cataloging, and organizing metadata about the ingested data. This metadata is crucial for making the data searchable, manageable, and useful for analysis. Here's how this process typically works:

    1. Data Ingestion

    When data is ingested into the data lake, metadata is often captured automatically. This can include:

    • Technical Metadata: Information about the data's format, size, creation date, and source.
    • Operational Metadata: Details about data processing events, such as when and how the data was ingested and any transformations applied.

    2. Metadata Cataloging

    Once the metadata is captured, it is cataloged and stored in a metadata repository. This repository is often referred to as a data catalog. Tools and frameworks like Apache Atlas, AWS Glue, or Google Cloud Data Catalog are commonly used for this purpose.

    3. Metadata Types

    The metadata captured can be broadly classified into three types:

    • Descriptive Metadata: Provides context about the data, such as its purpose, origin, and characteristics.
    • Structural Metadata: Describes the structure of the data, such as schema definitions, data types, and relationships between different data entities.
    • Administrative Metadata: Information about the data's management, including access permissions, usage policies, and audit logs.

    4. Metadata Enrichment

    In addition to automatically captured metadata, data lakes can also incorporate enriched metadata to provide more context and value:

    • Business Metadata: Tags, labels, and descriptions that align the data with business terms and definitions, making it easier for users to understand and use the data.
    • User-Generated Metadata: Annotations, comments, and ratings provided by users who interact with the data, contributing to collaborative data governance.

    5. Search and Discovery

    The metadata catalog allows users to search and discover data within the data lake. This includes:

    • Indexing: Creating indexes for metadata to enable fast search and retrieval.
    • Tagging: Associating tags with data sets to classify and group related data.
    • Querying: Enabling users to query the metadata catalog to find specific data sets based on their attributes.

    Google Cloud SDK Overview

    Google Cloud SDK is a collection of tools and libraries that allow you to interact with Google Cloud services directly from your command line. It includes tools like:

    • gcloud: The main CLI tool for interacting with various Google Cloud services.
    • gsutil: A CLI tool for working with Google Cloud Storage.
    • bq: A CLI tool for interacting with BigQuery.

    Google Cloud CLI (gcloud)

    Google Cloud CLI (gcloud) is the command-line interface that is part of the Google Cloud SDK. It allows you to manage and configure Google Cloud resources. Some common commands include:

    • gcloud init: Initializes the SDK, setting up authentication and configuration.
    • gcloud auth login: Authenticates your Google Cloud account.
    • gcloud config set project [PROJECT_ID]: Sets the default project.
    • gcloud app deploy: Deploys your application to Google App Engine.
    • gcloud compute instances list: Lists all Compute Engine instances in your project.

    Reference URLs

    Hello Data - Spring Boot Application - Home: https://hello-data-441915.uc.r.appspot.com/

    Hello Data - Spring Boot Application - Upload: https://hello-data-441915.uc.r.appspot.com/uploadForm

    Hello Data - Google Cloud - App Engine: https://console.cloud.google.com/appengine?referrer=search&project=hello-data-441915&serviceId=default

    Hello Data - Google Cloud - Welcome: https://console.cloud.google.com/welcome/new?pli=1&project=hello-data-441915

    Hello Data - Google Cloud - Console: https://console.cloud.google.com/billing/0147A5-3C03EE-E8744F?project=hello-data-441915

    Hello Data - Google Cloud - App Engine: https://console.cloud.google.com/appengine/start?project=hello-data-441915

    Hello Data - Google Cloud - Dashboard: https://console.cloud.google.com/home/dashboard?invt=Abiung&project=hello-data-441915

    Hello Data - Google Cloud - Dataflow: https://console.cloud.google.com/dataflow/jobs?referrer=search&project=hello-data-441915

    Hello Data - Google Cloud - Logs Explorer: https://console.cloud.google.com/logs/query;query=%2528logName%20%3D%20%22projects%2Fhello-data-441915%2Flogs%2Fcloudaudit.googleapis.com%252Factivity%22%20OR%20logName%20%3D%20%22projects%2Fhello-data-441915%2Flogs%2Fcloudaudit.googleapis.com%252Fdata_access%22%20OR%20labels.activity_type_name:*%2529;cursorTimestamp=2024-11-28T21:55:03.821902Z;duration=P7D?invt=Abhvpg&project=hello-data-441915&walkthrough_id=panels--logging--query

    Hello Data - Google Cloud - Cloud Storage - Buckets: https://console.cloud.google.com/storage/browser?invt=Abhvpg&project=hello-data-441915&prefix=&forceOnBucketsSortingFiltering=true

    Hello Data - Google Cloud - Cloud Storage - Monitoring: https://console.cloud.google.com/storage/monitoring?invt=Abhvpg&project=hello-data-441915

    Hello Data - Google Cloud - Looker Studio (same as Data Studio): https://lookerstudio.google.com/

    Hello Data - Google Cloud - My First Report: https://lookerstudio.google.com/reporting/d86b0295-b7fb-4447-b1a8-d302228a35e1/page/QRUXE

    Cloud Run Functions - https://console.cloud.google.com/functions/details/us-central1/trigger_dataflow?project=hello-data-441915