Datalake - Google Cloud Platform

Let's learn how to set up a Datalake using Google Cloud Platform (GCP). This includes setting up the service and submitting/retrieving data via a web interface.

The goals we have for this project are the following.

  1. Set Up Your Google Cloud Project
    1. Create a new project: Go to the Google Cloud Console and create a new project.
    2. Enable necessary APIs: Enable the App Engine Admin API and any other APIs you might need for your project.
  2. Create a Hello World Application
    1. Set up your development environment: Install the Google Cloud SDK and initialize it with your project.
    2. Create a simple Hello World application: You can use any programming language supported by GCP, such as Python, Node.js, or Java.
  3. Deploy the Application to App Engine
    1. Create an App Engine application: Use the gcloud app create command to create an App Engine application.
    2. Deploy your application: Use the gcloud app deploy command to deploy your Hello World application.
    3. Test your application: Once deployed, you'll get a URL to access your application. You can visit this URL to see your "Hello, World!" message.
  4. Submit and Retrieve Data via Web Interface
    1. Create a web interface: You can create a simple HTML form to submit data and a page to display retrieved data.
    2. Handle data submission: Use a POST endpoint in your application to handle data submission.
    3. Store and retrieve data: You can use Google Cloud Storage or Firestore to store and retrieve data.

Project Architecture

  1. Google Cloud Console:
    • Purpose: Central management interface where you create and manage projects, enable APIs, and deploy services.
    • Components:
      • Project Management
      • API & Services Dashboard
  2. App Engine:
    • Purpose: Hosts your Java-based "Hello Data" web application. Handles HTTP requests and serves the web interface.
    • Components:
      • Environment Configuration (app.yaml)
      • Deployment Services
  3. Java 21 Application:
    • Purpose: Main application logic handling data submission and retrieval. Deployed on App Engine.
    • Components:
      • Spring Boot Framework: Used for creating the RESTful API for data submission and retrieval.
      • Google Cloud Storage Client: Interacts with Google Cloud Storage for data storage and retrieval.
  4. Google Cloud Storage:
    • Purpose: Acts as the data lake, storing raw data files of various types.
    • Components:
      • Storage Buckets: Containers for your data.
      • Object Storage: Stores the actual data files (blobs).
  5. Data Visualization (Optional):
    • Purpose: Visualize the data stored in your data lake.
    • Components:
      • Google Data Studio: Tool for creating interactive dashboards.
      • Grafana: Tool for advanced data visualization (requires setup).

Developing the Project

  1. Accessing Google Cloud
    1. https://console.cloud.google.com/
    2. Create a New Project
      1. I named mine "Hello Data"
    3. Enable Required APIs
      1. Go to the Hamburger Menu
      2. Select "API's and Services"
        1. Enables APIs and services in Google Cloud and authenticate so your application to use them.
      3. Click "Enable API's and Services"
      4. Search for "App Engine Admin API" then click and enable this API
        1. Necessary for certain operations related to managing and deploying applications on Google App Engine.
      5. Search for "Cloud Storage API" then click and enable this API
        1. This allows you to store and retrieve large amounts of data in the cloud.
      6. Search for "Dataflow API" then click and enable this API
        1. Necessary for using Google Cloud Dataflow, which is a service for executing Apache Beam pipelines.
    4. Create Service Account Keys
      1. Go to URL: https://console.cloud.google.com/projectselector2/iam-admin/serviceaccounts?supportedpurview=project
      2. Select "Hello Data" project
      3. Find the service account "hello-data-441915@appspot.gserviceaccount.com"
      4. Click the three dots under "Action" and select "Manage Keys"
      5. Click button "Add Key" and select "Create New Key"
      6. Select Key Type "JSON" and click the "Create" button
      7. A dialog will come up to save the key. Save it to the Desktop for now. This is the "Private" key
      8. Create a .private folder in the root of your project and move this .json file into that folder
      9. Update your .gitignore file to include the following. This ensures the private key will not be committed to GitHub
        # Ignore .private folder
        .private
      10. Copy the private key from your local computer to the Google Cloud Storage Bucket
        gsutil cp .private/hello-data-441915-d2e498612496.json gs://hello_data_441915_bucket/.private/
    5. Create Storage Bucket
      1. Go to the Hamburger Menu
      2. Select "Cloud Storage"
      3. Click "Buckets"
      4. Click "Create" to make a new bucket.
        1. Name: hello-data
        2. Storage Class: Standard
        3. Location: us-central1
        4. Public Access: Subject to object ACLs
        5. Protection: Soft Delete
    6. Install and Configure Python (Required for Google Cloud SDK)
      1. Homebrew should already be installed on your Mac
      2. In a terminal type "brew install pyenv"
      3. In a terminal type "pyenv install 3.11.6" which is the current required version of Python for Google Cloud SDK
      4. In a terminal type "pyenv local 3.11.6" to ensure you are using the correct version fo Python
      5. In a terminal type "python --version" and versify you get "Python 3.11.6"
    7. Configure .zshrc to support the newly installed Python
      1. Modify .zshrc in your user root directory.
      2. Open .zshrc in a text editor
      3. Add the following in this file.
        # Python Configuration - # Pyenv initialization
        export PYENV_ROOT="$HOME/.pyenv"
        export PATH="$PYENV_ROOT/bin:$PATH"
        eval "$(pyenv init --path)"
        eval "$(pyenv init -)"
      4. Save the updated .zshrc
    8. Download and Set Up Google Cloud SDK
      1. In a browser, go to https://cloud.google.com/sdk/docs/install
      2. Choose Mac for your operating system
      3. If you are prompted to run a curl command, do that
    9. Verify the Google Cloud SDK installed properly
      1. In a terminal, type the following
        gcloud version
      2. You should see something like this
        Google Cloud SDK 502.0.0
        bq 2.1.9
        core 2024.11.15
        gcloud-crc32c 1.0.0
        gsutil 5.31
      3. Modify .zshrc in your user root directory.
      4. Open .zshrc in a text editor
      5. Add the following in this file.
        # Google Cloud Configuration
        # The next line updates PATH for the Google Cloud SDK.
        if [ -f '/Users/gregpaskal/google-cloud-sdk/path.zsh.inc' ]; then . '/Users/gregpaskal/google-cloud-sdk/path.zsh.inc'; fi
        
        # The next line enables shell command completion for gcloud.
        if [ -f '/Users/gregpaskal/google-cloud-sdk/completion.zsh.inc' ]; then . '/Users/gregpaskal/google-cloud-sdk/completion.zsh.inc'; fi
      6. Save the updates
      7. Reboot your computer.
    10. Setup the Development Environment on your local machine
      1. In a terminal type the following to Initialize the Google Cloud SDK (if not already done):
        gcloud init
    11. Authenticate with Google Cloud: Ensure you are authenticated:
      1. In a terminal type the following
        gcloud auth login
      2. You will be prompted to authenticate into your Google account
    12. Ensure you have the right project set
      1. In a terminal type the following.
      2. Ensure you have the right project ID (e.g. hello-data-441915)
        gcloud config set project hello-data-441915
      3. This should result in something like this in your terminal
        Updated property [core/project]
    13. Ensure the correct project ID is set
      1. In a terminal type the following
        gcloud config list
      2. You should see something like this in your terminal
        [core]
        account = gregpaskal@gmail.com
        disable_usage_reporting = True
        project = hello-data-441915
        Your active configuration is: [default]
    14. Grant the service account the necessary permissions to access the staging bucket
      1. In a terminal type the following command
        gcloud iam service-accounts list
      2. Look for something like the following in terminal
        DISPLAY NAME                        EMAIL                                                DISABLED
        Default compute service account     1010204813344-compute@developer.gserviceaccount.com  False
        App Engine default service account  hello-data-441915@appspot.gserviceaccount.com        False
      3. In the terminal, run the following command to grant the Storage Admin role to the service account
        gcloud projects add-iam-policy-binding hello-data-441915 --role roles/storage.admin --member serviceAccount:hello-data-441915@appspot.gserviceaccount.com
    15. Create a Hello Data Application (in VSCode)
      1. This is the structure of the project we are about to create
        Datalake-JAV-HelloData/
        ├── src/
        │   ├── main/
        │   │   ├── java/
        │   │   │   └── com/
        │   │   │       └── example/
        │   │   │           └── HelloDataApplication.java
        │   │   └── resources/
        │   │       └── application.properties
        └── pom.xml
        └── app.yaml
      2. Create a pom.xml file in the project root that targets Java 21, declares the Spring Boot web starter and the Google Cloud Storage client as dependencies, and configures the Spring Boot Maven plugin so that mvn clean package produces the target/hello-data-1.0-SNAPSHOT.jar referenced in app.yaml
      3. Create a file HelloDataApplication.java inside src/main/java/com/example with the following content
        package com.example;
        
        import com.google.cloud.storage.Blob;
        import com.google.cloud.storage.Bucket;
        import com.google.cloud.storage.Storage;
        import com.google.cloud.storage.StorageOptions;
        import org.springframework.boot.SpringApplication;
        import org.springframework.boot.autoconfigure.SpringBootApplication;
        import org.springframework.web.bind.annotation.GetMapping;
        import org.springframework.web.bind.annotation.PostMapping;
        import org.springframework.web.bind.annotation.RequestBody;
        import org.springframework.web.bind.annotation.RequestParam;
        import org.springframework.web.bind.annotation.RestController;
        
        import java.nio.charset.StandardCharsets;
        import java.util.ArrayList;
        import java.util.List;
        
        @SpringBootApplication
        public class HelloDataApplication {
            public static void main(String[] args) {
                SpringApplication.run(HelloDataApplication.class, args);
            }
        }
        
        @RestController
        class HelloDataController {
        
            private final Storage storage = StorageOptions.getDefaultInstance().getService();
            private final String bucketName = "YOUR_BUCKET_NAME";
        
            @PostMapping("/submit")
            public String submitData(@RequestBody String data) {
                Bucket bucket = storage.get(bucketName);
                Blob blob = bucket.create("data/" + System.currentTimeMillis() + ".txt", data.getBytes(StandardCharsets.UTF_8));
                return "Data submitted: " + data;
            }
        
            @GetMapping("/data")
            public List<String> getData() {
                Bucket bucket = storage.get(bucketName);
                List<String> dataList = new ArrayList<>();
                for (Blob blob : bucket.list(Storage.BlobListOption.prefix("data/")).iterateAll()) {
                    dataList.add(new String(blob.getContent(), StandardCharsets.UTF_8));
                }
                return dataList;
            }
        }
      4. Create a file application.properties inside src/main/resources with the following content
        spring.main.web-application-type=servlet
      5. In the root directory of your project, create a file named app.yaml with the following content
        runtime: java21
        entrypoint: 'java -jar target/hello-data-1.0-SNAPSHOT.jar'
        
        handlers:
        - url: /.*
          script: auto
    16. Build your project
      mvn clean package
    17. Deploy Your Application to App Engine (which resides on the Google Cloud)
      1. We will use the Google Cloud SDK which you installed earlier to perform this.
      2. In a terminal type the following
        gcloud app deploy
      3. Visit your deployed application: Once the deployment is complete, you'll get a URL to access your application (https://hello-data-441915.uc.r.appspot.com/). Visit this URL to see your "Hello Data" form, or exercise the endpoints from a script as shown in the sketch after these steps.
    18. View and monitor your logs
      1. In a terminal, type the following
        gcloud app logs tail -s default
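
Once the app is deployed, you can also exercise the two endpoints from a script rather than the browser, as referenced in the deployment step above. Below is a minimal Python sketch using only the standard library; it assumes the App Engine URL shown above and the /submit and /data routes defined in HelloDataApplication.java.

  # Minimal smoke test for the deployed Hello Data endpoints.
  # Assumes the App Engine URL below and the /submit (POST) and /data (GET)
  # routes defined in HelloDataApplication.java.
  import json
  import urllib.request

  BASE_URL = "https://hello-data-441915.uc.r.appspot.com"

  def submit(data: str) -> str:
      """POST a raw string to /submit; the app writes it into the bucket."""
      req = urllib.request.Request(
          f"{BASE_URL}/submit",
          data=data.encode("utf-8"),
          headers={"Content-Type": "text/plain"},
          method="POST",
      )
      with urllib.request.urlopen(req) as resp:
          return resp.read().decode("utf-8")

  def fetch_all() -> list:
      """GET /data, which returns a JSON array of the stored strings."""
      with urllib.request.urlopen(f"{BASE_URL}/data") as resp:
          return json.loads(resp.read().decode("utf-8"))

  if __name__ == "__main__":
      print(submit("Hello from the smoke test"))
      print(fetch_all())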

Setting up Google Cloud Buckets and Folders

A Google Cloud bucket is essentially a storage container for content. Buckets can contain folders to organize the items in them. For the Hello Data project, I created hello_data_441915_bucket. At this point, I've learned it's a good idea to use underscores in bucket names and to keep things lowercase as well. The console steps are below; a scripted alternative is sketched after them.

  1. In Google Cloud, ensure you have the right project selected (Hello Data)
  2. Click on the Hamburger menu and select "Cloud Storage"
  3. From the Cloud Storage screen, click "Buckets"
  4. Click the "Create" button
  5. Give your bucket a unique name like "hello_data_441915_bucket_v2" and click the "Continue" button
  6. Select the "us-central1" option and click "Continue" button
  7. Select "Set a default class" and "Standard" option and click "Continue" button
  8. Check "Enforce public access prevention on this bucket"
  9. Select "Uniform" option and click the "Continue" button
  10. Check "Soft delete policy" and "Use default retention duration" option
  11. Click the "Create" button.
  12. If you get a dialog regarding public access being prevented, ensure "Enforce public access prevention" is checked and click the "Confirm" button.
  13. With your new bucket created, let's now add some folders.
  14. Click "Create Folder" and name it "javascript_transformer" and click the "Create" button
  15. Click "Create Folder" and name it "pre_processed" and click the "Create" button
  16. Click "Create Folder" and name it "schemas_bigquery" and click the "Create" button
  17. Click "Create Folder" and name it "temp" and click the "Create" button

Uploading data to Google Bucket - Manually

To begin testing your Data Lake solution, consider uploading some data manually. You can use this approach to ensure some basic moving parts are working.

  1. In Google Cloud, ensure you have the right project selected (Hello Data)
  2. Click on the Hamburger menu and select "Cloud Storage"
  3. From the Cloud Storage screen, click "Buckets"
  4. Click the bucket you want to upload data into (e.g. hello_data_441915_bucket_v2)
  5. Click the "Upload" button
  6. Select a sample data csv file (e.g. test_results_01.csv)
  7. Verify csv file was uploaded

Uploading data to Google Bucket - Google Cloud SDK

We will now upload sample data to the Google Bucket using the Google Cloud SDK. This ensures your computer is configured correctly for this task. (A Python client alternative is sketched after these steps.)

  1. In a terminal type the following
    gcloud info | grep "Python"
  2. You should see some text like the following
    Python Version: [3.11.6 (main, Nov 29 2024, 05:43:21) [Clang 16.0.0 (clang-1600.0.26.4)]]
    Python Location: [/Users/gregpaskal/.pyenv/versions/3.11.6/bin/python3]
    Python PATH: [/Users/gregpaskal/google-cloud-sdk/lib/third_party:/Users/gregpaskal/google-cloud-sdk/lib:/Users/gregpaskal/.pyenv/versions/3.11.6/lib/python311.zip:/Users/gregpaskal/.pyenv/versions/3.11.6/lib/python3.11:/Users/gregpaskal/.pyenv/versions/3.11.6/lib/python3.11/lib-dynload]
  3. Use Google Cloud SDK to determine what buckets are available by typing the following in a terminal
    gsutil ls
  4. You should see something like the following
    gs://hello-data-441915.appspot.com/
    gs://staging.hello-data-441915.appspot.com/
  5. Now that you know the buckets, try to upload a sample file
    gsutil cp sample_data/test_results_02.csv gs://hello-data-441915.appspot.com/
  6. You should see something like the following
    Copying file://sample_data/test_results_02.csv [Content-Type=text/csv]...
    / [1 files][  676.0 B/  676.0 B]                                                
    Operation completed over 1 objects/676.0 B.
  7. Check your bucket on Google Cloud storage to confirm file uploaded.
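
The same upload can be done from Python with the google-cloud-storage client, the alternative referenced above, which becomes useful once you start automating ingestion. A minimal sketch, assuming the default App Engine bucket and sample file from the gsutil commands above:

  # Sketch: upload a sample CSV and list the bucket contents with the Python
  # client (equivalent to the gsutil cp / gsutil ls commands above).
  from google.cloud import storage

  BUCKET_NAME = "hello-data-441915.appspot.com"   # bucket from the gsutil example
  LOCAL_FILE = "sample_data/test_results_02.csv"  # sample file from the gsutil example

  client = storage.Client()
  bucket = client.bucket(BUCKET_NAME)

  # Upload the local CSV; the blob name becomes the object path in the bucket.
  blob = bucket.blob("test_results_02.csv")
  blob.upload_from_filename(LOCAL_FILE, content_type="text/csv")
  print(f"Uploaded {LOCAL_FILE} to gs://{BUCKET_NAME}/{blob.name}")

  # List the bucket to confirm the upload, like gsutil ls.
  for item in client.list_blobs(BUCKET_NAME):
      print(f"gs://{BUCKET_NAME}/{item.name} ({item.size} bytes)")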

Create a BigQuery Dataset and Table

  1. In the Google Cloud Console, go to the Hamburger menu
  2. Look for BigQuery
  3. Look for your Project ID (e.g. hello-data-441915)
  4. Click the three dots next to it and select "Create dataset"
  5. Dataset ID: hello_data_dataset (Or your preferred dataset ID)
  6. Location type: Multi-region
  7. Multi-region: US (Multiple regions in United States)
  8. Leave everything else as is.
  9. Click "Create Dataset" button
  10. You should now see your dataset.
  11. Click three dots next to newly created dataset and select "Create Table"
  12. In "Source" section, select "Create table from" and choose "Empty table"
  13. In Destination section, Project should already be set (e.g. hello-data-441915)
  14. In Destination section, Dataset should already be set (e.g. hello_data_dataset)
  15. In Destination section, Table set this to a unique name (e.g. hello_data_json)
  16. In Destination section, Table Type should already be set to "Native table"
  17. Leave the rest as is
  18. Click "Create Table" button.

Data Processing Pipeline

We will now create a Data Processing Pipelines Using Google Cloud Dataflow. The purpose of this is to automate data ingestion, transformation, and loading of the data.

  1. In the Google Cloud Console, go to the Hamburger menu
  2. Look for Dataflow (you may need to search for it if it's not under the hamburger menu)
  3. Click on the three dots and select "Create Job from Template"
  4. Job Name: "hello_data_lc64"
  5. Regional endpoint: "us-central1 (Iowa)"
  6. Dataflow template: "Text Files on Cloud Storage to BigQuery"
  7. Source: "hello-data-441915.appspot.com/pre_processed/*.csv" or "hello-data-441915.appspot.com/pre_processed/*.txt" or "hello-data-441915.appspot.com/raw/test_text_01.json"
  8. Target: "hello-data-441915.appspot.com/BigQuery_Table_Schema/hello-data.json"
  9. BigQuery output table: "hello-data-441915:hello_data_dataset.hello_data_table"
  10. Temporary directory for BigQuery loading process: "hello-data-441915.appspot.com/temp/"
  11. Required Parameters, Temporary location: "hello-data-441915.appspot.com/temp/"
  12. Encryption: Google-managed encryption key (Selected)
  13. Dataflow Prime: Enable Dataflow Prime (unchecked)
  14. Click "Run Job"
  15. You should see a job graph
  • Performing more from the command line

    Once I got to this stage of working with the Google Cloud tools, I realized it was much easier to do much of this work from the command line. I am going to include a number of the files and commands I ran here that enabled me to work with both JSON-based and CSV-based files.

    When working with JSON-based data, these were common commands I used.

    # The following are commands you will use when interacting with Google Cloud Storage and BigQuery.
    
    # Copy JSON schema to Google Cloud Storage schemas_bigquery bucket
    gsutil cp sample_data/json_based_data_schema.json gs://hello_data_441915_bucket_v2/schemas_bigquery/
    
    # Copy JSON javascript transform to Google Cloud Storage javascript_transform bucket
    gsutil cp sample_data/json_based_data_javascript_transformer.js gs://hello_data_441915_bucket_v2/javascript_transformer/
    
    # Copy JSON test data 01 to Google Cloud Storage pre_processed bucket
    gsutil cp sample_data/json_based_data_sample_01.json gs://hello_data_441915_bucket_v2/pre_processed/
    
    # Copy JSON test data 02 to Google Cloud Storage pre_processed bucket
    gsutil cp sample_data/json_based_data_sample_02.json gs://hello_data_441915_bucket_v2/pre_processed/
    
    # Remove all files from Google Cloud Storage pre_processed bucket
    gsutil rm gs://hello_data_441915_bucket_v2/pre_processed/\*.\*
    
    # Kick off a Google Cloud Dataflow job to process the JSON data and load it into BigQuery
    gcloud dataflow jobs run hello_data_lc64 \
        --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
        --region us-central1 \
        --parameters \
    inputFilePattern=gs://hello_data_441915_bucket_v2/pre_processed/\*.json,\
    JSONPath=gs://hello_data_441915_bucket_v2/schemas_bigquery/json_based_data_schema.json,\
    bigQueryLoadingTemporaryDirectory=gs://hello_data_441915_bucket_v2/temp/,\
    javascriptTextTransformGcsPath=gs://hello_data_441915_bucket_v2/javascript_transformer/json_based_data_javascript_transformer.js,\
    outputTable=hello-data-441915:hello_data_dataset.hello_data_json,\
    javascriptTextTransformFunctionName=process
    • Data Schema for this data "json_based_data_schema.json"
      {
        "BigQuery Schema": [
          {
            "name": "line",
            "type": "STRING",
            "mode": "REQUIRED"
          }
        ]
      }
    • Javascript transformer for this data "json_based_data_javascript_transformer.js"
      function process(inJson) {
          return JSON.stringify({ "line": inJson });
      }
    • JSON formatted data for ingestion
      {"line": "Hello, world!"}
      {"line": "Dataflow is great."}
      {"line": "Transform this text."}

    When working with CSV-based data, these were common commands I used.

    # The following are commands you will use when interacting with Google Cloud Storage and BigQuery.
    
    # Copy CSV schema to Google Cloud Storage schemas_bigquery bucket
    gsutil cp sample_data/csv_based_data_schema.json gs://hello_data_441915_bucket_v2/schemas_bigquery/
    
    # Copy CSV javascript transformer to Google Cloud Storage javascript_transform bucket
    gsutil cp sample_data/csv_based_data_javascript_transformer.js gs://hello_data_441915_bucket_v2/javascript_transformer/
    
    # Copy CSV test data 01 to Google Cloud Storage pre_processed bucket
    gsutil cp sample_data/csv_based_data_sample_01.csv gs://hello_data_441915_bucket_v2/pre_processed/
    
    # Copy CSV test data 02 to Google Cloud Storage pre_processed bucket
    gsutil cp sample_data/csv_based_data_sample_02.csv gs://hello_data_441915_bucket_v2/pre_processed/
    
    # Remove all files from Google Cloud Storage pre_processed bucket
    gsutil rm gs://hello_data_441915_bucket_v2/pre_processed/\*.\*
    
    # Kick off a Google Cloud Dataflow job to process the CSV data and load it into BigQuery
    gcloud dataflow jobs run hello_data_lc64 \
        --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
        --region us-central1 \
        --parameters \
    inputFilePattern=gs://hello_data_441915_bucket_v2/pre_processed/\*.csv,\
    JSONPath=gs://hello_data_441915_bucket_v2/schemas_bigquery/csv_based_data_schema.json,\
    bigQueryLoadingTemporaryDirectory=gs://hello_data_441915_bucket_v2/temp/,\
    javascriptTextTransformGcsPath=gs://hello_data_441915_bucket_v2/javascript_transformer/csv_based_data_javascript_transformer.js,\
    outputTable=hello-data-441915:hello_data_dataset.hello_data_csv,\
    javascriptTextTransformFunctionName=process
    • Data Schema for this data "csv_based_data_schema.json"
      {
        "BigQuery Schema": [
          {
            "name": "action",
            "type": "STRING",
            "mode": "REQUIRED"
          },
          {
            "name": "id",
            "type": "INTEGER",
            "mode": "REQUIRED"
          },
          {
            "name": "test_run_id",
            "type": "INTEGER",
            "mode": "REQUIRED"
          },
          {
            "name": "test_suite_name",
            "type": "STRING",
            "mode": "REQUIRED"
          },
          {
            "name": "test_case_name",
            "type": "STRING",
            "mode": "REQUIRED"
          },
          {
            "name": "test_case_result",
            "type": "STRING",
            "mode": "REQUIRED"
          },
          {
            "name": "test_case_duration",
            "type": "FLOAT",
            "mode": "REQUIRED"
          }
        ]
      }
    • Javascript transformer for this data "csv_based_data_javascript_transformer.js"
      function process(inJson) {
        // Split the input string by comma (assuming CSV data)
        const parts = inJson.split(',');
      
        // Check if the first column is "action" or "skip" (case insensitive)
        const action = parts[0].toLowerCase();
        if (action === 'action' || action === 'skip') {
          return null; // Skip this row
        }
      
        // Create a JSON object with the correct keys and values
        return JSON.stringify({
          action: parts[0],
          id: parseInt(parts[1]),
          test_run_id: parseInt(parts[2]),
          test_suite_name: parts[3],
          test_case_name: parts[4],
          test_case_result: parts[5],
          test_case_duration: parseFloat(parts[6])
        });
      }
    • CSV formatted data for ingestion
      action,id,test_run_id,test_suite_name,test_case_name,test_case_result,test_case_duration
      execute,1,101,User Login,Verify Login with Valid Credentials,pass,5.1
      execute,2,101,User Login,Verify Login with Invalid Credentials,fail,2.3
      execute,3,102,Checkout Process,Verify Cart Addition,pass,3.8
      execute,4,102,Checkout Process,Verify Payment,pass,4.6
      execute,5,103,Search Functionality,Verify Search Results,fail,1.7
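
    Once a Dataflow job reports success, a quick way to confirm rows actually landed is to count them in the output tables from the two gcloud commands above. A minimal sketch with the google-cloud-bigquery client; it assumes both tables already exist, so drop whichever one you have not loaded yet.

      # Sketch: confirm the Dataflow jobs loaded rows into the output tables.
      from google.cloud import bigquery

      PROJECT_ID = "hello-data-441915"
      client = bigquery.Client(project=PROJECT_ID)

      for table in ("hello_data_json", "hello_data_csv"):
          query = f"SELECT COUNT(*) AS row_count FROM `{PROJECT_ID}.hello_data_dataset.{table}`"
          result = list(client.query(query).result())
          print(f"{table}: {result[0].row_count} rows")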

    Triggering Dataflow jobs automatically

    Eventually you are going to want to automate the process of ingesting the data uploaded to Google Cloud buckets and loading that data into BigQuery. The following steps will walk you through that process.

    1. Create a Pub/Sub Topic
      1. Go to Pub/Sub at https://console.cloud.google.com/cloudpubsub/
      2. Click "Create Topic" button
      3. Topic ID: hello_data_topic
      4. Add a default subscription: Checked
      5. Use a schema: Unchecked
      6. Enable ingestion: Unchecked
      7. Enable message retention: Unchecked
      8. Export message data to BigQuery: Unchecked
      9. Backup message data to Cloud Storage: Unchecked
      10. Encryption
        1. Select "Google-managed encryption key"
      11. Click "Create" button
    2. Configure Cloud Storage to Publish Notifications to Pub/Sub
      1. It's important to know that notifications within Google Cloud can only (at this time) be added via the command line; there is no GUI support for notifications. You can use a number of commands to work with notifications.
        1. List all notifications for a specific bucket
          gsutil notification list gs://hello_data_441915_bucket_v2
        2. Delete a notification from a specific bucket
          gsutil notification delete projects/_/buckets/hello_data_441915_bucket_v2/notificationConfigs/1
      2. In terminal, perform the following to create the notification.
        gsutil notification create -t hello_data_topic -f json gs://hello_data_441915_bucket_v2
      3. Verify the notification was created
        gsutil notification list gs://hello_data_441915_bucket_v2
      4. In a folder named "python-functions" create a file named "main.py" with the following contents
        import os
        from google.cloud import bigquery
        from google.cloud import storage
        
        PROJECT_ID = os.getenv('GCP_PROJECT')
        DATASET_ID = 'hello_data_dataset'
        TABLE_ID = 'hello_data_csv'
        TEMP_BUCKET = 'hello_data_441915_bucket_v2'
        TEMP_LOCATION = f'gs://{TEMP_BUCKET}/temp/'
        
        def trigger_dataflow(event, context):
            file_name = event['name']
            bucket_name = event['bucket']
        
            if file_name.startswith('pre_processed/') and file_name.endswith('.csv'):
                load_csv_to_bigquery(bucket_name, file_name)
        
        def load_csv_to_bigquery(bucket_name, file_name):
            client = bigquery.Client()
            table_ref = client.dataset(DATASET_ID).table(TABLE_ID)
        
            job_config = bigquery.LoadJobConfig(
                source_format=bigquery.SourceFormat.CSV,
                skip_leading_rows=1,
                autodetect=True,
                write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
            )
        
            uri = f'gs://{bucket_name}/{file_name}'
            load_job = client.load_table_from_uri(
                uri,
                table_ref,
                location='US',
                job_config=job_config,
            )
        
            load_job.result()  # Waits for the job to complete.
        
            print(f'Loaded {load_job.output_rows} rows into {DATASET_ID}:{TABLE_ID}.')
        
            delete_file(bucket_name, file_name)
        
        def delete_file(bucket_name, file_name):
            storage_client = storage.Client()
            bucket = storage_client.bucket(bucket_name)
            blob = bucket.blob(file_name)
            blob.delete()
            print(f'Deleted file: gs://{bucket_name}/{file_name}')
      5. In a folder named "python-functions" create a file named "requirements.txt" with the following contents
        google-cloud-bigquery
        google-cloud-storage
      6. Assign the Eventarc Service Agent Role
        gcloud projects add-iam-policy-binding hello-data-441915 \
            --member="serviceAccount:service-1010204813344@gcp-sa-eventarc.iam.gserviceaccount.com" \
            --role="roles/eventarc.serviceAgent"
      7. Grant the Pub/Sub Publisher Role
        gcloud projects add-iam-policy-binding hello-data-441915 \
            --member="serviceAccount:service-1010204813344@gs-project-accounts.iam.gserviceaccount.com" \
            --role="roles/pubsub.publisher"
      8. Run the Deployment Command
        gcloud functions deploy trigger_dataflow \
            --runtime python310 \
            --trigger-resource hello_data_441915_bucket_v2 \
            --trigger-event google.storage.object.finalize \
            --set-env-vars GCP_PROJECT=hello-data-441915 \
            --entry-point trigger_dataflow \
            --region us-central1
      9. Pick it up here - I am getting a schema mismatch between the uploaded data and what BigQuery expects. I don't get this when I run the job manually, so I suspect it might have to do with the Python file. - GP 12/8/2024
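
    One way to narrow down a mismatch like the one noted above is to run the handler locally with a synthetic storage event, outside Cloud Functions and Pub/Sub, so any BigQuery error surfaces directly in your terminal. A minimal sketch, assuming the main.py above, the bucket and sample file names used earlier, and working Application Default Credentials:

      # Sketch: exercise trigger_dataflow locally with a synthetic storage event.
      # Run from the python-functions folder with Application Default Credentials.
      # Warning: this performs a real BigQuery load and then deletes the object,
      # exactly like the deployed function.
      import os

      os.environ.setdefault("GCP_PROJECT", "hello-data-441915")

      from main import trigger_dataflow

      fake_event = {
          "bucket": "hello_data_441915_bucket_v2",
          "name": "pre_processed/csv_based_data_sample_01.csv",
      }

      # The handler never touches the context argument, so None is fine here.
      trigger_dataflow(fake_event, None)

    One difference worth comparing while debugging: the function loads with autodetect=True, while the manual Dataflow job supplies the explicit JSON schema file, so the two paths may not infer the same column types.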

    Visualizing the data using Google Data Studio

    Now that data is in BigQuery, we will do a few visualizations based on this data in Google Data Studio.

    1. Go to Looker Studio - https://lookerstudio.google.com/
    2. Click "Blank Report" button
    3. In Add Data to Report select "BigQuery"
    4. Click "My Projects" and select the following
      1. Project: "Hello Data"
      2. Dataset: "hello_data_dataset"
      3. Table: "hello_data_csv"
    5. Click "Add" button.
    6. When you see "You are about to add data to this report" click "Add to Report" button

    Now let's add a bar chart

    1. Click on "Add a chart" and select "Bar Chart"
    2. Size and locate the bar chart where you want it.
    3. Under "Chart" look for "Dimensions" and select "test_suite_name"
    4. Under "Chart" look for "Metrics" and click "Add Metric" and select "test_case_result"

    Querying a Data Lake

    Querying a data lake involves several steps to retrieve and process data stored in its raw, unstructured, or semi-structured format. Here’s a high-level overview of the process:

    1. Data Ingestion

    • Data Collection: Data is collected from various sources such as databases, IoT devices, social media, logs, etc.
    • Storage: The raw data is ingested into the data lake, where it is stored in its original format without any transformation.

    2. Data Cataloging

    • Metadata Tagging: Metadata is added to the stored data to make it searchable and manageable.
    • Indexing: The data lake indexes the data to improve query performance and data retrieval speed.

    3. Data Querying

    • Query Execution: Users write queries to request specific data from the data lake.
    • Processing: The data lake processes the query, retrieving the relevant data based on the query parameters.
    • Transformation: The retrieved data may be transformed or processed further to fit the needs of the analysis or application.

    4. Data Analysis

    • Analytics Tools: Users can use various analytics tools and frameworks (e.g., Apache Spark, Hadoop) to analyze the data.
    • Visualization: The results of the analysis can be visualized using dashboards, reports, or other visualization tools.

    5. Data Consumption

    • Access: The processed and analyzed data is made available to end-users, applications, or other systems for further use.

    Metadata Tagging

    A data lake derives its metadata through a process known as metadata management, which involves capturing, cataloging, and organizing metadata about the ingested data. This metadata is crucial for making the data searchable, manageable, and useful for analysis. Here's how this process typically works:

    1. Data Ingestion

    When data is ingested into the data lake, metadata is often captured automatically. This can include:

    • Technical Metadata: Information about the data's format, size, creation date, and source.
    • Operational Metadata: Details about data processing events, such as when and how the data was ingested and any transformations applied.

    2. Metadata Cataloging

    Once the metadata is captured, it is cataloged and stored in a metadata repository. This repository is often referred to as a data catalog. Tools and frameworks like Apache Atlas, AWS Glue, or Google Cloud Data Catalog are commonly used for this purpose.

    3. Metadata Types

    The metadata captured can be broadly classified into three types:

    • Descriptive Metadata: Provides context about the data, such as its purpose, origin, and characteristics.
    • Structural Metadata: Describes the structure of the data, such as schema definitions, data types, and relationships between different data entities.
    • Administrative Metadata: Information about the data's management, including access permissions, usage policies, and audit logs.

    4. Metadata Enrichment

    In addition to automatically captured metadata, data lakes can also incorporate enriched metadata to provide more context and value:

    • Business Metadata: Tags, labels, and descriptions that align the data with business terms and definitions, making it easier for users to understand and use the data.
    • User-Generated Metadata: Annotations, comments, and ratings provided by users who interact with the data, contributing to collaborative data governance.

    5. Search and Discovery

    The metadata catalog allows users to search and discover data within the data lake. This includes:

    • Indexing: Creating indexes for metadata to enable fast search and retrieval.
    • Tagging: Associating tags with data sets to classify and group related data.
    • Querying: Enabling users to query the metadata catalog to find specific data sets based on their attributes.

    Google Cloud SDK Overview

    Google Cloud SDK is a collection of tools and libraries that allow you to interact with Google Cloud services directly from your command line. It includes tools like:

    • gcloud: The main CLI tool for interacting with various Google Cloud services.
    • gsutil: A CLI tool for working with Google Cloud Storage.
    • bq: A CLI tool for interacting with BigQuery.

    Google Cloud CLI (gcloud)

    Google Cloud CLI (gcloud) is the command-line interface that is part of the Google Cloud SDK. It allows you to manage and configure Google Cloud resources. Some common commands include:

    • gcloud init: Initializes the SDK, setting up authentication and configuration.
    • gcloud auth login: Authenticates your Google Cloud account.
    • gcloud config set project [PROJECT_ID]: Sets the default project.
    • gcloud app deploy: Deploys your application to Google App Engine.
    • gcloud compute instances list: Lists all Compute Engine instances in your project.

    Reference URLs

    Hello Data - Spring Boot Application - Home: https://hello-data-441915.uc.r.appspot.com/

    Hello Data - Spring Boot Application - Upload: https://hello-data-441915.uc.r.appspot.com/uploadForm

    Hello Data - Google Cloud - App Engine: https://console.cloud.google.com/appengine?referrer=search&project=hello-data-441915&serviceId=default

    Hello Data - Google Cloud - Welcome: https://console.cloud.google.com/welcome/new?pli=1&project=hello-data-441915

    Hello Data - Google Cloud - Console: https://console.cloud.google.com/billing/0147A5-3C03EE-E8744F?project=hello-data-441915

    Hello Data - Google Cloud - App Engine: https://console.cloud.google.com/appengine/start?project=hello-data-441915

    Hello Data - Google Cloud - Dashboard: https://console.cloud.google.com/home/dashboard?invt=Abiung&project=hello-data-441915

    Hello Data - Google Cloud - Dataflow: https://console.cloud.google.com/dataflow/jobs?referrer=search&project=hello-data-441915

    Hello Data - Google Cloud - Logs Explorer: https://console.cloud.google.com/logs/query;query=%2528logName%20%3D%20%22projects%2Fhello-data-441915%2Flogs%2Fcloudaudit.googleapis.com%252Factivity%22%20OR%20logName%20%3D%20%22projects%2Fhello-data-441915%2Flogs%2Fcloudaudit.googleapis.com%252Fdata_access%22%20OR%20labels.activity_type_name:*%2529;cursorTimestamp=2024-11-28T21:55:03.821902Z;duration=P7D?invt=Abhvpg&project=hello-data-441915&walkthrough_id=panels--logging--query

    Hello Data - Google Cloud - Cloud Storage - Buckets: https://console.cloud.google.com/storage/browser?invt=Abhvpg&project=hello-data-441915&prefix=&forceOnBucketsSortingFiltering=true

    Hello Data - Google Cloud - Cloud Storage - Monitoring: https://console.cloud.google.com/storage/monitoring?invt=Abhvpg&project=hello-data-441915

    Hello Data - Google Cloud - Looker Studio (same as Data Studio): https://lookerstudio.google.com/

    Hello Data - Google Cloud - My First Report: https://lookerstudio.google.com/reporting/d86b0295-b7fb-4447-b1a8-d302228a35e1/page/QRUXE

    Cloud Run Functions - https://console.cloud.google.com/functions/details/us-central1/trigger_dataflow?project=hello-data-441915