2024 Google Cloud Professional Data Engineer exam

Hil Liao
3 min read · Jul 4, 2024


  1. How does an authorized view differ from an authorized dataset? How does each interact with column-level access policies (policy tag taxonomies) in the dataset, and with CMEK-protected datasets? (A sketch of authorizing a view follows this list.)
  2. A disk-I/O-heavy Hadoop data processing job migrated to Dataproc is slow when processing data in GCS. What's the solution? Attach local SSDs to the cluster's Compute Engine instances per https://cloud.google.com/dataproc/docs/concepts/compute/dataproc-local-ssds (sketch after this list).
  3. When you run a BigQuery SELECT against an external table over Optimized Row Columnar (ORC) files well partitioned across 1,000 GCS files, the query is slow even though it uses no JOINs and filters with WHERE on the partition ID. How do you solve it? Map each GCS file to its own external table and query them together with a wildcard table per https://cloud.google.com/bigquery/docs/querying-wildcard-tables (sketch after this list).
  4. Understand the machine learning model training stages. For example, after you have preprocessed the .csv files, the next step is to determine the training and evaluation split, not to evaluate, tune, or deploy the model (sketch after this list).
  5. Study time-based access patterns in Bigtable schema design per https://cloud.google.com/bigtable/docs/schema-design#time-based. The scenario is to choose the right data product for storing IoT metrics emitted every second, with 10-millisecond latency at high throughput: prefer Bigtable over BigQuery. Design the row key as $IoT_ID#$TimeStamp, not $IoT_ID#$Metric, because the access pattern is time-based retrieval, and don't put $TimeStamp at the beginning of the row key (sketch after this list).
  6. Know the steps to configure a dead-letter topic on a subscription per https://cloud.google.com/pubsub/docs/handling-failures#dead_letter_topic. The dead-letter topic is a subscription property: when a subscriber repeatedly fails to acknowledge a message, Pub/Sub forwards that message to the dead-letter topic after the configured maximum delivery attempts (sketch after this list).
  7. Know the common Cloud Dataflow design practice for handling failed transformations in a streaming pipeline: wrap the DoFn logic in a try-except (try-catch in Java) block and publish failed elements to a dead-letter topic for manual error handling (sketch after this list).
  8. Design change data capture (CDC) for migrating an existing CDC process from on-premises PostgreSQL to BigQuery without a public IP. Choose Datastream with private connectivity per https://cloud.google.com/datastream/docs/private-connectivity#overview.
  9. You notice out-of-memory exceptions in batch Dataflow jobs. Solve them with Dataflow Prime's vertical autoscaling and right fitting per https://cloud.google.com/dataflow/docs/guides/enable-dataflow-prime#vertical_autoscaling; increasing the worker count does not help (sketch after this list).
  10. Understand which ETL products are low-code or no-code, such as Cloud Data Fusion and Dataprep.
  11. Know the scenarios for choosing a tumbling window, hopping window, or session window in streaming Dataflow pipeline design. The scenario is measuring construction site noise from a sensor that sends a noise-level integer every minute: to detect noise that stays above a threshold for 30 minutes, evaluated every 15 minutes, prefer a hopping window with a 30-minute duration and a 15-minute period (sketch after this list).
  12. Understand basic DevOps principles. The scenario is to design a CI/CD pipeline for deploying Python code to Cloud Composer 2. The correct deployment method is to copy the Python DAG files to the Composer environment's GCS dags folder, not to run the Python code with a GKE Pod operator. The deployment method needs to be consistent across the development, testing, and production projects (sketch after this list).
  13. You are migrating an existing PostgreSQL database to Google Cloud. The method needs to be cost effective on a tight timeline. How do you proceed? Use Database Migration Service to migrate to Cloud SQL for PostgreSQL; don't migrate to Cloud Spanner or Bigtable.
  14. How do you migrate an Oracle database to PostgreSQL using Database Migration Service when the Oracle database host has no public IP? Configure private connectivity with a reverse proxy per https://cloud.google.com/database-migration/docs/oracle-to-postgresql/private-connectivity.
  15. What's the best tool for data analysts who are strongest in SQL and don't know Python or Java to build SQL workflows for ETL? Choose Dataform.
  16. Know the usage pattern of the Cloud SQL Auth Proxy. The scenario is about securing connections to a Cloud SQL instance. The Cloud SQL Auth Proxy does not require adding any IP to the instance's authorized networks, and the connection is encrypted. Although not seen in the exam, the Cloud SQL Auth Proxy does not provide a new connectivity path; it relies on existing IP connectivity. To connect to a Cloud SQL instance using private IP, the Cloud SQL Auth Proxy must run on a resource with access to the same VPC network as the instance (sketch after this list).
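
Below are minimal Python sketches for several of the items above. They are illustrations under stated assumptions, not official solutions. For item 1, this is how a view can be authorized to read a dataset with the BigQuery Python client; the project, dataset, and view names are placeholders. An authorized dataset works the same way at the dataset level, authorizing every view in the shared dataset at once.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# Dataset that holds the source (private) tables.
source_dataset = client.get_dataset("my-project.private_data")

# View living in a separate, shareable dataset (placeholder names).
view_ref = {
    "projectId": "my-project",
    "datasetId": "shared_views",
    "tableId": "sales_summary",
}

# Authorize the view to read the source dataset without granting end users
# any access to the underlying tables.
entries = list(source_dataset.access_entries)
entries.append(bigquery.AccessEntry(role=None, entity_type="view", entity_id=view_ref))
source_dataset.access_entries = entries
client.update_dataset(source_dataset, ["access_entries"])
```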
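
For item 2, a sketch of creating a Dataproc cluster whose workers attach local SSDs; the cluster name, machine type, disk sizes, and SSD count are illustrative values.

```python
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "cluster_name": "hadoop-migration",  # illustrative name
    "config": {
        "worker_config": {
            "num_instances": 4,
            "machine_type_uri": "n2-standard-8",
            # Local SSDs provide the scratch/shuffle I/O that disk-heavy Hadoop jobs expect.
            "disk_config": {"boot_disk_size_gb": 500, "num_local_ssds": 2},
        },
    },
}

operation = client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
)
operation.result()  # block until the cluster is created
```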
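
For item 3, a sketch of the wildcard-table query syntax; the dataset, table prefix, and suffix value are assumptions, with _TABLE_SUFFIX standing in for the partition ID encoded in each table name.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# Assume tables are named orc_data_<partition_id>, one table per GCS file.
query = """
    SELECT device_id, reading
    FROM `my-project.analytics.orc_data_*`
    WHERE _TABLE_SUFFIX = '20240601'
"""

for row in client.query(query).result():
    print(row.device_id, row.reading)
```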
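
For item 4, a minimal sketch of the training/evaluation split step that follows preprocessing; the file name, label column, and split ratio are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Preprocessed features and label; file and column names are placeholders.
df = pd.read_csv("preprocessed.csv")
features = df.drop(columns=["label"])
labels = df["label"]

# Hold out 20% of the rows for evaluation before any training, tuning, or deployment.
X_train, X_eval, y_train, y_eval = train_test_split(
    features, labels, test_size=0.2, random_state=42
)
```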
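
For item 5, a sketch of writing and scanning rows keyed by $IoT_ID#$TimeStamp with the Bigtable Python client; the instance, table, and column family names are assumptions.

```python
import time
from google.cloud import bigtable
from google.cloud.bigtable.row_set import RowSet

client = bigtable.Client(project="my-project", admin=False)
table = client.instance("iot-instance").table("metrics")  # placeholder names

# Device ID first, timestamp second: one device's readings are contiguous,
# so a time-range lookup for that device is a single key-range scan.
device_id = "sensor-42"
row_key = f"{device_id}#{int(time.time())}".encode()

row = table.direct_row(row_key)
row.set_cell("readings", "noise_db", b"87")  # column family "readings" is assumed
row.commit()

# Time-based retrieval: scan one device's rows between two epoch timestamps.
row_set = RowSet()
row_set.add_row_range_from_keys(
    start_key=f"{device_id}#1719800000".encode(),
    end_key=f"{device_id}#1719900000".encode(),
)
for r in table.read_rows(row_set=row_set):
    print(r.row_key)
```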
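
For item 6, a sketch of attaching a dead-letter topic while creating a subscription; the project, topic, and subscription IDs are placeholders.

```python
from google.cloud import pubsub_v1

project_id = "my-project"  # placeholder
subscriber = pubsub_v1.SubscriberClient()

topic_path = f"projects/{project_id}/topics/ingest"
dead_letter_topic_path = f"projects/{project_id}/topics/ingest-dead-letter"
subscription_path = subscriber.subscription_path(project_id, "ingest-sub")

# After 5 failed delivery attempts (nacks or expired ack deadlines),
# Pub/Sub forwards the message to the dead-letter topic.
dead_letter_policy = pubsub_v1.types.DeadLetterPolicy(
    dead_letter_topic=dead_letter_topic_path,
    max_delivery_attempts=5,
)

with subscriber:
    subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "dead_letter_policy": dead_letter_policy,
        }
    )
```

The Pub/Sub service agent also needs publish permission on the dead-letter topic and subscribe permission on the source subscription for forwarding to work.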
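
For item 7, a sketch of a DoFn that catches transformation failures and tags them to a dead-letter output; the parsing logic, sample input, and output names are illustrative.

```python
import json
import apache_beam as beam
from apache_beam import pvalue

class ParseEvent(beam.DoFn):
    DEAD_LETTER = "dead_letter"

    def process(self, element):
        try:
            yield json.loads(element)  # main output: successfully parsed events
        except Exception:
            # Route the raw element to the dead-letter output instead of failing the bundle.
            yield pvalue.TaggedOutput(self.DEAD_LETTER, element)

with beam.Pipeline() as p:
    results = (
        p
        | beam.Create(['{"ok": 1}', "not json"])
        | beam.ParDo(ParseEvent()).with_outputs(ParseEvent.DEAD_LETTER, main="parsed")
    )
    # results.parsed feeds the rest of the pipeline; results.dead_letter could be
    # published to a Pub/Sub topic or written to GCS for manual error handling.
    results.dead_letter | beam.Map(print)
```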
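
For item 9, a sketch of enabling Dataflow Prime (which brings vertical autoscaling) from the Python pipeline options; the project, region, and bucket values are placeholders.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# enable_prime turns on Dataflow Prime so memory-starved workers can be
# right-fitted and vertically autoscaled instead of adding more workers.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # placeholder
    region="us-central1",
    temp_location="gs://my-bucket/tmp",   # placeholder bucket
    dataflow_service_options=["enable_prime"],
)

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2)
```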
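
For item 11, a sketch of a hopping (sliding) window over the per-minute noise readings: a 30-minute window that starts every 15 minutes. The threshold value and the synthetic input are illustrative.

```python
import apache_beam as beam
from apache_beam import window

THRESHOLD_DB = 85  # illustrative noise threshold

with beam.Pipeline() as p:
    (
        p
        | beam.Create([(t * 60, 90) for t in range(60)])  # (event_time_sec, noise_db)
        | beam.Map(lambda kv: window.TimestampedValue(kv[1], kv[0]))
        # Hopping window: 30-minute duration, a new window every 15 minutes.
        | beam.WindowInto(window.SlidingWindows(size=30 * 60, period=15 * 60))
        # If the minimum reading in a window exceeds the threshold, the noise
        # stayed above it for the whole window.
        | beam.CombineGlobally(min).without_defaults()
        | beam.Filter(lambda min_db: min_db > THRESHOLD_DB)
        | beam.Map(lambda min_db: f"Sustained noise above {THRESHOLD_DB} dB (window min {min_db})")
        | beam.Map(print)
    )
```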
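
For item 12, a sketch of the deployment step a CI/CD job would run: copy the DAG file into the Composer 2 environment's GCS dags folder. The bucket name, project, and file paths are placeholders; the real bucket comes from the environment's dagGcsPrefix.

```python
from google.cloud import storage

# The Composer environment's bucket; look it up from dagGcsPrefix rather than
# hard-coding it. Placeholder value below.
composer_bucket = "us-central1-my-env-12345678-bucket"

client = storage.Client(project="my-project")  # placeholder project
bucket = client.bucket(composer_bucket)

# Copying the file into dags/ is the deployment; the Airflow scheduler picks it up.
bucket.blob("dags/sales_pipeline_dag.py").upload_from_filename("./dags/sales_pipeline_dag.py")
```

The same script can run unchanged against the development, testing, and production environments, keeping the deployment method consistent across projects.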
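
For item 16, a sketch of connecting through a locally running Cloud SQL Auth Proxy; the database name and credentials are placeholders, and the proxy is assumed to be listening on 127.0.0.1:5432.

```python
import psycopg2

# The Cloud SQL Auth Proxy (started separately, e.g.
#   ./cloud-sql-proxy my-project:us-central1:my-instance)
# opens an encrypted, IAM-authorized tunnel to the instance, so no
# authorized-network IP ranges need to be added on the instance.
conn = psycopg2.connect(
    host="127.0.0.1",
    port=5432,
    dbname="appdb",          # placeholder database
    user="app_user",         # placeholder user
    password="app_password", # placeholder password
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT version()")
    print(cur.fetchone())
```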
