I was surprised to see how much the difficulty of the Data Engineer certification has increased in 2 years. It was the easiest when I first took it 4 years ago, and manageably challenging when I took it 2 years ago. Having taken it again on 2022–05–10, I'd now call it moderately difficult. I got at least 4 out of 50 questions wrong. Here's what I captured:
- In a Cloud Dataflow job, when you want to limit calls to a microservice, study DoFn.StartBundle to micro-batch calls to external services. A GroupByKey, or simply calling the microservice in the DoFn for every element, isn't the way; see the sketch below.
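Here is a minimal sketch of that micro-batching pattern with the Beam Python SDK (the exam names it in Java terms as DoFn.StartBundle); `call_microservice_batch` is a hypothetical batch client, not a real API:

```python
import apache_beam as beam


class MicroBatchCallDoFn(beam.DoFn):
    """Buffers elements per bundle and calls the external service in batches."""

    BATCH_SIZE = 100  # illustrative

    def start_bundle(self):
        # A fresh buffer for every bundle the runner hands to this worker.
        self._buffer = []

    def process(self, element):
        self._buffer.append(element)
        if len(self._buffer) >= self.BATCH_SIZE:
            self._flush()
        yield element  # pass elements through unchanged

    def finish_bundle(self):
        # Send whatever is left over when the bundle ends.
        self._flush()

    def _flush(self):
        if self._buffer:
            # Hypothetical client call that accepts a list of records in one request.
            call_microservice_batch(self._buffer)
            self._buffer = []
```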
- How to meaningfully transform a city name into a feature when training a BigQuery ML model for product recommendation in an e-commerce scenario. Should you map the city name to regions 1 through 5, or one-hot encode the state while using the city as a binary feature? I wasn't sure of the answer because using the city as a binary feature wasn't clear to me.
- Learn the side input pattern for data enrichment. Where is the side input PCollection stored, and how is it updated? A sketch follows.
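A minimal sketch of side input enrichment in the Beam Python SDK; the bucket paths and field layout are made up, and the reference data is materialized as a dict side input kept in worker memory:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    # Small reference data, e.g. a product catalog keyed by product_id.
    catalog = (
        p
        | "ReadCatalog" >> beam.io.ReadFromText("gs://my-bucket/catalog.csv")
        | "ParseCatalog" >> beam.Map(lambda line: tuple(line.split(",")[:2]))
    )

    orders = p | "ReadOrders" >> beam.io.ReadFromText("gs://my-bucket/orders.csv")

    def enrich(order_line, catalog_dict):
        product_id, quantity = order_line.split(",")[:2]
        # Look up the product name from the side input.
        return {
            "product_id": product_id,
            "product_name": catalog_dict.get(product_id, "UNKNOWN"),
            "quantity": int(quantity),
        }

    enriched = orders | "Enrich" >> beam.Map(
        enrich, catalog_dict=beam.pvalue.AsDict(catalog)
    )
```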
- Learn the use cases for single-cluster vs. multi-cluster routing in BigTable application profiles. One example is adding an hourly analytical job to an existing high-throughput BigTable instance that has a single cluster. The solution is to add a second cluster and create the analytical application with a different application profile that routes to the new cluster with single-cluster routing, while the existing application keeps using the existing cluster for high reliability. Data in the first cluster is replicated to the second cluster for the analytical application to query; a sketch of such a profile is below.
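A sketch of creating such a profile, assuming the google-cloud-bigtable admin client; the project, instance, and cluster names are made up:

```python
from google.cloud import bigtable
from google.cloud.bigtable import enums

client = bigtable.Client(project="my-project", admin=True)
instance = client.instance("prod-instance")

# App profile that pins the analytical job to the second, replicated cluster.
analytics_profile = instance.app_profile(
    "analytics-profile",
    routing_policy_type=enums.RoutingPolicyType.SINGLE,  # single-cluster routing
    cluster_id="analytics-cluster",
    allow_transactional_writes=False,  # analytical reads only
)
analytics_profile.create(ignore_warnings=True)
```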
- BigTable rowKey schema design considerations: how do you avoid hotspots? Don't use a bare timestamp; put a zip code or a stock ticker in front of it. How do you ensure efficient lookup of the latest real-time price in a stock ticker time series? Creating a table per ticker and using a reversed timestamp as the rowKey is better than putting all the time series rows in a single table. A row-key sketch follows.
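A small sketch of the row-key idea: the ticker prefix spreads writes across tablets, and the zero-padded reversed timestamp makes the newest row sort first, so the latest price is the first result of a prefix scan (the names and the upper bound are illustrative):

```python
import time

MAX_TIMESTAMP_MS = 10**13  # illustrative upper bound for millisecond timestamps


def price_row_key(ticker: str, event_time_seconds: float) -> bytes:
    reversed_ts = MAX_TIMESTAMP_MS - int(event_time_seconds * 1000)
    # Zero-pad so lexicographic order matches numeric order.
    return f"{ticker}#{reversed_ts:013d}".encode("utf-8")


# Newer rows for "GOOG#..." sort before older ones under the same prefix.
key = price_row_key("GOOG", time.time())
```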
- How do you fix machine learning model underfitting? Introduce a more sophisticated model. How do you fix overfitting? L1/L2 regularization, or just removing some unrelated features. Generally, more high-quality training data increases model accuracy.
- Which IAM strategy is better to let a company's department leads create and update tables while analysts get read-only access: creating a dataset per department, or a table per department? Datasets contain tables, so you'd choose a dataset per department. Assign BigQuery Data Viewer to analysts who only need to read but not modify the data; a dataset-level grant is sketched below.
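A sketch of the dataset-level read-only grant with the BigQuery Python client; the project, dataset, and group are made up. The dataset-level READER basic role corresponds to BigQuery Data Viewer scoped to that dataset:

```python
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("my-project.finance_dept")

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",  # read-only; analysts can't modify tables
        entity_type="groupByEmail",
        entity_id="finance-analysts@example.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```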
- Performance troubleshooting of Cloud Dataflow: when a job is running at 100% CPU on its worker nodes, increase --max_num_workers (Python) or --maxNumWorkers (Java) on the job to fix it; see the options sketch below.
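A sketch of raising the autoscaling ceiling from the pipeline options in Python; the project, region, bucket, and the number 50 are illustrative:

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    max_num_workers=50,  # let autoscaling add workers when CPU is pegged
)
```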
- Cloud SQL instance in a single zone vs. high availability within a region: the high-availability configuration recovers from a zonal failure much better. Creating a replica of a zonal Cloud SQL instance does not provide high availability.
- Study the different use cases for Dataflow fixed, sliding, and tumbling windows. A sliding window is good for calculating moving averages, for example a window length of 1 hour with a 5-minute period to compute the moving average over the past hour. A fixed window (== tumbling window) is good for calculating the max, min, or average within each hour without overlaps. A windowing sketch follows.
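A sketch contrasting the two strategies on a toy (symbol, price) stream; the values and timestamps are illustrative:

```python
import time

import apache_beam as beam
from apache_beam import window

now = time.time()
with beam.Pipeline() as p:
    prices = (
        p
        | beam.Create([("GOOG", 101.0), ("GOOG", 103.0), ("GOOG", 99.0)])
        | "AddTimestamps" >> beam.Map(lambda kv: window.TimestampedValue(kv, now))
    )

    # Sliding window: 1-hour windows starting every 5 minutes -> moving average.
    moving_avg = (
        prices
        | "Slide1h" >> beam.WindowInto(window.SlidingWindows(size=3600, period=300))
        | "MeanPerKey" >> beam.combiners.Mean.PerKey()
    )

    # Fixed (tumbling) window: non-overlapping 1-hour windows -> hourly max.
    hourly_max = (
        prices
        | "Fixed1h" >> beam.WindowInto(window.FixedWindows(3600))
        | "MaxPerKey" >> beam.CombinePerKey(max)
    )
```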
- How to save costs on Cloud Dataproc without data loss, using preemptible workers and graceful decommissioning. Know that SSDs are more expensive than standard disks. Creating Compute Engine instances and installing Hadoop software yourself to run Spark jobs is usually the wrong answer.
- Learn that Admin Activity audit logs are retained for 400 days while Data Access audit logs are retained for 30 days. If auditors need to see logs from 6 months ago, export the logs with a Logs Router sink to a bucket in a different project to prevent losing them; a sink sketch is below.
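A sketch of creating such a sink with the Cloud Logging Python client; the sink name, filter, and destination bucket are illustrative:

```python
from google.cloud import logging

client = logging.Client(project="my-project")

# Route Data Access audit logs to a Cloud Storage bucket (ideally owned by a
# different, locked-down project) so they outlive the 30-day retention.
sink = client.sink(
    "audit-logs-to-gcs",
    filter_='logName:"cloudaudit.googleapis.com%2Fdata_access"',
    destination="storage.googleapis.com/my-audit-log-archive",
)
sink.create()
```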
- Understand that it's unreasonable to move 2 PB of data from an on-premises NAS to GCS over a 20 MB/sec connection: 2 PB is roughly 2,000,000,000 MB, which at 20 MB/sec takes about 10^8 seconds, i.e. over 3 years. Order a Transfer Appliance.
- If you have only 2 months left on the data center colocation lease to migrate Hadoop workloads like Spark or Hive, don't bother modernizing to BigQuery or Dataflow; migrate to Dataproc first.
- In most cases, if a real-time dashboard is a business requirement, no daily or on-demand batch load job into BigQuery will work. Prefer a streaming Dataflow job from Pub/Sub to BigQuery, with some sort of transformation to filter invalid values or otherwise improve data quality.
- Understand which database services are ACID-compliant, which are columnar, and how to write queries that optimize for cost and performance in BigQuery and Spanner.
- Learn how to back up BigQuery tables with table snapshots; a DDL sketch is below.
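A sketch of the snapshot DDL run through the Python client; the dataset, table names, and the 90-day expiration are illustrative:

```python
from google.cloud import bigquery

client = bigquery.Client()
client.query(
    """
    CREATE SNAPSHOT TABLE mydataset.orders_backup_20220510
    CLONE mydataset.orders
    OPTIONS (
      expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
    )
    """
).result()
```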
- Learn how to use slot reservations to get beyond the default 2,000 slots per project in an organization where each department has its own budget.
- Understand the basics of model training performance. For example, when training takes 2 days on an n1-standard-16, changing it to an n1-highmem-16 won't help much because the training is CPU-bound, not memory-bound; memory-bound training jobs usually fail with out-of-memory errors. GPUs dramatically improve training performance and can reduce cost.
- How to use intents in Dialogflow for a low-code, quick-solution use case. Building TensorFlow models takes longer and requires lots of Python code.
- How to cluster a BigQuery table, and why the order of clustering columns matters in a partitioned table; see clustering partitioned tables. You don't want to mess up the column order in the clustering spec and degrade the performance of SELECT col_x, col_y, col_z FROM dataset.table WHERE col_a='value_a' AND col_b='value_b' AND col_c='value_c'. Timestamp-partitioned tables improve the performance of queries over the last x days (x < 33). A DDL sketch follows.
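A sketch of a partitioned table whose clustering column order matches that WHERE clause; the table and column names are illustrative:

```python
from google.cloud import bigquery

client = bigquery.Client()
client.query(
    """
    CREATE TABLE mydataset.events
    PARTITION BY DATE(event_timestamp)
    CLUSTER BY col_a, col_b, col_c AS
    SELECT * FROM mydataset.events_staging
    """
).result()
```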
- How do you install Dataproc Spark job dependencies inside a VPC Service Controls service perimeter without egress to the Internet? Have the startup script on the Dataproc worker nodes copy the dependencies from a bucket inside the service perimeter, such as a bucket in the same project, to a local folder and run them from there.
- In a migration scenario from an on-premises database to BigQuery, what's the best performance recommendation for the destination table schema when the source tables are queried with SQL JOIN statements? Study when to use nested and repeated columns in a denormalized schema, for example denormalizing a singer's name and gender into a nested column of the songs table. Migrating the tables to BigQuery without changing the schema misses the denormalization. A schema sketch follows.
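A sketch of the denormalized destination schema with the singer embedded as a nested STRUCT column, plus a repeated column; the field names are illustrative:

```python
from google.cloud import bigquery

client = bigquery.Client()
client.query(
    """
    CREATE TABLE mydataset.songs (
      song_id STRING,
      title STRING,
      singer STRUCT<name STRING, gender STRING>,  -- nested column
      genres ARRAY<STRING>                        -- repeated column
    )
    """
).result()
```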
- Understand how to use TRANSFORM in a BigQuery ML CREATE MODEL statement. The TRANSFORM preprocessing is built into the model, so ML.EVALUATE and ML.PREDICT don't need the TRANSFORM clause. A sketch is below.
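A sketch of CREATE MODEL with a TRANSFORM clause; the dataset, columns, and model type are illustrative:

```python
from google.cloud import bigquery

client = bigquery.Client()
client.query(
    """
    CREATE OR REPLACE MODEL mydataset.purchase_model
    TRANSFORM(
      ML.QUANTILE_BUCKETIZE(price, 10) OVER () AS price_bucket,
      city,
      label
    )
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['label']) AS
    SELECT price, city, label FROM mydataset.training_data
    """
).result()
```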
- How to design Cloud Composer environments in a large-scale Shared VPC production network: you don't want to put the Cloud Composer environments in the host project but in a service project.
- To monitor a streaming Dataflow job's performance from Pub/Sub to Cloud Storage, check the Pub/Sub subscription metric subscription/num_undelivered_messages and the Dataflow instances' storage metric instance/storage/used_bytes. Decreasing num_undelivered_messages and the used_bytes on disk indicate a functioning Dataflow job. You need to know that when undelivered messages keep increasing in the subscription, the Dataflow job isn't doing its work.
- Skills for troubleshooting Dataflow batch jobs are important. If you haven't done hands-on work in Dataflow, I recommend some practice. If you can see the Java exceptions or Python errors in the Dataflow job page's logs, like the ones in the Dataflow pipeline troubleshooting and debugging guide, you know the pipeline already started with the right permissions to create worker nodes; it was the job's Python or Java code executing on the workers that caused the failures.
- I was surprised to learn that besides implementing the Cloud Pub/Sub Kafka connector, you can use Apache Beam's Kafka IO in Dataflow, e.g. kafkaio.KafkaConsume(kafka_config) in Python, to directly consume Kafka messages. I forgot the exact question, but Kafka IO was among the answers. A hedged sketch is below.
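A minimal sketch of consuming Kafka directly, assuming the kafkaio.KafkaConsume transform the bullet refers to (I believe it comes from the beam-nuggets package); the broker, topic, and consumer group are made up:

```python
import apache_beam as beam
from beam_nuggets.io import kafkaio  # assumption: the package providing KafkaConsume

kafka_config = {
    "topic": "transactions",
    "bootstrap_servers": "kafka-broker:9092",
    "group_id": "dataflow-consumer-group",
}

with beam.Pipeline() as p:
    messages = (
        p
        | "ConsumeKafka" >> kafkaio.KafkaConsume(kafka_config)
        | "Values" >> beam.Map(lambda key_value: key_value[1])  # drop the key
    )
```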
- I was surprised to see a question about DLP pseudonymization. I used DLP's deterministic encryption using AES-SIV and format-preserving encryption at work in 2021 and created unit tests on GitHub to demonstrate how to tokenize SSNs. The scenario was to encrypt PII data in BigQuery to prevent data analysts from peeking at it, yet restore it in a later step of the data processing pipeline. The correct answer is to use one of the 2 pseudonymization techniques.