I just took and passed the exam via online proctoring.
- Per “Dataprep: jobs. How are Dataprep jobs created and run? What permissions do you need?”, I wonder if knowing the permissions around Dataprep has become less important. I suppose there are two roles regarding Dataprep: [agent, user].
- You can’t avoid Cloud Bigtable; it’s important to know application profiles and their routing policies (single-cluster vs. multi-cluster routing).
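
  For reference, a minimal sketch of creating an app profile with the Python Bigtable client; the project, instance, and cluster names are hypothetical, so verify the calls against the client version you use:

  ```python
  from google.cloud import bigtable
  from google.cloud.bigtable import enums

  # Hypothetical project and instance names.
  client = bigtable.Client(project="my-project", admin=True)
  instance = client.instance("my-instance")

  # Single-cluster routing pins requests to one cluster, which is what allows
  # single-row transactions; multi-cluster (ANY) routing trades that for
  # automatic failover between clusters.
  app_profile = instance.app_profile(
      app_profile_id="batch-analytics",
      routing_policy_type=enums.RoutingPolicyType.SINGLE,
      cluster_id="my-instance-c1",
      allow_transactional_writes=False,
  )
  app_profile.create(ignore_warnings=True)
  ```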
- You can’t avoid Cloud Dataflow. Study how to resolve a performance bottleneck when a streaming job pushes worker CPU to 100%, for example by tuning the pipeline options (sketched below).
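
  A hedged sketch of the kind of worker and autoscaling options I mean, using the Apache Beam Python SDK; the project, region, machine type, and worker counts are made up:

  ```python
  from apache_beam.options.pipeline_options import PipelineOptions

  # Hypothetical values; the point is that autoscaling, the worker count cap,
  # and the machine type are all tunable knobs when workers sit at 100% CPU.
  options = PipelineOptions(
      runner="DataflowRunner",
      project="my-project",
      region="us-central1",
      streaming=True,
      autoscaling_algorithm="THROUGHPUT_BASED",  # let Dataflow scale out under load
      max_num_workers=20,                        # cap on horizontal scaling
      machine_type="n1-highmem-4",               # bigger workers if a stage is CPU-bound
  )
  # pipeline = beam.Pipeline(options=options)
  ```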
- I wonder if Data Studio has become less important compared to generic BI reporting tools. Study the difference between batch and streaming in Cloud Dataflow, and how streaming makes data freshness near real-time for BI reporting charts.
- Watch out for the vague term `Stackdriver agent`: it could mean either the logging agent or the monitoring agent. Study the Stackdriver monitoring agent and its plugins; learn what the plugins provide and why they are the quickest way to get monitoring metrics, compared to custom metrics, which require more code to create.
- Learn when to use each of Dataflow’s streaming windowing functions (fixed, sliding, session). I also believe session windows are an exam hot topic; see the sketch below.
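
  A minimal Apache Beam (Python) sketch of session windows; fixed and sliding windows are noted in the comments, and the inline data is just dummy values:

  ```python
  import apache_beam as beam
  from apache_beam import window

  # A tiny batch pipeline just to show the windowing API; in a real streaming
  # job the elements would come from Pub/Sub with event timestamps.
  with beam.Pipeline() as p:
      _ = (
          p
          | "Create" >> beam.Create([("user1", 1, 0), ("user1", 2, 30), ("user2", 5, 900)])
          | "AddTimestamps" >> beam.Map(
              lambda e: window.TimestampedValue((e[0], e[1]), e[2]))
          # Session windows close after a gap of inactivity (10 minutes here),
          # which fits per-user activity analysis. window.FixedWindows(60) and
          # window.SlidingWindows(size=60, period=15) are the other common choices.
          | "SessionWindow" >> beam.WindowInto(window.Sessions(gap_size=10 * 60))
          | "SumPerUser" >> beam.CombinePerKey(sum)
          | "Print" >> beam.Map(print)
      )
  ```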
- I probably failed the deep Cloud Dataproc optimization questions, for example how to shorten the processing time of a shuffle-heavy job over 300 MB ORC files on GCS. I chose using HDFS as local temporary storage to avoid reading the files from GCS, then writing the results back to GCS. I still don’t know if that’s correct.
- Compared to Dataflow, Dataproc, and BigQuery, Cloud Dataprep is the solution for analysts without coding skills to do ETL. I suppose Cloud Data Fusion would be another example.
- Know about L1 and L2 regularization and linear and logistic regression. I probably failed the question on how to improve the area under the curve (AUC); a quick refresher is sketched below.
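
  For a refresher, a scikit-learn sketch comparing L1 and L2 penalties on a logistic regression and scoring AUC; the synthetic dataset and hyperparameters are just placeholders:

  ```python
  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import roc_auc_score
  from sklearn.model_selection import train_test_split

  X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  # L2 regularization shrinks all weights; L1 drives some weights to exactly
  # zero, effectively acting as feature selection.
  for penalty in ("l2", "l1"):
      model = LogisticRegression(penalty=penalty, solver="liblinear", C=1.0)
      model.fit(X_train, y_train)
      auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
      print(penalty, round(auc, 3))
  ```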
- Learn the different database services. Understand which ones support transactions, which are NoSQL (no way to LEFT JOIN), which can autoscale storage, and which are multi-regional, regional, or have the option to be zonal. Bear in mind that there is no node auto-scaling for Bigtable or Cloud Spanner.
- Learn Cloud SQL HA (https://cloud.google.com/sql/docs/mysql/high-availability), which protects against zonal failures.
- Understand common Cloud SQL read replica use cases, such as sending analytics traffic to the read replica, and the invalid use cases at https://cloud.google.com/sql/docs/mysql/replication/tips, e.g., a read replica is not an option for HA or automatic failover.
- Practice how to create authorized views and date/time-partitioned tables; a sketch follows.
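
  A minimal sketch with the google-cloud-bigquery client; the project, dataset, and table names are hypothetical:

  ```python
  from google.cloud import bigquery

  client = bigquery.Client(project="my-project")  # hypothetical project

  # 1. Create a table partitioned by a TIMESTAMP column.
  table = bigquery.Table(
      "my-project.sales.events",
      schema=[
          bigquery.SchemaField("event_ts", "TIMESTAMP"),
          bigquery.SchemaField("amount", "NUMERIC"),
      ],
  )
  table.time_partitioning = bigquery.TimePartitioning(
      type_=bigquery.TimePartitioningType.DAY, field="event_ts"
  )
  client.create_table(table)

  # 2. Create a view, then authorize it against the source dataset so users
  #    with access to the view's dataset can query it without read access to
  #    the underlying tables.
  view = bigquery.Table("my-project.reports.daily_totals")
  view.view_query = """
      SELECT DATE(event_ts) AS day, SUM(amount) AS total
      FROM `my-project.sales.events`
      GROUP BY day
  """
  view = client.create_table(view)

  source_dataset = client.get_dataset("my-project.sales")
  entries = list(source_dataset.access_entries)
  entries.append(bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
  source_dataset.access_entries = entries
  client.update_dataset(source_dataset, ["access_entries"])
  ```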
- I probably failed on monitoring and tracking each project’s BigQuery slot usage and allocation; one way to check slot usage is sketched below. Study BigQuery optimization for sure.
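
  One approach I know of (not necessarily what the exam expects) is to query INFORMATION_SCHEMA for slot milliseconds per job; the region and project are hypothetical:

  ```python
  from google.cloud import bigquery

  client = bigquery.Client(project="my-project")  # hypothetical project

  # Aggregate total_slot_ms over the last day, grouped by user.
  query = """
      SELECT user_email,
             SUM(total_slot_ms) / (1000 * 60 * 60) AS slot_hours
      FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
      WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
      GROUP BY user_email
      ORDER BY slot_hours DESC
  """
  for row in client.query(query).result():
      print(row.user_email, round(row.slot_hours, 2))
  ```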
- I happened to work on a project that required creating an aggregated log export to a GCS sink at the folder level; otherwise, how else could you export the logs of all projects under a given folder? A sketch follows.
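
  A hedged sketch of a folder-level aggregated sink using the lower-level Cloud Logging client; the folder ID and bucket name are hypothetical, and the import paths should be verified against your google-cloud-logging version:

  ```python
  from google.cloud.logging_v2.services.config_service_v2 import ConfigServiceV2Client
  from google.cloud.logging_v2.types import LogSink

  client = ConfigServiceV2Client()

  # include_children=True is what makes this an aggregated sink: it exports
  # the logs of every project under the folder, not just the folder itself.
  sink = LogSink(
      name="folder-logs-to-gcs",
      destination="storage.googleapis.com/my-log-archive-bucket",  # hypothetical bucket
      include_children=True,
  )
  client.create_sink(parent="folders/123456789", sink=sink)  # hypothetical folder ID
  ```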
- I happened to architect a solution for publishing messages from Pub/Sub to Kafka using the source connector (part of Kafka Connect), configuring cps.subscription as the Pub/Sub subscription and kafka.topic as the destination topic in Kafka. Also learn how to duplicate or mirror messages from an existing Kafka installation (in AWS or on premises) to Pub/Sub with the sink connector, where cps.topic is the Pub/Sub topic to publish to.
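
  A hedged sketch of registering that source connector through the Kafka Connect REST API; the connector class name is from the GoogleCloudPlatform/pubsub Kafka connector as I remember it, and the project, subscription, topic, and worker host are hypothetical:

  ```python
  import requests

  # Source connector: pulls messages from a Pub/Sub subscription and
  # publishes them to a Kafka topic.
  source_connector = {
      "name": "pubsub-to-kafka",  # hypothetical connector name
      "config": {
          # Connector class shipped with the GoogleCloudPlatform/pubsub Kafka
          # connector; verify against the version you deploy.
          "connector.class": "com.google.pubsub.kafka.source.CloudPubSubSourceConnector",
          "tasks.max": "4",
          "cps.project": "my-project",            # hypothetical GCP project
          "cps.subscription": "my-subscription",  # Pub/Sub subscription to read from
          "kafka.topic": "my-kafka-topic",        # Kafka topic to write to
      },
  }

  # Register the connector with a Kafka Connect worker (hypothetical host).
  # For the opposite direction (Kafka -> Pub/Sub), the sink connector takes
  # the Kafka "topics" to read and "cps.topic" as the Pub/Sub topic to publish to.
  resp = requests.post("http://connect-worker:8083/connectors", json=source_connector)
  resp.raise_for_status()
  ```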
- Study how to use the dead-letter queue concept to build a more robust, resilient Dataflow pipeline: a step with try/catch logic pipes malformed data elements to a dead-letter sink such as BigQuery (sketched below).
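
  A minimal Apache Beam (Python) sketch of the pattern: a try/except in the DoFn routes malformed elements to a tagged dead-letter output; the BigQuery sink is only indicated in a comment:

  ```python
  import json

  import apache_beam as beam
  from apache_beam import pvalue

  class ParseJson(beam.DoFn):
      """Parse raw strings; route anything malformed to a dead-letter output."""

      def process(self, element):
          try:
              yield json.loads(element)
          except (ValueError, TypeError):
              # The tagged output acts as the dead-letter queue.
              yield pvalue.TaggedOutput("dead_letter", element)

  with beam.Pipeline() as p:
      results = (
          p
          | "Read" >> beam.Create(['{"id": 1}', "not-json"])
          | "Parse" >> beam.ParDo(ParseJson()).with_outputs("dead_letter", main="parsed")
      )
      good, bad = results.parsed, results.dead_letter

      _ = good | "UseGood" >> beam.Map(print)
      # In a real pipeline the dead-letter branch would go to a sink such as
      # beam.io.WriteToBigQuery for later inspection and replay.
      _ = bad | "UseBad" >> beam.Map(lambda e: print("dead letter:", e))
  ```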
- Study common use cases of side inputs to enrich data, for example using a lookup table as a side input to enrich elements in the main pipeline (sketched below).
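
  A minimal Apache Beam (Python) sketch of a lookup-table side input; the lookup data and order records are dummy values:

  ```python
  import apache_beam as beam
  from apache_beam import pvalue

  with beam.Pipeline() as p:
      # Small lookup table, e.g. country code -> country name.
      lookup = p | "Lookup" >> beam.Create([("US", "United States"), ("DE", "Germany")])

      orders = p | "Orders" >> beam.Create([{"id": 1, "country": "US"},
                                            {"id": 2, "country": "DE"}])

      enriched = orders | "Enrich" >> beam.Map(
          # The whole lookup PCollection is handed to every element as a dict.
          lambda order, countries: {
              **order,
              "country_name": countries.get(order["country"], "unknown"),
          },
          countries=pvalue.AsDict(lookup),
      )
      _ = enriched | "Print" >> beam.Map(print)
  ```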