I've written this post to draw your attention to some important points about deleting entities in Datastore through the Dataflow service. In my opinion, the documentation page of the Dataflow template for deleting entities isn't very helpful, and it isn't up to date either.
Job Common Properties
- First of all, visit the Dataflow service page and click the Create job from template button. You will create a Dataflow job from the template named Bulk Delete Entities in Datastore.
- Give your job a name.
- Select a regional endpoint. Job metadata will be stored there; pick the same region as your Datastore. (If you prefer doing all of this from code, there is a sketch right after this list.)
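By the way, everything this console flow does can also be driven through the Dataflow templates.launch REST API. Below is a minimal sketch using the google-api-python-client library; the project ID, job name, region and the template path gs://dataflow-templates/latest/Datastore_to_Datastore_Delete are my own assumptions (the Bulk Delete Entities in Datastore template is, to my knowledge, published as the Google-provided Datastore_to_Datastore_Delete template), so adjust them to your setup.

```python
# A minimal sketch of launching the same template through the Dataflow REST
# API (v1b3). Assumes google-api-python-client is installed and application
# default credentials are configured; every concrete name below is a placeholder.
from googleapiclient.discovery import build

PROJECT_ID = "my-project"          # project that will run the Dataflow job
REGION = "europe-west1"            # regional endpoint, same region as Datastore
JOB_NAME = "bulk-delete-entities"  # the name you give to your job

# The Bulk Delete Entities in Datastore template is (to my knowledge)
# published under this path among the Google-provided templates.
TEMPLATE_PATH = "gs://dataflow-templates/latest/Datastore_to_Datastore_Delete"

dataflow = build("dataflow", "v1b3")
launch = dataflow.projects().locations().templates().launch(
    projectId=PROJECT_ID,
    location=REGION,
    gcsPath=TEMPLATE_PATH,
    body={
        "jobName": JOB_NAME,
        # "parameters" and "environment" will hold the required and optional
        # parameters described in the rest of this post.
        "parameters": {},
        "environment": {},
    },
)
# launch.execute() would submit the job; the complete call, with everything
# filled in, is shown at the end of the post.
```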
Required Parameters
- Type your GQL query. It can simply be SELECT * FROM [kind_name].
- Type your read_project_id.
- Type your delete_project_id. At this point you may ask why there are two project IDs: Dataflow reads the entities from your read_project_id, then deletes them from the delete_project_id.
- Temporary location is required to store some metadata, logs and similar things. Type a bucket path, for example gs://temp-dataflow-delete/path/ (the same required fields appear in the API sketch below).
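Through the API, the required fields would look roughly like the sketch below. The parameter names (datastoreReadGqlQuery, datastoreReadProjectId, datastoreDeleteProjectId) are taken from the template's documented parameters as far as I remember, so double-check them against the template reference.

```python
# A sketch of the required fields; all values are placeholders.
parameters = {
    "datastoreReadGqlQuery": "SELECT * FROM [kind_name]",  # replace [kind_name] with your kind
    "datastoreReadProjectId": "my-read-project",           # project the entities are read from
    "datastoreDeleteProjectId": "my-delete-project",       # project the entities are deleted from
}

# When launching through the API, the temporary location goes into the
# runtime environment instead of the template parameters.
environment = {
    "tempLocation": "gs://temp-dataflow-delete/path/",
}
```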
In theory, you shouldn't need to define any additional parameters. You defined the required ones, right? But there is more; I'll collect everything into a full API sketch after the list below.
- For example, we haven't defined the namespace yet. The job will delete entities, but from where? Click Show Optional Parameters to define it.
- Type your Datastore namespace into the first field.
- The UDF GCS path isn't required at this point; leave it blank.
- Same for the UDF function name; leave it blank.
- Max workers. This is the most important point, although GCP didn't pay enough attention to it. As you can imagine, it limits the worker count. Dataflow provisions VM instances in your account to perform the read and delete operations, and this parameter sets the maximum number of workers in that group. Specify it as you wish.
- Number of workers defines the initial number of workers, for example 1. The group will scale the instances up to Max workers.
- Select the worker region and zone. It's best to pick the same region as your Datastore, so you don't pay inter-zone or inter-region data transfer prices.
- If you would like to associate a service account with the workers, type its email address.
- Machine type is another important point. However, they didn't pay enough attention to this parameter either, as always.
- Additional experiments isn't required; leave it blank.
- Worker IP address configuration is important. If you don't have a special case, I suggest choosing private. If you select public, you will be charged, because GCP assumes it's serving a customer application or a 3rd party. If you have enabled Private Google Access in your VPC, you will get a secure and high-performance connection, and you won't be charged. Yes, again they didn't pay attention. :) I wonder why? :)
- You can specify the VPC network.
- Also the subnetwork.
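Putting all of this together, a full launch through the API could look like the sketch below. It is still only a sketch: the optional parameter name datastoreReadNamespace, the environment field names, and every concrete project, network, zone and service account value are assumptions on my part, so verify them against your own setup before running it.

```python
# A sketch of the complete launch, combining the required and optional
# parameters discussed above. All concrete names, IDs and paths are placeholders.
from googleapiclient.discovery import build

parameters = {
    # Required parameters of the bulk-delete template.
    "datastoreReadGqlQuery": "SELECT * FROM [kind_name]",
    "datastoreReadProjectId": "my-read-project",
    "datastoreDeleteProjectId": "my-delete-project",
    # Optional: the namespace to read from. The UDF path and function name
    # are simply omitted, which corresponds to leaving them blank in the console.
    "datastoreReadNamespace": "my-namespace",
}

environment = {
    "tempLocation": "gs://temp-dataflow-delete/path/",
    "maxWorkers": 5,                  # upper bound for autoscaling
    "numWorkers": 1,                  # initial number of workers
    "zone": "europe-west1-b",         # keep the workers in the Datastore region
    "serviceAccountEmail": "dataflow-worker@my-project.iam.gserviceaccount.com",
    "machineType": "n1-standard-1",   # pick whatever fits your workload
    "ipConfiguration": "WORKER_IP_PRIVATE",  # private IPs, no public IP charges
    "network": "my-vpc",
    "subnetwork": "regions/europe-west1/subnetworks/my-subnet",
}

dataflow = build("dataflow", "v1b3")
response = dataflow.projects().locations().templates().launch(
    projectId="my-project",
    location="europe-west1",
    gcsPath="gs://dataflow-templates/latest/Datastore_to_Datastore_Delete",
    body={
        "jobName": "bulk-delete-entities",
        "parameters": parameters,
        "environment": environment,
    },
).execute()

print(response["job"]["id"])
```

If the call succeeds, the response contains the created job, and you can follow it from the Dataflow jobs page just like a job created in the console.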
That's it. You can run your job safely from a price perspective. I also suggest using the burstable instance types, f1-micro and g1-small, if you don't have performance concerns, because they are cheaper :)