Organizations often need to detect, redact, and sometimes encrypt Personally Identifiable Information (PII) and other sensitive data, such as credit card numbers, to protect themselves against data exposure. If any part of the network is compromised, redaction or encryption acts as an additional safeguard, because the exposed data remains unreadable. Google Cloud Platform’s Cloud Data Loss Prevention (DLP) API lets clients detect PII and other privacy-sensitive data in user-supplied, unstructured data such as paragraphs of text, images, or audio recordings (which must first be converted into text with the Speech-to-Text API).

DLP API can classify and redact sensitive data. It supports several customizations, including regular expressions (regex), dictionaries, and other predefined detection rules. DLP API works on text or images supplied directly in a request, and it can also scan data already stored in Google Cloud Storage (GCS) and BigQuery.
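As a rough illustration of the regex and dictionary customizations, the sketch below builds an inspection configuration as a plain Python dict in the shape the `google-cloud-dlp` client accepts. The custom infoType names `EMPLOYEE_ID` and `PROJECT_CODENAME`, the regex pattern, and the word list are hypothetical. The `inspect_text` helper assumes the `google-cloud-dlp` package is installed and credentials are configured, so it imports the client lazily and is not called here.

```python
def build_inspect_config():
    """Return an InspectConfig-shaped dict mixing built-in and custom detectors."""
    return {
        # Built-in detectors are referenced by name.
        "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "CREDIT_CARD_NUMBER"}],
        "custom_info_types": [
            {
                # Hypothetical regex detector for IDs such as "EMP-123456".
                "info_type": {"name": "EMPLOYEE_ID"},
                "regex": {"pattern": r"EMP-\d{6}"},
            },
            {
                # Hypothetical dictionary detector for internal code names.
                "info_type": {"name": "PROJECT_CODENAME"},
                "dictionary": {"word_list": {"words": ["bluebird", "redwood"]}},
            },
        ],
        # Ask the API to return the matched text with each finding.
        "include_quote": True,
    }


def inspect_text(project_id, text):
    """Send text to the DLP API (assumes google-cloud-dlp and credentials)."""
    from google.cloud import dlp_v2  # lazy import; assumed to be installed

    client = dlp_v2.DlpServiceClient()
    response = client.inspect_content(
        request={
            "parent": f"projects/{project_id}",
            "inspect_config": build_inspect_config(),
            "item": {"value": text},
        }
    )
    return [(f.info_type.name, f.quote) for f in response.result.findings]
```

Keeping the configuration in its own function makes it easy to review or unit-test the detection rules without calling the service.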

DLP API comes with language-specific SDKs, customization support, the ability to inspect and redact data stored in Google Cloud Storage (GCS) and BigQuery, and support for images.

Features of DLP API

DLP API has over 120 pre-built detectors (infoType detectors), and organizations can create custom detectors for their specific use cases.
After detecting sensitive data, DLP API can redact, mask, tokenize, and transform text and images to ensure privacy.
DLP API is a managed service, so GCP scales it according to the volume of data submitted.
The API’s classification results can be sent directly to BigQuery for detailed analysis, or exported to another environment.
Cloud DLP handles data securely and undergoes multiple independent third-party audits to test data safety, privacy, and security.
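To make the masking feature above more concrete, here is a minimal sketch of a de-identification configuration that replaces every character of a finding with a masking character. The helper names are hypothetical. `deidentify_text` assumes the `google-cloud-dlp` package and valid credentials, so it is defined but not called.

```python
def build_mask_config(info_type_names, masking_character="#"):
    """Return a DeidentifyConfig-shaped dict that masks the given infoTypes."""
    return {
        "info_type_transformations": {
            "transformations": [
                {
                    "info_types": [{"name": name} for name in info_type_names],
                    "primitive_transformation": {
                        # number_to_mask is left unset, so the whole finding is masked.
                        "character_mask_config": {
                            "masking_character": masking_character
                        }
                    },
                }
            ]
        }
    }


def deidentify_text(project_id, text, info_type_names):
    """Mask findings in text (assumes google-cloud-dlp and credentials)."""
    from google.cloud import dlp_v2  # lazy import; assumed to be installed

    client = dlp_v2.DlpServiceClient()
    response = client.deidentify_content(
        request={
            "parent": f"projects/{project_id}",
            "inspect_config": {
                "info_types": [{"name": name} for name in info_type_names]
            },
            "deidentify_config": build_mask_config(info_type_names),
            "item": {"value": text},
        }
    )
    return response.item.value
```

Other primitive transformations, such as replacement or tokenization, would slot into the same `primitive_transformation` field.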

DLP Proxy Architecture

One way to remove PII data is to route all queries and results through a module that parses, inspects, and logs the findings, and de-identifies those results using Cloud DLP, before returning the requested data or forwarding it to the next step. This module or service is termed a DLP Proxy.
The DLP proxy application accepts an SQL query as input, runs that query on the database, and then applies Cloud DLP to the results before returning them to the user requesting the data.
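The query-then-de-identify flow can be sketched in a few lines. The example below is an assumption-laden illustration, not the actual proxy: it uses an in-memory SQLite database as a stand-in for the real data source, converts the query results into the table-shaped item that Cloud DLP accepts, and hands that item to a `deidentify` callable. All function names are hypothetical, and the callable stands in for the real call to Cloud DLP.

```python
import sqlite3


def rows_to_dlp_table(column_names, rows):
    """Convert query results into a DLP ContentItem-shaped dict with a table."""
    return {
        "table": {
            "headers": [{"name": name} for name in column_names],
            "rows": [
                {
                    "values": [
                        {"string_value": "" if value is None else str(value)}
                        for value in row
                    ]
                }
                for row in rows
            ],
        }
    }


def run_query_as_dlp_item(connection, sql):
    """Run the SQL query and package its results for Cloud DLP."""
    cursor = connection.execute(sql)
    column_names = [description[0] for description in cursor.description]
    return rows_to_dlp_table(column_names, cursor.fetchall())


def proxy_query(connection, sql, deidentify):
    """Run the query, then pass the results through de-identification.

    `deidentify` is a hypothetical callable that would send the item to
    Cloud DLP and return the transformed item.
    """
    return deidentify(run_query_as_dlp_item(connection, sql))


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (name TEXT, email TEXT)")
    conn.execute("INSERT INTO customers VALUES ('Ada', 'ada@example.com')")
    # An identity function stands in for the Cloud DLP call in this demo.
    print(proxy_query(conn, "SELECT name, email FROM customers", lambda item: item))
```

Because the caller only ever receives what `deidentify` returns, raw query results never leave the proxy.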

Fig: the architecture of the DLP proxy application

Cloud DLP allows detailed configuration of what types of data to inspect for and how to transform the data based on these inspection findings or data structures (like field names). To simplify the creation and management of the configuration, organizations can use Cloud DLP templates. The DLP proxy application references both inspect and de-identify templates.
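As a sketch of how a request might reference templates instead of carrying inline configuration, the helper below builds a de-identify request that points at an inspect template and a de-identify template by resource name. The project and template IDs are placeholders, and the helper name is hypothetical. Sending the request assumes the `google-cloud-dlp` package and credentials, so it is shown only as a comment.

```python
def build_template_request(project_id, inspect_template_id, deidentify_template_id, item):
    """Return a DeidentifyContentRequest-shaped dict that references templates."""
    parent = f"projects/{project_id}"
    return {
        "parent": parent,
        # Templates are referenced by their full resource names.
        "inspect_template_name": f"{parent}/inspectTemplates/{inspect_template_id}",
        "deidentify_template_name": f"{parent}/deidentifyTemplates/{deidentify_template_id}",
        "item": item,
    }


# With google-cloud-dlp installed and credentials configured, the request
# could be sent roughly like this (placeholder IDs):
#
#   from google.cloud import dlp_v2
#   client = dlp_v2.DlpServiceClient()
#   request = build_template_request(
#       "my-project", "my-inspect-template", "my-deid-template",
#       {"value": "Contact me at ada@example.com"},
#   )
#   response = client.deidentify_content(request=request)
```

Keeping the detection and transformation rules in templates means they can be updated centrally without redeploying the proxy.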

Cloud Audit Logs is an integrated logging service from Google Cloud Platform used in the architecture shown above. Cloud Audit Logs provides an audit trail of calls made to the DLP API. The audit log entries include information about who made the API call, which Cloud project it was run against, and details about the request, including whether a template was used as part of the request. If you use the application’s configuration file to turn on auditing, Cloud Audit Logs records a summary of the inspection findings.

Cloud Key Management Service (Cloud KMS) is a cloud-hosted key management service from Google Cloud that lets you manage your cloud services’ cryptographic keys.

Cloud DLP methods for tokenization and date shifting use cryptography to generate replacement values. These cryptographic methods use a key to encrypt those values consistently to maintain referential integrity or, for reversible processes, to detokenize. You can directly provide this key to Cloud DLP when the call is made, or you can wrap it by using Cloud KMS. Wrapping your key in Cloud KMS provides another layer of access control and auditing, and is therefore the preferred method for production deployments.
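The sketch below shows roughly what a tokenization configuration with a Cloud KMS-wrapped key could look like, using deterministic encryption so that the same input always yields the same token. The helper name, the surrogate name `TOKEN`, and the way the wrapped key is supplied (base64 text) are assumptions for illustration. The key itself must be created and wrapped with Cloud KMS beforehand, which is not shown.

```python
import base64


def build_tokenize_config(info_type_names, kms_key_name, wrapped_key_b64,
                          surrogate_name="TOKEN"):
    """Return a DeidentifyConfig-shaped dict using a KMS-wrapped key.

    kms_key_name is the full Cloud KMS resource name of the wrapping key, and
    wrapped_key_b64 is the base64-encoded wrapped data key (both supplied by
    the caller; nothing is generated here).
    """
    return {
        "info_type_transformations": {
            "transformations": [
                {
                    "info_types": [{"name": name} for name in info_type_names],
                    "primitive_transformation": {
                        "crypto_deterministic_config": {
                            "crypto_key": {
                                "kms_wrapped": {
                                    # The API expects the wrapped key as raw bytes.
                                    "wrapped_key": base64.b64decode(wrapped_key_b64),
                                    "crypto_key_name": kms_key_name,
                                }
                            },
                            # The surrogate label marks tokens so they can be
                            # found again when detokenizing.
                            "surrogate_info_type": {"name": surrogate_name},
                        }
                    },
                }
            ]
        }
    }
```

Because only the wrapped key appears in the request, a caller also needs permission on the Cloud KMS key before Cloud DLP can unwrap and use it, which is where the extra layer of access control comes from.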

For production configurations, organizations should use the principle of least privilege to assign permissions. The following diagram incorporates this principle.


Fig: the architecture of the DLP proxy application with least privilege

The preceding diagram shows how in a typical production configuration, there are three personas with different roles and access to the raw data:

Infrastructure admin: Installs and configures the proxy to access the Cloud DLP proxy’s compute environment.
Data analyst: Accesses the client that connects to the DLP proxy.
Security admin: Classifies the data, creates the Cloud DLP templates, and configures Cloud KMS.

Conclusion

Google Cloud Platform’s Data Loss Prevention API helps organizations manage sensitive data, including detecting, redacting, masking, and tokenizing it. This can help organizations comply with regulations such as GDPR, and reduce the risk of data exposure and data breaches.

To get hands-on experience on Google Cloud’s DLP API, try the website located here.