In preparation for GDPR compliance, a Global 100 financial services organization set out to assess its core information processing environments, with the objective of identifying opportunities to strengthen its privacy data protection programs. This article focuses on the technology challenges, approach, and lessons learned for the centralized testing environment.
Like many DevOps groups across the industry, this financial organization has adopted both continuous testing and quality testing regimes to deliver quality products using agile methodology. The organization prefers to prepare test data from production data. While the majority of testing is done by an internal team, certain applications are tested by outsourced offshore teams. The test environment is fairly complex, comprising Oracle, Hadoop (Parquet files), Hive, Cassandra, MS SQL, SAS, and Linux-based systems. Incremental data volume varies between 10 million and 15 million records on a weekly basis. Certain major releases of big-data-based applications require up to 5 GB of data (~75 million records).
To comply with the GDPR and prevent privacy data breaches, the testing team needed to detect and de-identify PII elements. If they use available de-identification methods that rely on product-specific encryption technology, such as MS SQL encryption, much of the data becomes unusable for testing for the following reasons:
- a) current methods scramble the data and make it unusable for testing
- b) current methods do not preserve referential relationships between the various data sources.
If they choose to mask the data instead, they face similar challenges. For example, if they want to test an application that calculates the end-of-month summary balance of a customer account using an Oracle data source and a Hadoop data source, they would not be able to use data encrypted with the available technology, because the same account would no longer carry the same identifier in both sources.
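To illustrate why a shared, deterministic transformation keeps the month-end balance example testable, here is a minimal sketch. It uses keyed hashing (HMAC-SHA256) as a stand-in, which is not the encryption discussed in this article and is not reversible. The key, field names, and sample rows are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical shared secret; a real deployment would fetch this from a key manager.
SECRET_KEY = b"demo-key-not-for-production"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Map a value to a stable token: same input and same key give the same token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


# Hypothetical extracts standing in for an Oracle table and a Hadoop data set.
oracle_accounts = [
    {"account_id": "ACC-1001", "opening_balance": 500.0},
    {"account_id": "ACC-1002", "opening_balance": 120.0},
]
hadoop_transactions = [
    {"account_id": "ACC-1001", "amount": 250.0},
    {"account_id": "ACC-1001", "amount": -100.0},
    {"account_id": "ACC-1002", "amount": 30.0},
]


def deidentify(rows, column):
    """Return copies of the rows with one column replaced by its token."""
    return [{**row, column: pseudonymize(row[column])} for row in rows]


def month_end_balances(accounts, transactions):
    """Join the two sources on account_id and sum the month's transactions."""
    balances = {a["account_id"]: a["opening_balance"] for a in accounts}
    for t in transactions:
        balances[t["account_id"]] += t["amount"]
    return balances


masked_accounts = deidentify(oracle_accounts, "account_id")
masked_transactions = deidentify(hadoop_transactions, "account_id")
masked_balances = month_end_balances(masked_accounts, masked_transactions)
```

Because both extracts are tokenized with the same key, the join on `account_id` still lines up, and the computed balances match those from the original data even though no real account identifier appears in the test set.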
In addition, PII often appears within comment and description fields; encrypting or masking the entire field would result in the loss of important information.
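As a rough illustration of handling PII inside free text, the sketch below replaces only the matching substrings and leaves the rest of the comment intact. The two patterns (email address and a simplified IBAN shape) are illustrative assumptions, far narrower than a real detection library, and the placeholder labels stand in for whatever encryption or tokenization a real solution would apply.

```python
import re

# Simplified, illustrative patterns; real PII detection needs much broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}


def redact_free_text(text: str) -> str:
    """Replace only the PII substrings, keeping the surrounding text readable."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


comment = "Refund sent to DE89370400440532013000; confirm with jane.doe@example.com today."
redacted = redact_free_text(comment)
print(redacted)
# Refund sent to <IBAN>; confirm with <EMAIL> today.
```

The business meaning of the comment (a refund was sent and needs confirmation) survives, which would not be the case if the whole field were masked.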
More importantly, data encryption using the available methods is computationally time-consuming and requires a large hardware infrastructure.
The organization identified the following solution criteria to mitigate the challenges identified during the assessment:
- Autonomous Detection: Leveraging a centralized library, the solution should examine all incoming data, including embedded documents, for the presence of PII elements. The solution should also use machine learning techniques to classify sensitive documents present in the big data repository.
- Format Preserving Encryption: Based on the type of PII data and the preference of the user, the solution should encrypt the data elements in one of the following three modes:
  - Blind mode: It should encrypt a data element if the data element matches a specific regular expression
  - Column mode: It should encrypt the content of a specific column or field
  - Mixed mode: It should encrypt the data elements within a specific column if the data element matches a specific regular expression
- Cross Platform Referential Integrity: The solution must be able to retain referential integrity between records across platforms.
- Big Data Volume: The solution should be able to detect and encrypt sensitive data in 100 GB of data in less than one hour (a sustained throughput of roughly 28 MB/s) using commodity hardware.
- Data Usage Monitoring: The solution should be able to record and retain information on all privacy data usage for audit and compliance. In addition, the solution should be able to identify abnormal data usage using machine learning.
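The three encryption modes listed in the criteria above can be sketched roughly as follows. The `protect` function is a hypothetical stand-in that preserves length and character classes using a keyed hash; it is not real format-preserving encryption (it is not reversible and has not been reviewed for security), and the key, record layout, and card-number pattern are assumptions made for illustration.

```python
import hashlib
import hmac
import re
import string

KEY = b"demo-key-not-for-production"  # hypothetical key


def protect(value: str, key: bytes = KEY) -> str:
    """Stand-in for format-preserving encryption (NOT real FPE, not reversible).

    Digits map to digits and letters to letters of the same case; all other
    characters are kept, so the output has the same shape as the input.
    """
    stream = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    for i, ch in enumerate(value):
        b = stream[i % len(stream)]
        if ch in string.digits:
            out.append(string.digits[b % 10])
        elif ch in string.ascii_lowercase:
            out.append(string.ascii_lowercase[b % 26])
        elif ch in string.ascii_uppercase:
            out.append(string.ascii_uppercase[b % 26])
        else:
            out.append(ch)
    return "".join(out)


def blind_mode(record: dict, pattern: re.Pattern) -> dict:
    """Protect every regex match in every string field of the record."""
    return {
        k: pattern.sub(lambda m: protect(m.group()), v) if isinstance(v, str) else v
        for k, v in record.items()
    }


def column_mode(record: dict, column: str) -> dict:
    """Protect the entire content of one column."""
    return {**record, column: protect(record[column])}


def mixed_mode(record: dict, column: str, pattern: re.Pattern) -> dict:
    """Protect only the regex matches found inside one column."""
    protected = pattern.sub(lambda m: protect(m.group()), record[column])
    return {**record, column: protected}


CARD = re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b")  # illustrative pattern
record = {
    "name": "Jane Doe",
    "card": "4111-1111-1111-1111",
    "note": "Card 4111-1111-1111-1111 reported lost",
}

blind = blind_mode(record, CARD)
by_column = column_mode(record, "name")
mixed = mixed_mode(record, "note", CARD)
```

Because the stand-in is deterministic, the card number receives the same token in the `card` column and inside the `note` text, which is the property that keeps records linkable across fields and platforms.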
The assessment also yielded the following lessons learned:
- Understand the business and technology landscape: It is imperative to understand the current technology landscape, business practices, and emerging trends. If your technology platform and domain are monolithic today, do you expect them to remain monolithic in the near future? What would be the impact should you move some of your testing to a cloud platform? What about big data applications?
- Evaluate risks: Assess data security risks through the lens of GDPR and beyond. In addition to PII and PHI, most organizations deal with many kinds of sensitive data that may not be associated with an individual. How do you detect, encrypt, and monitor other types of sensitive data, such as B2B contract information, in your testing environment?
- Beyond Retrofitting: Define the ideal solution characteristics prior to evaluating solutions. Retrofitting a solution to meet your business needs is often time-consuming and costly.
To learn more, please visit https://pricchaa.com/gdpr
In the next part, we will share how this organization plans to address GDPR compliance requirements when testing is done by a third-party vendor.