Big Data is an emerging set of technologies that gives organizations greater insight into their large volumes of data, driving better business decisions and greater customer satisfaction. The aggregation of data in Big Data systems also makes them an attractive target for hackers. Organizations must handle this data efficiently and protect sensitive customer data in order to comply with privacy laws and compliance requirements. Securing big data is difficult for several reasons. Some are listed below:
- There are multiple feeds of data in real time from different sources with different protection needs.
- There are multiple types of data combined together.
- The data is being accessed by many different users with various analytical requirements.
- The tools are evolving rapidly, driven by the open source community.
- Data is automatically replicated across multiple nodes.
There are multiple ways to protect data in a Hadoop environment:
- File system level encryption:
This encryption is commonly used to protect sensitive information in files and folders. It is also known as “data at rest” encryption. Data is encrypted at the file level and is protected while it resides on data stores. However, this approach does not protect data while it is in use within the system. The data is automatically decrypted when it is read through the operating system, so it is fully exposed to any user or process, authorized or not, that has access to the running system.
- Database encryption:
File system level encryption can also be used to protect data stored in a database. In addition, several techniques are available specifically for database encryption, including Transparent Data Encryption (TDE) and column-level encryption. TDE encrypts an entire database, while column-level encryption encrypts individual columns.
- Transport level encryption:
This encryption is used to protect data in motion using SSL/TLS protocols.
- Application level encryption:
This encryption uses APIs to protect data on the application side, before it is written to storage or sent over the network.
- Format preserving encryption:
Format preserving encryption (FPE) encrypts data without changing its original format. For example, a 16-digit number encrypts to another 16-digit number. This allows existing applications and databases to keep using the protected data. Protection is applied at the field level, which makes it possible to protect the sensitive parts of the data while leaving the non-sensitive parts readable for applications.
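To make the idea of format preserving encryption concrete, here is a minimal sketch in Python. It is a toy balanced Feistel network over even-length digit strings, built on HMAC-SHA256 from the standard library. It is not the standardized FF1 mode from NIST SP 800-38G and should not be used to protect real data. The function names, the round count and the demo key are all assumptions made for illustration.

```python
import hashlib
import hmac

ROUNDS = 8  # illustrative round count, chosen arbitrarily for this sketch


def _round_value(key: bytes, round_no: int, half: str, modulus: int) -> int:
    """Keyed round function: HMAC-SHA256 of the round number and one half."""
    msg = bytes([round_no]) + half.encode("ascii")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % modulus


def _check(digits: str) -> None:
    # This toy version only accepts ASCII digit strings of even length.
    if not (digits.isascii() and digits.isdigit()) or len(digits) % 2 != 0:
        raise ValueError("expected an even-length string of ASCII digits")


def fpe_encrypt_digits(key: bytes, digits: str) -> str:
    """Encrypt a digit string into another digit string of the same length."""
    _check(digits)
    half = len(digits) // 2
    modulus = 10 ** half
    left, right = digits[:half], digits[half:]
    for r in range(ROUNDS):
        # Feistel step: (L, R) -> (R, L + F(R) mod 10^half)
        mixed = (int(left) + _round_value(key, r, right, modulus)) % modulus
        left, right = right, str(mixed).zfill(half)
    return left + right


def fpe_decrypt_digits(key: bytes, digits: str) -> str:
    """Invert fpe_encrypt_digits by running the rounds backwards."""
    _check(digits)
    half = len(digits) // 2
    modulus = 10 ** half
    left, right = digits[:half], digits[half:]
    for r in reversed(range(ROUNDS)):
        prev_right = left
        prev_left = (int(right) - _round_value(key, r, prev_right, modulus)) % modulus
        left, right = str(prev_left).zfill(half), prev_right
    return left + right


if __name__ == "__main__":
    demo_key = b"demo-key-not-for-production"  # hypothetical key for the demo
    card = "4111111111111111"                  # well-known test card number
    protected = fpe_encrypt_digits(demo_key, card)
    print(protected, fpe_decrypt_digits(demo_key, protected) == card)
```

The protected value is still a string of 16 digits, so a column or file field that expects a card-number-shaped value can store it unchanged, which is the property the description above relies on.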
As large volumes of data from multiple sources such as machine sensors, server logs and applications flow into the Hadoop data lake, it serves as a central repository for a broad and diverse set of data. The data lake needs comprehensive security because it stores vital and often highly sensitive business data. Data can be protected at multiple stages in Hadoop (before entering the data lake, while entering the data lake or after it has entered the data lake):
1. Data protection at the source application: In this scenario, the data is encrypted or tokenized before it is imported into Hadoop. This is the ideal scenario. It ensures that data is protected throughout its entire lifecycle and keeps Hadoop out of scope for compliance purposes. This option requires an interface to the source applications for encryption and tokenization. The protected data is then imported into Hadoop.
2. Data protection during import into Hadoop: This option does not need any access to the source applications. Here data is protected in the landing zone as it enters Hadoop.
3. Data protection within Hadoop: This option protects data fields once they have been identified in Hadoop. It uses interfaces running within Hadoop jobs and integrates with Hadoop modules such as Hive, Impala, Sqoop, Spark, Storm, Kafka and NiFi.
4. Storage-level encryption within Hadoop: Storage-level encryption protects data against physical theft or accidental loss of a disk volume. This option uses Transparent Data Encryption (TDE) within the Hadoop Distributed File System (HDFS) to create a safe landing zone. It adds performance overhead, because data has to be encrypted on write and decrypted on read. For better security, keys should be managed in Hardware Security Modules (HSMs) when using TDE.
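Field-level protection during import (option 2 above) can be sketched as a small transformation applied to each record before it is written to the landing zone. The Python example below is only an illustration under stated assumptions: it uses keyed-hash pseudonymization (HMAC-SHA256) as a stand-in for whatever encryption or tokenization product is actually deployed, it is one-way rather than reversible, and the record layout, field names and key are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(key: bytes, value: str) -> str:
    """Replace a value with a keyed hash. One-way: the original is not recoverable."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def protect_record(record: dict, sensitive_fields: set, key: bytes) -> dict:
    """Return a copy of the record with only the sensitive fields protected."""
    return {
        name: pseudonymize(key, value) if name in sensitive_fields else value
        for name, value in record.items()
    }


if __name__ == "__main__":
    demo_key = b"demo-key-not-for-production"   # hypothetical key
    sensitive = {"ssn", "email"}                # hypothetical field names
    incoming = {"customer_id": "1001", "ssn": "078-05-1120",
                "email": "jane@example.com", "country": "US"}
    # Protect the record before it is written to the landing zone.
    print(protect_record(incoming, sensitive, demo_key))
```

Because the same input always maps to the same output under a given key, protected fields can still be used for joins and grouping in downstream jobs, while non-sensitive fields stay readable for analytics.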