Legal Data Security Requirements

It is vital to establish any legal requirements on the data being stored. If individual customer data is being held, such as customer information for personal phones in a telco data warehouse or account details in a banking data warehouse, you may be required by law to enforce certain restrictions. The law varies from country to country, so it is impractical to enter into details, but there are certain fundamental questions that can be asked. If the data held online is going to be used for trend analysis, and is therefore held in summarized rather than detailed form, do any legal restrictions apply? Holding data only in summarized form does of course have the downside thay detailed drill-down operations are impossible.

If the customer data to be held on the data warehouse is of a minimal nature, what legal restrictions apply? It may be that the data is sufficient to identify an account but not its owner, thereby preserving anonymity, his would be sufficient to allow trending and profiling to be done blindly. If the results were to be used to put together a mail shot of a new product to identified account holders, the account ID numbers could be transferred to an operational machine where the data could be joined to customer data to cross-match account ID to customer details.

Apply Restrictions using VIEWS

The use of views to enforce security will impose a maintenance overhead. In particular, if views are used to enforce restricted access to data tables and aggregations, as these change, the views may also change. It is not un-typical for data to be stored in multiple daily tables, and rolled up into a weekly table at the end of a week. This will require the creation of all the restriction views on the weekly table when it is created. Add to this the problems of keeping track of users as they move job and as new users are added, and it soon becomes a major task just maintaining the views. This problem can be alleviated somewhat by using the role approach to security. In that case, a user moving jobs will generally require the removing of one database role and the granting of another to that user.

Another problem with using views is that they impose a parse overhead when they are accessed by an SQL statement. As well as the SQL statement itself being parsed, the underlying statement that makes up the view has to be merged into the statement and parsed. For complex views this can add seconds, or even minutes in the case of very large complex views, to the SQL statement’s running time. This may not be an issue in a data warehouse, where many of the queries will be long running and take hours to perform, in which case the long parse times become irrelevant.

High Security Environments

If higher levels of security are required, it may be necessary to go to “trusted” versions or the RDBMS that have been audited to certain levels of trust. For exaJ1lple, the normal version of a RDBMS might conform to C2 in the NSCS’s Trusted Computing System Evaluation Criteria (TCSEC or “Orange Book”) in the USA and F-C2JE3 according to the European Information Technology Security Evaluation Criteria (ITSEC). This compares to a trusted version of the RDBMS, which might conform to BI of the TCSEC and F-BIJ.E3 of the ITSEC.

The trusted RDBMS might contain, amongst other functionality, row labels and security clearance. Row labels allow each row of data to be classified, and therefore data of different classifications can be stored in the same table. Security clearance applies to users, and dictates which classifications of data they are allowed to see. The trusted RDBMS automatically enforces the data restrictions, allowing a user to see only data of their own clearance and below.

Trusted RDBMSs will generally run only on trusted operating systems, where the operating system itself conforms to trusted levels of security, thereby ensuring the security of the database files and everything external to the database itself. The overheads and costs of both the trusted operating system and trusted RDBMS are enormous, and are impractical for implementing a data warehouse unless these extreme levels of security are genuinely required. It is also worth noting that every aspect of the analysis, design and testing of the data warehouse will be affected by using such a system, leading to greatly expanded design and test schedules, and thereby vastly greater costs.

Covert Channels

A covert channel is any method whereby restricted information is inadvertently given away by implication rather than by design: for example, if there are two reports, one that displays only records that the user is authorized to see, while the other displays averages based on all records. If users calculate the same averages on the data displayed in the first report, they can infer information from any data mismatch. The second report implies the existence of other records, and in fact may also give some information about the hidden records, by the difference in the averages. It is important to avoid introducing covert channels, particularly in high security environments. Covert channels can occur very simply during the design stage. Covert channels are not typically a problem, as the majority of data warehouses do not require such high levels of security.

Managing Network Requirements

When doing the security requirements capture it is important not to overlook issues such as network security. For instance, Is it necessary to encrypt data before transferring it to the data warehouse machine? Are there restrictions on which network routes the data can take? These restrictions have processing implications that need to be carefully considered. The overheads of data encryption and decryption can be very high in both time and processing power. It is also important to note that the cost of encryption is borne by the source system, and that can prove to be very expensive if the system is an already loaded system.

Network route limitation can narrow options, and may cause bottlenecks on certain sections of the network. Restricted routing also leads to greater dependency on the restricted routes, which may be disastrous if they fail. Each time that data is to be moved there are potential security implications. Suppose some restricted data is to be transferred as a flat file to be loaded. It is important to avoid accidental or deliberate flouting of the security restrictions. If a user with access to restricted data can create generally accessible temporary tables, data can be made visible to non-authorized users. This can be overcome by having a separate temporary area for users with access to restricted data.

Documenting Security and Audit Requirements

It is important to get all the security and audit requirements clearly documented as this will be needed as part of any cost justification. This can be done as part of the SLA, but it is probably better to document the restrictions as part of a separate data warehouse security policy document as well as any design decisions on how the security will be implemented. If roles are to be used, a detailed description of every role should also be included, along with the access rights and privileges of that role. Security always costs. This is a statement that should be remembered at all times. Any security that is implemented will cost in terms of either processing power or disk space, or both. These costs eventually turn into monetary costs somewhere along the line. Views are a standard RDBMS mechanism for applying restrictions to data access. In theory they are treated just like a table, but in reality a view is a stored query in its own right, and in practice there will generally be restrictions on how views are handled by the RDBMS.

Data Warehouse Audit Requirements

Auditing is a specific subset of security that is often mandated by organizations. In particular, there may be legal requirements for auditing on operational systems. Given the volumes of data involved in a data warehouse, auditing can cause extremely heavy overheads on the system. To make up for these overheads, and to allow operations such as the overnight processing to still finish in time, generally requires more hardware. Thus, where possible or allowed, auditing should be switched off. There may .be many different audit requirements, but as far as the data warehouse is concerned all audit requirements can basically be categorized into connections, disconnections, data access and data change.

For each of these categories it may be necessary to audit success, failure or both. For security reasons, the auditing of failures can be particularly important, because they can highlight any attempted unauthorized or fraudulent access. If data access is to be audited, you need to establish whether each access is to be audited separately, or whether it is sufficient to audit the fact that a user accessed a specific table during a given session. This can radically affect the audit information that needs to be held, saving both space and I/O overheads. If changes to the data are being audited, is it sufficient to audit the fact that a change occurred, or is it required to capture the actual change that was made? Data warehouses are largely read-only environments except for the data load, and therefore data is unlikely to change once it is loaded. If the audit requirement on data change exists purely to prevent data being changed, it may be sufficient to protect the data by making it read-only, and then auditing any change from read-only to read-write.

Making Sure Data is Secure

One approach to the data security problem is to place data in separate tables depending on its classification. This allows data to be accessed directly, because no view is required for each user or role to filter unauthorized data. This approach can lead to massive duplication, as there may be .large overlaps in the data that roles can access. Take the example used above. To solve the problem of users’ not seeing their own account data we could create a separate copy of the table with just that user’s data missing for each user. The overheads in disk space and processing power would be enormous.

Another solution would be to load all non-employees data into one table and just create separate tables as above for the employee data. This approach will not cause as much duplication, and at first glance seems a better solution, but it is still fraught with difficulties. For instance, what happens if one of the non-employees is subsequently employed by the bank as an analyst who has access to the data? In particular, how do you deal with historical data that exists in the non-employee table? Do you re-create it all without the new employee’s data?

There are no easy answers to these questions, and it is important to understand the reasons for the restrictions, so that the most practical solution can be arrived at. Again, using the example above, if the rule that an employee may not see his or her own account information is based on not seeing a particular field in the records, you might ask whether it is strictly necessary to hold that field in the data warehouse. If the field is not really required, a better solution would be not to load the field in the first place. Even if the field is genuinely required, would it matter if all employee records were loaded with that field nulled out or overwritten with some fixed value?

Role Access Hierarchy

It is worth considering future requirements at this stage, particularly if it is likely that users will begin to access overlapping sets of data. Remember that the base data is likely to be the same for everyone, and it is only if users are completely restricted to aggregations that there is likely to be wide separation of access. If, in the future, drill¬down from aggregated to detailed data is likely to be required, or a larger user community is to be added, the chances of data access requirements overlapping will Increase.

If all the data is generally available and being accessed by every department, it is usually better to approach the problem from a role perspective, where roles such as analyst, manager or administrator are defined across departments. This allows security to be applied by function, with analysts having query access only, while administrators have permissions to move and possibly change data. This approach also allows access to data to be assigned to roles rather than people. Roles can then be built up from other roles. Which approach is better will depend on the specific set of security restrictions that need to be applied. In general, the tighter the security, the more person oriented the design. Also, the more complex the interdependencies, the more role oriented the design.

User classification for increased Security

As with the data, there are a number of ways that users can be classified. You can take a top-down company view, with users classified by department, section, group, and so on. Another possible classification is role based, with people grouped across departments based on their role. This approach would classify all analysts as one group, irrespective of the department they are in within the company. Once the users have been classified, you can design the users’ access around that classification.

To explore these concepts further let us consider a data warehouse where the users come from two different departments, Sales and Marketing. As mentioned above, it is possible to design security by a top-down company view, with access centered around the different departments. However, it is likely that within each department there will be further restrictions placed on users at different levels. This can lead to complex nested structures of access restrictions.

If each department genuinely accesses different data, then it is probably better to design the security access for each department separately. You may even need to set up security at the individual level if every analyst has completely different requirements. If this is the requirement, it may be worth considering the use of departmental data marts. The data marts can be separated from the data warehouse, and the security restrictions can be enforced separately on each data mart. This will allow the data warehouse itself to be designed without the restrictions, which in turn will help to avoid any problems with future requirements.