4.4 Statistical Masking
Statistical masking provides an extensive set of tools for mitigating potential risk in a given data presentation. If Step 3 of the Data Assessment for Public Release Procedure (Figure 5) determines that small numerators or small denominators in the dataset could put individuals at risk of being re-identified, then the dataset must be assessed to determine whether those small values, and their complementary values, need statistical masking. In performing the statistical masking, the data producer must consider what level of analysis may be sacrificed in order to produce a table with lower risk.
4.4.1 Suppressing Small Counts
One common way of masking data is to suppress cells that contain small counts by following these steps:
Suppress cells (e.g., count of members and services provided to members) <11 (excluding 0) when total score ≥13.
After the cell suppression (<11 excluding 0) is completed, complementary cell suppression is also required so that the suppressed cells cannot be re-identified. See “Complementary Cell Suppression” section 4.4.3.
Values of 0 should not be suppressed since a non-event cannot be identified.
Suppression is also required for financial data which can be associated with members or services provided to members <11 (excluding 0).
Suppression is required for all statistics derived from the suppressed cells (e.g., prevalence rates, percentages, and means).
An additional complementary cell needs to be suppressed if (a) OR (b) is true:
(a) all of the values suppressed in a specific group (row or column) are each ≤ 3 (that is, 1, 2, or 3, excluding 0)
(b) the sum of the values suppressed is less than 11
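The rules above can be sketched in code for a single group (one row or column). This is a minimal illustration, not an official implementation: the function names are hypothetical, and the choice of the smallest non-zero unsuppressed cell as the additional complementary cell is an assumption (it is one of the selection methods listed in section 4.4.3).

```python
THRESHOLD = 11  # counts from 1 to 10 are suppressed; 0 is never suppressed


def primary_suppress(counts):
    """Return the indexes of cells whose counts are in the range 1..10."""
    return {i for i, c in enumerate(counts) if 0 < c < THRESHOLD}


def needs_additional_complement(counts, suppressed):
    """Apply conditions (a) and (b) to the suppressed values of one group."""
    values = [counts[i] for i in suppressed]
    if not values:
        return False
    all_three_or_less = all(v <= 3 for v in values)  # condition (a)
    sum_below_threshold = sum(values) < THRESHOLD    # condition (b)
    return all_three_or_less or sum_below_threshold


def suppress_group(counts):
    """Return (primary, complementary) index sets for one row or column."""
    primary = primary_suppress(counts)
    complementary = set()
    if needs_additional_complement(counts, primary):
        # Assumption: pick the smallest non-zero cell that is still published.
        candidates = [i for i, c in enumerate(counts)
                      if i not in primary and c > 0]
        if candidates:
            complementary.add(min(candidates, key=lambda i: counts[i]))
    return primary, complementary


print(suppress_group([3, 2, 40, 25]))  # ({0, 1}, {3})
```

In the usage line, both suppressed values are ≤ 3, so condition (a) triggers and the cell with a count of 25 is added as a complementary cell. The sketch only covers the (a)/(b) trigger; it does not decide how the suppression reason is disclosed, which section 4.4.3 shows also matters.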
4.4.2 Other Masking Methods
Other masking methods may be applied when one of the following conditions is met:
a) Multiple variables. This most often occurs in a pivot table presentation or a query interface where a user may have occurrences of disease X, stratified by multiple variables, such as age, sex, race, and ethnicity.
b) Granular variables. The more granular the variable the smaller the potential numerator and denominator. This most commonly occurs with shortening the time period of reporting (weekly) or making the geography more specific (zip code or census tract). However, it can also occur when there are many categories for a variable. An example of this is aid codes in Medi-Cal where there are almost 200 aid codes.
c) Rare events. Examples include diseases such as hemophilia. Another example is mass trauma events such as a plane crash or multi-car accident.
In each of these cases, statistical masking may be addressed in a number of ways. For this reason, it is important to keep in mind the purpose of the reporting so that the method chosen for masking can still maximize the usefulness of the data provided. Choices for each case are highlighted below.
a) Multiple variables. Options include separating the table into multiple tables that limit the number of variables included in each table; decreasing the granularity of the variables included in the table; or suppression of counts <11. For example, if there are six variables of interest for study, but a table that cross-tabulates all six variables produces a large number of small cells, the data producer could consider producing several tables with fewer variables so that the risk score is <13. This is especially effective if there are very few analytic questions requiring a cross-tabulation of all six variables.
b) Granular variables. A common approach to this situation would be to decrease the granularity of the variables (although suppressing counts <11 is also an option). This is especially useful for variables with a large number of categories that can be easily aggregated to fewer categories while still maintaining much of their utility. Geographic variables such as state or county can often be recoded into regional categories that still serve the analytic needs of the data user. It is also the only table restructuring option for tables with only two or three variables which have limited opportunities for variable reduction.
c) Rare events. In these cases, it is challenging to suppress the value so that it cannot be combined with other public information to identify individuals. Additionally, with rare events, the variance of small numbers carries more significance. The suppression rules described above minimize the risk of re-identification in most cases. However, an expert should assess each dataset on a case-by-case basis and add rules wherever a risk of re-identification remains. Section 4.4.3 gives examples in which all of the above rules are followed; note, however, that if it is revealed which cells were suppressed under regular suppression (<11) rather than complementary suppression, then all of the suppressed cells can be re-identified.
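As an illustration of option b), decreasing granularity, the sketch below recodes a granular geographic variable into regions and checks whether any cell still falls below 11. The county names, counts, and region mapping are hypothetical and exist only for this example.

```python
from collections import defaultdict

# Hypothetical mapping and counts, for illustration only.
REGION_OF = {
    "County A": "North",
    "County B": "North",
    "County C": "South",
    "County D": "South",
}
county_counts = {"County A": 4, "County B": 9, "County C": 25, "County D": 7}


def aggregate(counts, mapping):
    """Sum counts over the coarser categories defined by `mapping`."""
    totals = defaultdict(int)
    for category, count in counts.items():
        totals[mapping[category]] += count
    return dict(totals)


def small_cells(counts, threshold=11):
    """Return the cells that would need suppression (1..threshold-1)."""
    return {k: v for k, v in counts.items() if 0 < v < threshold}


region_counts = aggregate(county_counts, REGION_OF)
print(small_cells(county_counts))  # {'County A': 4, 'County B': 9, 'County D': 7}
print(small_cells(region_counts))  # {}
```

After aggregation no cell requires suppression, at the cost of county-level detail. Whether that trade-off is acceptable depends on the purpose of the reporting.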
4.4.3 Complementary Cell Suppression
Complementary cells are those that must be suppressed to prevent someone from calculating the suppressed cell based on row or column totals in combination with other data in that row or column. For example:
Example 1: 10-10-10
Count of Medi-Cal Members by Age

| Age group | Count |
| --- | --- |
| A1 | 10 |
| A2 | 14 |
| A3 | 10 |
| A4 | 10 |
| A5 | 0 |
| A6 | 0 |
| A7 | 0 |
| A8 | 30 |
| Total | 74 |
In the above example, suppose we suppress the counts for cells A1, A3 and A4 (each with a value of 10) and reveal that this is due to regular suppression of cells <11. The unsuppressed cells sum to 44, so the three suppressed cells must sum to 74 - 44 = 30; because each is at most 10, anyone can deduce that each cell is exactly 10. To prevent this, we should either not specify that the three cells are <11 or also suppress a complementary cell, A2 (with a value of 14), so that the three cells cannot be identified.
Example 2: 10-9
Count of Medi-Cal Members by Age

| Age group | Count |
| --- | --- |
| A1 | 10 |
| A2 | 14 |
| A3 | 9 |
| A4 | 17 |
| A5 | 0 |
| A6 | 0 |
| A7 | 0 |
| A8 | 30 |
| Total | 80 |
In the above example, suppose we suppress the counts for A1 and A3 (with values of 10 and 9) and reveal that this is due to regular suppression of cells <11. The unsuppressed cells sum to 61, so the two suppressed cells must sum to 80 - 61 = 19; because each is less than 11, there are only 2 possible combinations: (A1=10, A3=9) or (A1=9, A3=10). To prevent this, we should either not specify that the two cells are <11 or also suppress a complementary cell (A2, with a value of 14), so that the two cells cannot be identified.
Example 3. When to suppress 0?
Counts and Percentages of Medi-Cal Members by County

| County | Count | Percentage |
| --- | --- | --- |
| XXX | 3 | 0.0 |
| YYY | 15 | 1.0 |
| ZZZ | 0 | 0.0 |
In this example, the percentage of 0.0 should not be suppressed for County ZZZ because it is based on a non-event. However, the percentage of 0.0 for County XXX needs to be suppressed because it is due to rounding of numbers. For example, if the denominator for the County XXX percentage is 7,500, a count of 4 would have a rounded percentage of 0.1. Therefore, it could be inferred that the count for County XXX is 1, 2, or 3 because a count of 3 is the highest value that would have a rounded percentage of 0.0 and counts of 0 are not suppressed. Consequently, summary statistics based on suppressed counts should not be reported even if the rounded value is 0 due to the potential for the information to be used for inference of suppressed values.
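The inference described in Example 3 can be reproduced with a short sketch. It lists the suppressed counts that are consistent with a published rounded percentage, using the denominator of 7,500 from the text. The function name is hypothetical, and the search starts at 1 because counts of 0 are published unsuppressed.

```python
def counts_matching(rounded_pct, denominator, decimals=1, max_count=10):
    """List suppressed counts (1..max_count) whose rounded percentage
    equals the published value."""
    return [
        c for c in range(1, max_count + 1)
        if round(100 * c / denominator, decimals) == rounded_pct
    ]


# County XXX: the count is suppressed but a percentage of 0.0 is published.
print(counts_matching(0.0, 7500))  # [1, 2, 3]
```

A published 0.0 narrows the suppressed count from ten possible values to three, which is why the derived percentage must be suppressed along with the count.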
Example 4. When does indication of complementary suppression lead to data re-identification?
Count of Medi-Cal Members by Age

| Age group | Count |
| --- | --- |
| A1 | 14 |
| A2 | 14 |
| A3 | 1 |
| A4 | 11 |
| A5 | 0 |
| A6 | 0 |
| A7 | 0 |
| A8 | 30 |
| Total | 70 |
In the above example, suppose we suppress the counts for A3 and A4 (with values of 1 and 11) and reveal that A3 is suppressed due to regular suppression and A4 due to complementary suppression. The unsuppressed cells sum to 58, so A3 and A4 must sum to 70 - 58 = 12. Because A3 is known to be between 1 and 10 and A4 is known to be at least 11, the only possibility is A3=1 and A4=11, and both cells are re-identified. In this case, we should not specify the nature of each suppression, so that the suppressed cells cannot be identified.
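The back-calculation in Examples 1, 2, and 4 can be checked by enumerating every combination of suppressed values that is consistent with the published total. This is a minimal sketch with a hypothetical helper name. It assumes that a cell disclosed as regular suppression holds a value from 1 to 10 and that a cell disclosed as complementary suppression holds a value of 11 or more.

```python
from itertools import product

REGULAR = range(1, 11)  # regular suppression: 1..10 (0 is never suppressed)


def consistent_values(total, published, ranges):
    """List every assignment of the suppressed cells that matches the total.

    `published` holds the unsuppressed counts; `ranges` gives the possible
    values of each suppressed cell, based on what has been disclosed.
    """
    remainder = total - sum(published)
    return [combo for combo in product(*ranges) if sum(combo) == remainder]


# Example 1: A1, A3 and A4 suppressed and disclosed as <11.
example1 = consistent_values(74, [14, 0, 0, 0, 30], [REGULAR] * 3)

# Example 2: A1 and A3 suppressed and disclosed as <11.
example2 = consistent_values(80, [14, 17, 0, 0, 0, 30], [REGULAR] * 2)

# Example 4: A3 disclosed as regular, A4 disclosed as complementary (11+).
example4 = consistent_values(70, [14, 14, 0, 0, 0, 30],
                             [REGULAR, range(11, 71)])

print(example1)  # [(10, 10, 10)]
print(example2)  # [(9, 10), (10, 9)]
print(example4)  # [(1, 11)]
```

When the list contains a single combination, the suppressed cells are fully re-identified. A short list, as in Example 2, still leaks most of the information.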
In these cases, it will be necessary to suppress small cells and perform complementary suppression to ensure that precise values of small cells cannot be calculated using the values of unsuppressed cells and marginal values. In the simplest case, this means ensuring that each column and row of a two-dimensional table has at least two suppressions. This ensures that the precise values of the suppressed cells cannot be calculated. Complementary suppressions are often selected using one of the methods listed below.
The ‘analytically least interesting’ level of a particular dimension. This is often ‘other’, or ‘I don’t know’.
The smallest cell available for complementary suppression. This is based on minimizing the ‘information loss’.
The cell most similar to the cell needing complementary suppression, such as adjacent age groups. This can produce complementary suppression that may be easier to interpret.
It is important to clearly designate which cells have been suppressed due to complementary suppression. Use a symbol to indicate that a cell has been suppressed. Identify any other cells (complementary cells) that could be used to calculate the small cell, and use a symbol to indicate that they have been suppressed as well. There are two ways to indicate cell suppression:
1. Suppression Symbols for Machine-Readable Version: The Open Data Portal requires submission of a machine-readable format. Therefore, the CalHHS Open Data Portal guidelines provide instructions on the table structure.
CalHHS Open Data Portal Small Cell Suppression Guidelines
Guidelines: Use an annotation field (column) in each data table that corresponds to records that have suppressed cells.
Small Cell Data Standard:
Cell is blank when the value is withheld for a reason given in the annotation field
“0” in cell if value is 0
Data Dictionary / Metadata indicates small cell method used (<11, etc.)
Annotation field table:

| Annotation value | Meaning |
| --- | --- |
| 0 or blank | No annotation or blank |
| 1 | Cell suppressed for small numbers |
| 2 | Cell suppressed for complementary cell |
| 3 | No data is available |
| 4 | Statistically unstable value |
| 5 | Incomplete data |
Considerations:
Use metadata and documentation to inform users.
Consider highlighting and drawing attention to annotated fields.
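A machine-readable table following the standard above might be produced as in the sketch below. The column names (`age_group`, `count`, `annotation`) and the helper function are hypothetical; only the annotation codes and the blank-versus-"0" convention come from the guidelines.

```python
import csv
import io

SUPPRESSED_CODES = {1, 2}  # 1 = small numbers, 2 = complementary cell


def to_csv(records):
    """Write (label, count, annotation_code) records as CSV text.

    Suppressed cells are left blank, true zeros are written as "0", and an
    annotation code of 0 is written as a blank annotation.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["age_group", "count", "annotation"])  # hypothetical names
    for label, count, code in records:
        value = "" if code in SUPPRESSED_CODES else str(count)
        writer.writerow([label, value, code if code else ""])
    return buffer.getvalue()


records = [("A1", 10, 1), ("A2", 14, 2), ("A5", 0, 0), ("A8", 30, 0)]
print(to_csv(records))
```

The output keeps the suppressed counts out of the file entirely, so the published data cannot leak them, while the annotation column tells users why each cell is blank.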
2. Suppression Symbols for Non-Machine-Readable Version: Departments may want to present data in a non-machine-readable format for usability of the data. In this case, use the symbols "S," "*," or similar symbols for counts less than 11. Use symbols “C”, “**”, or similar symbols for complementary cells. When suppressing values, it is recommended to use the following footnote to indicate the suppression.
Values or cells marked with “S”, “*”, or a similar symbol would have the following footnote or note:
“S” represents counts that are less than 11, which are not shown in accordance with the CalHHS DDG Edition 2.0.
Values or cells marked with “C”, “**”, or a similar symbol to indicate complementary cells would have the following footnote or note: “C” represents counts for complementary data that are not shown in accordance with the CalHHS DDG Edition 2.0.
4.4.4 Masking a Three Category Variable (e.g. Intersex Status)
A special situation may occur when complementary cell suppression must be applied to a variable with three categories. This occurs when “Intersex” is an option for the Sex variable (Male, Female, Intersex) but can occur in any situation where a variable has two large categories and one small category (for example, if the options for Gender Identity included Man, Woman, and Nonbinary).
As an example:
| Sex | Count |
| --- | --- |
| Male | 545 |
| Female | 545 |
| Intersex | 10 |
| Total | 1100 |
If the DDG risk assessment concludes that counts less than 11 must be suppressed, then complementary suppression must be applied to either the Male or Female categories so that the Intersex count of 10 cannot be back-calculated. The problem is that suppression of either category would deprive the public of vital information and hence be publicly unacceptable. Other masking methods (such as combining Intersex with either the Male or Female categories; or reassigning the Intersex individuals to both categories equally) will likely also be unacceptable as they would be viewed as erasure of the Intersex identity.
In these instances, the recommendation is to not show counts at all, but only percentages rounded to the nearest whole percent. Such a “rounding error” will effectively mask the smallest category without requiring complementary suppression.
Here are the raw percentages for the above example:
| Sex | Percentage |
| --- | --- |
| Male | 49.545% |
| Female | 49.545% |
| Intersex | 0.909% |
| Total | 100% |
Here are the rounded percentages:
| Sex | Rounded percentage |
| --- | --- |
| Male | 50% |
| Female | 50% |
| Intersex | <1% |
| Total | 100% |
Once the percentages are rounded, one cannot back-calculate to obtain the Intersex count, even if one knows the total is 1100. One would only know that the count is less than 11, which is the same knowledge as with standard suppression practices. It is recommended that a footnote is provided to explain why this was done, for example:
“To protect individual privacy, counts are not provided for this table and percentages have been rounded to the nearest whole number. Due to rounding, totals for all categories may not add up to 100%.”
A few other points regarding this method:
Ensure that counts for the variable are not provided anywhere else in the document and are not available in the public domain. In the above example, one needs to make sure that the counts of 545 males and 545 females cannot be obtained elsewhere (for example, in a table that cross-tabulates sex by race).
This method is only effective if the total number of individuals is 1100 or above. Below this threshold, one can back-calculate counts of less than 11 even when percentages are rounded.
This method is only needed for a three-category variable. In the above example, if Intersex were presented as a separate “yes/no” variable and no cross-tabulation with Sex was provided, then complementary suppression would not be required, and this issue would not arise.
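The rounding method and its threshold of 1100 can be illustrated with a short sketch. It assumes, as in the rounded table above, that any non-zero share below 1% is displayed as "<1%" and that all other shares are rounded to the nearest whole percent. The function names are hypothetical.

```python
def display_percent(count, total):
    """Format a share as a whole percent, showing small shares as '<1%'."""
    if 0 < 100 * count < total:  # non-zero share strictly below 1%
        return "<1%"
    return f"{round(100 * count / total)}%"


def largest_count_shown_as_lt1(total):
    """Largest count that would still be displayed as '<1%'."""
    return (total - 1) // 100


counts = {"Male": 545, "Female": 545, "Intersex": 10}
total = sum(counts.values())
shown = {name: display_percent(value, total) for name, value in counts.items()}

print(shown)                             # {'Male': '50%', 'Female': '50%', 'Intersex': '<1%'}
print(largest_count_shown_as_lt1(1100))  # 10
print(largest_count_shown_as_lt1(500))   # 4
```

With a total of 1100, "<1%" reveals only that the count is at most 10, the same knowledge as standard suppression. With a total of 500, it reveals that the count is at most 4, which is why the method is not effective below the threshold.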
4.4.5 Balancing Privacy and Equity Goals
Data de-identification, which ensures individuals are not made identifiable by public facing data, and the use of masking to preserve privacy and confidentiality, exist alongside the need to produce high-quality analyses focused on equity and disproportionality. Underrepresented or marginalized communities are more likely to have their data masked in data de-identification efforts due to smaller cell sizes, which can result in a lack of information that can be safely made publicly available about those populations. Section 4.4.2 outlines suggestions on how to preserve information while also preserving confidentiality. Intuitive ways to preserve information while also preserving confidentiality include presenting data as percentages, aggregating smaller groups into reportable-sized ones, or multi-year reporting that relates marginalized or underreported groups to an average. In addition, the DDG Peer Review Team remains committed to evaluating ways to continue to balance these goals through continued training and other innovative methodologies on an ongoing basis, including those listed in Appendix E.