PII detection
Sidra's PII feature evaluates Personally Identifiable Information (PII) as it is ingested into the platform, helping to ensure sensitive data is properly managed and governed. According to the General Data Protection Regulation (GDPR) and the U.S. government, PII is any data that can be used to distinguish or trace an individual's identity, such as name, SSN, or biometric information, either alone or combined with other identifiers such as date of birth or place of birth (quasi-identifiers or linkable information).
PII Detection is only available for Data Intake Processes. This option is not available for data intakes via landing zone, such as the ingestion of CSV files.
When does PII detection happen?
PII detection happens at Data Intake Process level. When the trigger configured for a Data Intake Process fires, an orchestrator pipeline coordinates the metadata and data extraction execution. For PII detection, an additional step after data extraction checks whether the PII detection process needs to run on any of the Entities associated with that Data Intake Process.
Two settings control whether PII is detected for a single Entity or for all Entities assigned to a DIP:
- One at Data Intake Process (DIP) level, in the `additionalProperties` of the DIP. The fields are called `piiDetectionEnabled` and `language`. `piiDetectionEnabled` is a flag that configures whether PII is detected for that DIP. `language` configures the language the Presidio analyzer uses to analyze the field.
- Another at Entity level, in the `additionalProperties` field of the Entity.
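As an illustration of the two settings, the sketch below builds hypothetical `additionalProperties` payloads for a DIP and for one Entity. Only the key names `piiDetectionEnabled` and `language` come from this page; the value formats (a boolean flag and an `en`/`es` language code) are assumptions and may differ from what Sidra expects.

```python
import json

# Hypothetical DIP-level additionalProperties.
# Assumption: the flag is a boolean and the language is a two-letter code.
dip_additional_properties = {
    "piiDetectionEnabled": True,
    "language": "en",
}

# Hypothetical Entity-level additionalProperties.
# This Entity opts out of PII detection and leaves the language empty.
entity_additional_properties = {
    "piiDetectionEnabled": False,
}

print(json.dumps(dip_additional_properties, indent=2))
print(json.dumps(entity_additional_properties, indent=2))
```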
From these two settings, the Orchestrator first asks the Sidra API for the effective set of Assets on which the PII detection script must run (see "Details about the PII detection and classification of PII" below).
This endpoint calculates the effective settings as follows:
- If `piiDetectionEnabled` is empty at Entity level, the setting defaults to the DIP-level value.
- If `piiDetectionEnabled` is set to `false` at Entity level, PII is not detected for that Entity, whatever the value at DIP level.
- If `piiDetectionEnabled` is set to `true` at Entity level, PII is detected for that Entity, whatever the value at DIP level.
- If `piiDetectionEnabled` is empty at DIP level, the setting defaults to `false`, so PII is not detected for the Assets of the Entities in that DIP.
- If `piiDetectionEnabled` is empty at Entity level and set to `false` at DIP level, PII is not detected for the Assets of that Entity.
- If `piiDetectionEnabled` is empty at Entity level and set to `true` at DIP level, PII is detected for all Entities in that DIP that leave the setting empty.
- If `piiDetectionEnabled` is empty at both DIP and Entity level, PII is not detected for any of the Assets.
The endpoint will return the list of effective Asset IDs for which we need to detect PII.
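The resolution rules above can be sketched as a small function. This is an illustration of the described precedence, not Sidra's actual endpoint code: the function names, the use of `None` for an empty setting, and the `assetIds` field are all hypothetical.

```python
from typing import List, Mapping, Optional, Sequence


def effective_pii_enabled(entity_value: Optional[bool],
                          dip_value: Optional[bool]) -> bool:
    """Resolve the effective flag; None stands for an empty setting."""
    if entity_value is not None:
        return entity_value   # an Entity-level value always wins
    if dip_value is not None:
        return dip_value      # otherwise fall back to the DIP-level value
    return False              # empty at both levels: no PII detection


def effective_asset_ids(dip_value: Optional[bool],
                        entities: Sequence[Mapping]) -> List[int]:
    """Collect Asset IDs of Entities whose effective flag is true.

    Each entity is a hypothetical mapping with an optional
    'piiDetectionEnabled' key and an 'assetIds' list.
    """
    asset_ids: List[int] = []
    for entity in entities:
        if effective_pii_enabled(entity.get("piiDetectionEnabled"), dip_value):
            asset_ids.extend(entity["assetIds"])
    return asset_ids
```

Under these assumptions, an empty list from `effective_asset_ids` corresponds to the case where there is nothing to scan.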
Sidra implements PII detection via a notebook called `piidetection.py`. The orchestrator ADF pipeline does not call the PII detection notebook if there is no Asset for which to detect PII.
A similar set of rules is used to calculate the effective language for PII detection:
- If `language` is not specified at either level (DIP or Entity), the default value is English.
- If `language` is not specified at Entity level, the setting defaults to the DIP-level value.
- If `language` is empty at both Entity level and DIP level, the language defaults to English.
- If `language` is empty at Entity level and is Spanish at DIP level, the language is Spanish.
- If `language` is specified at Entity level, that value is used, regardless of the value at DIP level.
English and Spanish are installed by default when the cluster is created. No other languages are supported at this time.
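The language rules follow the same precedence and can be sketched in a few lines. The two-letter codes `en` and `es`, the function name, and the choice to raise an error for an unsupported language are assumptions made for this illustration; this page does not state how Sidra encodes the language value or handles unsupported ones.

```python
from typing import Optional

DEFAULT_LANGUAGE = "en"              # assumption: English as a two-letter code
SUPPORTED_LANGUAGES = {"en", "es"}   # English and Spanish only


def effective_language(entity_language: Optional[str],
                       dip_language: Optional[str]) -> str:
    """Resolve the language: Entity first, then DIP, then the default."""
    # None or an empty string both count as "not specified" here.
    language = entity_language or dip_language or DEFAULT_LANGUAGE
    if language not in SUPPORTED_LANGUAGES:
        # Assumption: fail early rather than silently fall back.
        raise ValueError(f"Unsupported PII detection language: {language!r}")
    return language
```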
Additionally, PII detection in Sidra relies on an Application Insights service that tracks possible errors during the execution of the PII detection notebook, enabling notifications for both successful and failed executions.
Details about the PII detection and classification of PII
For PII detection, Sidra uses a Microsoft library called Presidio. Presidio, as stated by the project owners, helps to ensure sensitive data is properly managed and governed. It provides fast identification and anonymization modules for private entities in text such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data and more.
Sidra uses only Presidio's functionality for detecting and classifying PII, which is described below.
The following classification categories are detected by Presidio when run on any Entity or Attribute:
- Credit card
- Crypto (a crypto wallet number)
- Date time
- Domain name
- Email address
- IBAN code
- IP address
- NRP
- Location
- Person
- Phone number
- Medical license
- US bank number
- US driver license
- US ITIN (Individual Taxpayer Identification Number)
- US passport
- US SSN (Social Security Number)
- UK NHS
- NIF
- FIN/NRIC (Singapore National Registration Identification Card)
- AU ABN (Australian Business Number)
- AU ACN (Australian Company Number)
- AU TFN (Australian Tax File Number)
- AU medicare (Australian medicare number)
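To give an idea of how a column scan with Presidio might look, the sketch below collects the entity types that an analyzer reports for the values of one column. It is not Sidra's notebook code: the function name and the `min_score` threshold are hypothetical. The only Presidio behavior relied on is that `AnalyzerEngine.analyze(text=..., language=...)` returns results exposing `entity_type` and `score`.

```python
from typing import Any, Iterable, Set


def detect_pii_types(values: Iterable[Any],
                     analyzer: Any,
                     language: str = "en",
                     min_score: float = 0.5) -> Set[str]:
    """Return the set of entity types found in a column's values.

    `analyzer` is expected to behave like presidio_analyzer.AnalyzerEngine.
    `min_score` is a hypothetical confidence cut-off for this sketch.
    """
    found: Set[str] = set()
    for value in values:
        if value is None or str(value) == "":
            continue  # skip empty cells
        for result in analyzer.analyze(text=str(value), language=language):
            if result.score >= min_score:
                found.add(result.entity_type)
    return found


# Usage with the real library (requires the presidio-analyzer package
# and a spaCy language model to be installed):
#
#   from presidio_analyzer import AnalyzerEngine
#   analyzer = AnalyzerEngine()
#   detect_pii_types(["My phone number is 212-555-5555"], analyzer)
```

Passing the analyzer in as a parameter keeps the function easy to exercise with a stub when Presidio is not installed.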
Metadata tagging process for PII information
During this column-by-column scan of Attributes, tags (of type Autogen in Sidra) are assigned when PII is found. Two types of tags are assigned to Attributes:
- A tag that flags whether an Attribute is classified as PII (tag = `PII`) or not (no tag).
- A tag or set of tags specifying the type of PII detected, e.g. `Person` or `Phone Number`.
The Entity to which each Attribute belongs is assigned this same tag (`PII` if PII is detected, no tag otherwise).
Also, at the Entity level, an effective union of the classification tags of underlying Attributes is generated and assigned as tags. For example:
- If an Entity has only two Attributes, one classified as Person and the other as Phone Number, then the parent Entity will have these tags: PII, Person and Phone Number.
At the Provider level:
- The process only adds the flag that indicates whether the underlying Entities have PII or not.
- No PII type classification tags are applied at Provider level.
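The tag roll-up from Attributes to Entity to Provider can be sketched as set operations. The function names are hypothetical and the code only mirrors the rules described above; how Sidra actually stores and assigns Autogen tags is not covered here.

```python
from typing import Iterable, Set

PII_TAG = "PII"


def attribute_tags(pii_types: Iterable[str]) -> Set[str]:
    """Tags for one Attribute: the PII flag plus each detected type."""
    types = set(pii_types)
    return {PII_TAG} | types if types else set()


def entity_tags(attribute_tag_sets: Iterable[Set[str]]) -> Set[str]:
    """Tags for an Entity: the union of its Attributes' tags."""
    return set().union(*attribute_tag_sets)


def provider_tags(entity_tag_sets: Iterable[Set[str]]) -> Set[str]:
    """Tags for a Provider: only the PII flag, no type tags."""
    has_pii = any(PII_TAG in tags for tags in entity_tag_sets)
    return {PII_TAG} if has_pii else set()


# The two-Attribute example from the text.
name_attr = attribute_tags(["Person"])
phone_attr = attribute_tags(["Phone Number"])
entity = entity_tags([name_attr, phone_attr])
provider = provider_tags([entity])
```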