In the previous challenge we created a tool for John Hancock that scans input CSV files for sensitive data and replaces it with placeholders in the output. In this first-to-finish challenge you will update and fix the winning code (see the challenge forum for the code and sample data). The following changes are in scope:
- Currently, the values in the context column of the output log CSV file show pieces of records with the sensitive data already replaced by placeholders. Change this so that the context column shows the corresponding piece of the original record.
- The code currently implements two rules for replacing sensitive information: CREDIT_CARD and SENSITIVE_SEQUENCE. It turns out (try running the tool on the accounts sample CSV provided in the forum) that when a sequence is matched as SENSITIVE_SEQUENCE, and is correctly reported as such in the log file, the matched information in the main output CSV file is still replaced by the credit card placeholder. This should be fixed so that the placeholder always corresponds to the rule that matched.
- We want to add a new rule that matches and replaces account and routing numbers. The example CSV file provided in the forum contains some account numbers (with all digits replaced by zeros, but the length unchanged). The client says, however, that there is no specific length or format for these numbers, so any long sequence of digits (with optional delimiters) can be an account or routing number. This overlaps somewhat with the credit card and unknown sensitive sequence rules, so we want to apply the following rules, in order:
  - Consider the following symbols valid delimiters: dashes, underscores, dots, commas, colons, semicolons, parentheses, and whitespace.
  - If a sequence of digits, optionally separated by delimiters, looks like a valid credit card number (the current CREDIT_CARD rule checks the number length and the parts of the number that are fixed for the most popular card issuers), replace it as a credit card number, following the existing rule.
  - Otherwise, if we have a long sequence of digits (say, six or more for now), optionally separated by delimiters, and the original record text contains a keyword such as account, routing, etc. near the number (say, within 25-50 symbols of it), treat it as an account number, routing number, etc. and replace it with the corresponding placeholder, like ::ACCOUNT_NUMBER::, ::ROUTING_NUMBER::, etc. The exact keywords and corresponding placeholders for this rule should be exposed in the config file. Note that in this case the context value written to the log must include the account/routing keyword and show some more text around it: the current rule for context generation shows 25 extra symbols on each side of the number; for this rule we want to show 25 symbols on each side of the span covering both the keyword and the number, plus all text between the keyword and the number.
  - Otherwise, treat the sequence of digits as an unknown sensitive number and replace it using the current sensitive number rule.
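
The rule order above could be sketched roughly as follows. This is only an illustration in Python, not the winning code: the names `KEYWORD_PLACEHOLDERS`, `looks_like_credit_card`, `nearest_keyword` and `scrub` are hypothetical, the `::CREDIT_CARD::` and `::SENSITIVE_SEQUENCE::` placeholder strings are assumed, the credit card check is a simplified stand-in for the existing CREDIT_CARD rule, and the six-digit threshold is assumed to apply to the fallback rule as well. The context is always cut from the original text, and the placeholder always comes from the rule that matched.

```python
import re

# Hypothetical config values; the spec asks for keywords and placeholders
# to be exposed in the config file.
KEYWORD_PLACEHOLDERS = {
    "account": "::ACCOUNT_NUMBER::",
    "routing": "::ROUTING_NUMBER::",
}
MIN_DIGITS = 6       # "six or more for now"
KEYWORD_WINDOW = 50  # how far from the number a keyword may appear
CONTEXT_PAD = 25     # extra symbols shown on each side in the log context

# Dashes, underscores, dots, commas, colons, semicolons, parentheses, whitespace.
DELIMS = r"\-_.,:;()\s"
# A digit, optionally followed by more digits separated by delimiters.
CANDIDATE = re.compile(rf"\d(?:[{DELIMS}]*\d)*")


def looks_like_credit_card(digits):
    """Simplified stand-in for the existing CREDIT_CARD rule (assumption)."""
    return len(digits) == 16 and digits[0] in "45"


def nearest_keyword(text, start, end):
    """Return (placeholder, kw_start, kw_end) for the closest keyword, or None."""
    lo = max(0, start - KEYWORD_WINDOW)
    hi = min(len(text), end + KEYWORD_WINDOW)
    best = None
    for keyword, placeholder in KEYWORD_PLACEHOLDERS.items():
        pattern = re.compile(re.escape(keyword), re.IGNORECASE)
        for m in pattern.finditer(text, lo, hi):
            # Keywords are letters, so they never overlap the digit sequence.
            distance = start - m.end() if m.end() <= start else m.start() - end
            if best is None or distance < best[0]:
                best = (distance, placeholder, m.start(), m.end())
    return None if best is None else best[1:]


def scrub(text):
    """Return (scrubbed_text, log_entries) for one record's text."""
    out, log, pos = [], [], 0
    for m in CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if len(digits) < MIN_DIGITS:
            continue
        c_start, c_end = m.start(), m.end()
        keyword = nearest_keyword(text, m.start(), m.end())
        if looks_like_credit_card(digits):
            placeholder = "::CREDIT_CARD::"
        elif keyword is not None:
            placeholder, k_start, k_end = keyword
            # Widen the context span so it covers both keyword and number.
            c_start, c_end = min(c_start, k_start), max(c_end, k_end)
        else:
            placeholder = "::SENSITIVE_SEQUENCE::"
        out.append(text[pos:m.start()])
        out.append(placeholder)  # same rule as reported in the log
        pos = m.end()
        log.append({
            "rule": placeholder.strip(":"),
            "value": m.group(),
            # Context is sliced from the ORIGINAL text, not the scrubbed one.
            "context": text[max(0, c_start - CONTEXT_PAD):c_end + CONTEXT_PAD],
        })
    out.append(text[pos:])
    return "".join(out), log
```

For example, `scrub("Please debit account no. 0000-0000-00 today")` would replace the number with `::ACCOUNT_NUMBER::` and log a context that still contains both the word "account" and the original digits, while a 16-digit number starting with 4 would be reported and replaced as CREDIT_CARD even if an account keyword is nearby.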