Some more tips on analyzing incoming source data (referred to as IFS in the telco project I handled). Data warehousing in a nutshell: garbage in, garbage out. From a quality standpoint, throw away all the garbage before any app moves to production. The earlier an issue is detected, the fewer the headaches and the cheaper the fix. This again is part of the KT manual I created.
1. PK/Integrity Checks
This is the most common issue. If only one PK (natural key) is stated in the IFS specs, then the IFS should not have two entries with the same natural key; this would result in an Oracle constraint error during loading. (A quick detection query is sketched after the course-of-action list below.)
Product Code | Desc    | Product Type
1            | Class A | 001
1            | Class B | 002
Possible course of action:
1) It could be a source issue. Ask for a regen.
2) Recheck the data model. The natural key may not be defined correctly; there may be a need to add another column (e.g., Product Type) to the natural key so that it uniquely determines the PK.
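A quick way to catch this before the load fails is a duplicate check on the staged file. This is a minimal sketch only; the staging table STG_PRODUCT and column PRODUCT_CODE are illustrative names, not the project's actual objects:

  -- List natural-key values that appear more than once in the staged IFS
  SELECT product_code, COUNT(*) AS dup_count
  FROM   stg_product            -- illustrative staging table for the Code IFS
  GROUP  BY product_code
  HAVING COUNT(*) > 1;

Any row returned is a natural-key violation that would trip the Oracle unique constraint during loading.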
2. Not null columns.
Any column defined as a primary key column or as NOT NULL in the IFS specs should never be null in the actual IFS.
Product Code | Desc
1            | Class A
(null)       | Class B
Possible course of action: ask for a regen.
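The same staging table can be profiled for nulls before loading. Again a sketch with illustrative names; note that Oracle treats an empty string as NULL, so this also catches blank values:

  -- Count rows where a mandatory column arrived null (or empty)
  SELECT COUNT(*) AS null_key_rows
  FROM   stg_product
  WHERE  product_code IS NULL;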
3. Unknown codes in Fact Tables
Logically, every code submitted in a Fact file should be present in the corresponding Code table; otherwise it results in a lookup failure (a detection query is sketched at the end of this item):
Fact IFS:
Month  | Product Code | Amount
200910 | 1            | 500
200910 | 3            | 400
Code IFS:
Product Code | Desc
1            | Class A
2            | Class B
Possible course of action:
1. If the unknown code in the Fact IFS is valid (here, product code = 3), then the Code IFS should have a regen.
2. If the unknown code in the Fact IFS is invalid, then the Fact IFS should have a regen.
3. Or the customer can decide that it is OK to keep it as is.
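Unknown codes can be listed with an anti-join between the two staged files. A minimal sketch; STG_FACT and STG_PRODUCT_CODE are illustrative names:

  -- Fact codes that have no match in the Code table
  SELECT DISTINCT f.product_code
  FROM   stg_fact f
  WHERE  NOT EXISTS (SELECT 1
                     FROM   stg_product_code c
                     WHERE  c.product_code = f.product_code);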
4. Different Cases/Forms that cause mismatch during lookups
As pointed out in #3 above, logically, every code submitted in a Fact file should be present in the Code table, or else it results in a lookup failure. However, there are times when manipulation is needed on the ETL side to make the join succeed, e.g. with functions like UPPER, LOWER, TO_CHAR, LTRIM, RTRIM, SUBSTR, ABS, etc.
Fact IFS:
Month  | Product Code | Amount
200910 | 101a         | 500
200910 | 000203B      | 400
Code IFS:
Product Code | Desc
101A         | Class A
203B         | Class B
Looking closely, the values do exist, just not in exactly the same form. This should tell you that these are not really lookup failures.
Possible course of action:
1. Agree with the client on who should change what, and which transformations should be handled in the ETL on their side or ours (one possible normalization is sketched below). Beyond what has been agreed, whatever still fails the lookup is considered a 'valid' failure. Revisit the logic when more data has been loaded, and document the agreement for future reference.
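For the sample data above, uppercasing both sides and stripping leading zeros would be enough to make the join succeed. This is a sketch only; the exact normalization rule must come from the agreement with the client, and the table/column names are illustrative:

  -- Normalize case and leading zeros on both sides of the lookup
  SELECT f.month_key,
         f.product_code AS fact_code,
         c.product_code AS matched_code
  FROM   stg_fact f
  LEFT JOIN stg_product_code c
         ON UPPER(LTRIM(f.product_code, '0')) = UPPER(LTRIM(c.product_code, '0'));

Rows where matched_code comes back null are the genuine lookup failures.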
5. Volume of Data
The volume of data should be more or less consistent. If a fact file contains 3M records on a daily basis, then it shouldn't suddenly fall too low, e.g. to 100k records. For additional info, data volume is tracked for several other purposes. This is usually checked in SIT.
a. One is tracking performance: comparing ETL processing times across IFS files of different data volumes.
b. Another is tracking compression (in comparison to aggregated tables), to check whether compression targets are met or not. If targets are not met, then there is a need to revisit the design; some columns might not be suitable to be part of the GROUP BY clause.
Possible course of action:
If an IFS file suddenly has too low a rowcount, the client has to be informed so they can check their source. This might need an IFS regen from M1.
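If load rowcounts are kept in an audit table, an abnormal drop can be flagged automatically. A sketch assuming a hypothetical audit table ETL_LOAD_AUDIT (file_name, load_date, row_count) and an arbitrary 50% threshold:

  -- Flag loads whose rowcount falls below half the trailing 30-day average
  SELECT a.file_name, a.load_date, a.row_count
  FROM   etl_load_audit a       -- hypothetical audit table, not a project object
  WHERE  a.row_count < 0.5 * (SELECT AVG(b.row_count)
                              FROM   etl_load_audit b
                              WHERE  b.file_name  = a.file_name
                              AND    b.load_date >= a.load_date - 30
                              AND    b.load_date <  a.load_date);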
6. Other Common IFS Issues:
• Duplicates - the exact same record is sent twice (see the sketch after this list).
• Default values. Default values as indicated in the IFS specs are not followed. There are also cases where the agreed values are Y or N, but the actual IFS shows null values, X values, etc.
• Missing footer record. The extraction program on the M1 side has to run to completion so that the trailer record is populated.
• Column length exceeds the defined width.
• Missing or additional columns in the IFS.
• Extra spaces or special characters in some column values.
• Invalid filename.
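Exact duplicates (the first bullet) can be caught by grouping on every column of the staged file. A sketch against the illustrative STG_FACT table from earlier:

  -- Records where the entire row repeats
  SELECT month_key, product_code, amount, COUNT(*) AS copies
  FROM   stg_fact
  GROUP  BY month_key, product_code, amount
  HAVING COUNT(*) > 1;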