Thursday, December 17, 2009

Last Words...

This summarizes what I've been doing in IT. Of all the people I worked with, I would say this is definitely the best group, everyone is technically capable and work ethics is superb. (I wont talk about process-oriented or management tasks: that is given).

And what is more awesome is to receive praises from these experts - similar to hitting 500K in Bejeweled Blitz hehe. During the last day (btw I'm proud to say i didnt cry!), one of the leads commented he has no idea how I managed to do these things when I was never in a single BRD (business requirements discovery) session and I also dont do ETL/OLAP development. I said I was just guessing from what I see. 'but your guesses are always correct'. So fun. :D (thank GOd I probably didnt have much booboos).

So that's it, I'm leaving this behind to do something else. Scary. Back to 0. Still the same goal: I want to retire!!! :p

Funny that I actually took time and managed to write down something related to work. DWH/SQL/data - whatever it is, it has become a part of me for the past 6.5years. I might not be vocal or expressive of things I appreciate. For the last time, I give credit when credit is due. Thanks for keeping me company and making me happy.

P.S. The time I speak the truth might be the time I bade goodbye. *sobs*

Data Checks (2 of 2)

Some more tips on analyzing incoming source data (referred as IFS in the telco project that I handled). Datawarehouse in a Nutshell: garbage in is garbage out. From a quality standpoint, throw away all the garbages before any app moves to production. The earlier it is detected, the less headaches, the cheaper. This again is part of the KT manual I created.

1. Most common issue is PK/integrity checks.
If there’s only one PK (natural key) stated in IFS specs, then IFS should not have two entries with the same natural key, this would result in Oracle constraint error during loading.
Product Code Desc Product Type
1 Class A 001
1 Class B 002

Possible course of action:
1) It could be source issue. Ask for Regen
2) Recheck data model. Maybe there’s a need to check whether natural key is defined correctly. There’s a possibility to explore addition of another new column, (eg. Product Type) which should be part of natural key to determine PK.


2. Not null columns.
Whatever primary columns or columns defined as not null in IFS specs, these columns should not be null in IFS.
Product Code Desc
1 Class A
Class B

Possible course of action: Ask for Regen


3. Unknown codes in Fact Tables
Logically, whatever codes submitted in Fact files should be present in Code tables, or else this would result to lookup failure:

Fact IFS:
Month Product Code Amount
200910 1 500
200910 3 400

Code IFS
Product Code Desc
1 Class A
2 Class B

Possible course of action:
1. If unknown code in Fact IFS is valid (product code = 3), then Code IFS should have regen.
2. If unknown code in Fact IFS is invalid, then Fact IFS should have regen.
3. Or customer can decide that it is ok to keep it as is.


4. Different Cases/Forms that cause mismatch during lookups
As pointed in above point#3, logically, whatever codes submitted in Fact files should be present in Code tables, or else this would resolve to lookup failure. However there are times that manipulation is needed in ETL side to make the join to be successful. Eg: functions like: UPPER, LOWER, TO_CHAR, LTRIM, RTRIM, SUBSTR, ABS, etc.

Fact IFS:
Month Product Code Amount
200910 101a 500
200910 000203B 400

Code IFS
Product Code Desc
101A Class A
203B Class B

Closely looking at the values, the values do exist but not exactly in the same form. This should tell you that these are not really lookup failures.

Possible course of action:
1. Agree with client who should change which and what should be handled in ETL in their side or our side. Other than the agreed things, then whatever that still fails lookup is considered ‘valid’. Revisit the logic when more data has been loaded. Document such to have reference in the future.


5. Volume of Data
The volume of data should more or less be consistent. If a fact file contains 3M records on a daily basis, then it shouldn’t fall too low, eg 100k records. For additional info, data volume is tracked for several other purposes. This is usually checked in SIT.

a. One is for tracking performance: comparing ETL processing time against different IFS files having different data volume.

b. Another purpose is for is tracking compression (in comparison to aggregated tables), this is to check whether compression targets are met or not. If targets are not met, then there is a need to revisit design. Some columns might not be suitable to be part of the group by clause.

Possible course of action:
If IFS file suddenly has too low rowcount, it has to be informed to client to check from their source. This might need IFS regen from M1.


6. Other Common IFS Issues:
• Duplicates - same entry of exactly same records are sent twice.

• Default values. Default values as indicated in IFS specs are not followed. There are also cases where agreed values are Y or N, but in actual IFS, it is showing null values, X values, etc.

• Missing footer record. ETL program from M1 side has to be completed, and thus trailer record will be populated.

• Column length too long.

• Missing or Additional column in IFS

• Extra spaces for data some columns in IFS, special characters.

• Invalid filename

Data Checks (1 of 2)

Sharing some techniques. I drafted this one time when I was asked to do some 'knowledge sharing'. It's actually just common sense and some room for imagination to cater for 'intelligent assumptions'. :)

Scenario: Amounts do not tally.

1) Open the report in UAT instance. If specific account number or cust is given, search for that number. Replicate the issue for those accounts/CIFs raised, you can bring in other fields into the report (eg Sector Code, Sub Product Type, etc) if you need more information.

2) Check how the amount is calculated. The amount field can be an OLAP variable, look for the specific table.column in ETL that is used and do the calculation manually. If it’s a direct fetch from EDW, then query the database.

3) Next is to query the source itself. If it’s a lookup field from another instance, then look for specific join conditions in actual ETL mapping. If let’s say all your amounts are null but in Raw has value, then something is wrong in EDW. Either it’s an ETL lookup issue related to data, or requirement itself is wrong.

Eg: Issues:
- All debit fields are null because debit stream lookup to VBM in ETL in incorrect
- Some FD Credit balances are not matching and consistently lower in OLAP than in Raw, because VBM lookup in ETL has a component for renewal counters, and this field appears differently in VBM and GFDR and has to be padded with 0.

4) It could also happen that for some of the accounts match, some do not match, and there is no data format (trimming, padding) issue or ETL lookup error. This would mean that some of the filters (in either ETL side or OLAP side) are not proper

Eg: We had issues wherein we actually have to modify the filters, upon investigation that something is consistently wrong in our side. Sample issues:
BO: add filter to include Accounts Closed within Current Month. Remove filter on Receipt Status is Closed
ETL: remove filter on sub product type = GT.
Net, it can happen that requirement itself is incorrect. There are times when we cannot fully rely on what is stated in signed off mapping specs/report specs.

5) Another way is the comparison of total number of accounts. In this project, we normally go by account. Get the total number of accounts in OLAP vs the total number of accounts in Raw.
- It can happen that there are MISSING accounts from Raw but in EDW we take them in.
- Or the other way around, there are EXTRA accounts from Raw but in EDW we don’t have. With this, then you have to extract the accounts that are missing/extra, justify why it is or it is not there in EDW. Then do analysis and find a pattern where possible.

Apples and Oranges

Here goes a simple sample for a datawarehousing project from a bank. (note that this is just simple current accounts and fixed deposits, we're not even talking about equities, derivatives, bonds or forex).


In a dwh project, usually there will be some existing reports from old system to be compared against a new EDW. EDW will consist of ETL (processing source data and implementation of all data transformation) and OLAP (reporting layer where further filters and metrics can be applied). With this, data accuracy will be challenged and it's troublesome to find which part exactly has data gone wrong. Sometimes numbers can be critical, especially in banks. These numbers appearing in the report are read by Marketing or Sales people for their forecast, or even CEOs. A cent lost has to be digged out.


In layman's terms: my edw has 20 apples (15 are apple pies, 5 are apple muffins) and 30 oranges (10 are orange pies, 20 are orange muffins). Existing report shows 30 apples and 30 oranges. Somewhere data has gone wrong, it maybe during tagging of pies and muffins or it maybe there is a certain formula for classifying what is a valid apple and a valid orange. Literally comparing apples to oranges. (I hope it gives a better idea now) :D Depends on the type of project, if numbers dont tally, normally it can NOT go into production until the culprit is found. Or sometimes maybe the existing report is actually wrong. Or sometimes the business requirements are not correctly captured. Endless possibilities.


I'm actually not a SQL expert. I can only do simple queries and subqueries, not the too fancy ones. Think about it, it's quite fun to do those intersect, union to extract the correct data that you need in order to find the culprit.































































































Wednesday, December 16, 2009

Q409 blues

I seemed to make a decision that has several implications, those that I'm not sure whether I can afford the consequences. Just that deep inside I have this strange feeling that I had to.. And the worst part is simple thing as it is, there were some complications along with it of which I didnt expect things to turn out to be. It seems like everyday of November, I was thinking of a reason, an excuse, a strategy to get out of the situation the soonest. Lame..to the point I had to start making stories, or even checking legal constituents hehe? There are certain decisions when made, you can only afford to err on one side. The more indecisive, the more strings attached and the harder it is to be politically correct. I am not good with this, dont hate me. All's well that ends well anyway.

November was also my mom's bday. For a change, I wanted to surprise her. Thank God there were friends who helped to realize this. It's the first time I thought of giving her a 'real' un-informed surprise. I scheduled 2 deliveries within the week courtesy of Red Ribbon, for both her real and lunar bday. Her reaction: priceless. My reaction: bitter sweet. :(



I went to Bintan for a quick weekend getaway. I went there with guilt, stress and usual confusion, and flushed down all these with the unending waves. I went back cleansed and relaxed, hopefully. As perks, unexpectedly I got a free pictorial, check it out at flicker :)






Christmas and New Year. It's always my most favorite holiday, discounting the fact that the most painful things also happened the same time many years ago. Now it still pinches my heart a bit to see the beautifully-lit streets of Orchard now packed with even more glamourous malls, with me alone walking, even if I have companions. Strange that the emptiness doesnt get filled by just anyone. (Cool, serves me right) By the way my hausmate was homesick one day and sharing her memories of having christmas trees. Unfortunately I cannot relate, coz I never had one back home. (okay let's not dwell on difficult lives in this post :p) Passed by a store and thought of buying this 'snowflaking' crystal ball, a slight attempt to put a little decor in my room. :)

It's confirmed, my newest nephew coming sometime Feb is a boy! My alagas will be increased to 4. !!!

And I thought my previous offices are already big MNCs. (or maybe I also cannot directly compare the offices). It's quite exag to see the dual monitors per person in each spacious cubicle in the new office (one monitor seems not enough, they want the comfort of dragging screens in 2 display monitors) For me it doesnt matter, most important is I dont want laptop! I dont want carrying heavy stuff to and from office anymore :p, i dont want bringing any work home. so just give me my desktop hihi! Same goes with the hearsay of 2ksgd ergonomic chair we're sitting on (?). I feel barbaric, like coming from mountain inhabitation. Vain, but I like. :D

I just realize how much I love sql and data analysis, now that it seems I wont be doing much of it. For only this once in my career, I felt a sense of accomplishment. It has become a part of me... now is the time to recount our memories, before we go separate ways...*sigh* I will blog in detail in next post.
Merry Christmas everyone! I had 4 consecutive parties since last Friday, despite claiming to be anti-social. hohoho. Santa, grant me a major wish this year... Btw in one of the parties, I was told.. ang lamig ng boses while singing. Excellent, luv it Crazy for you by Madonna! Let me end this post with a christmas song.. Chestnuts roasting on an open fire..... :)