If you are working with large tables of millions of records or more, you will probably want to enable cache restrictions on the Lookup transformation.
By default, caching is unrestricted, so the whole reference dataset is loaded into memory. In one of our cases, a 39 million row lookup failed because we ran out of memory. Restricting the cache to 500 MB let the package complete, but it took 5 hours. That is too long, so we needed another way to trim down the cache, or the size of the table being used for the lookup.
To do this we need to figure out an algorithm that will make the lookup table smaller. Answer:
Stop using the Lookup transformation. Instead, capture the max ID of the primary key column into a variable using an Execute SQL Task. The column is assumed to be an Identity column on the source.
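The Execute SQL Task can run a simple query like the one below against the table you were previously using for the lookup. The table and column names here are placeholders, not from the original post; set the task's Result Set to "Single row" and map the `MaxID` column to your Object variable.

```sql
-- Placeholder names: substitute your own table and identity column.
SELECT MAX(SourceID) AS MaxID
FROM dbo.SourceTable;
```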
Capturing the ID required a workaround with two variables:
- Capture the query result into an Object type variable.
- Use a Script Task to convert the Object variable into an Int64 variable.
- Use the Int64 variable in a Conditional Split transformation inside the data flow.
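A minimal sketch of the Script Task conversion. The variable names `User::MaxIdObject` and `User::MaxId` are assumptions for illustration; list them in the task's ReadOnlyVariables and ReadWriteVariables respectively.

```csharp
// SSIS Script Task (C#).
// ReadOnlyVariables:  User::MaxIdObject  (Object, filled by the Execute SQL Task)
// ReadWriteVariables: User::MaxId        (Int64)
public void Main()
{
    // The single-row result set lands in the Object variable as a boxed value.
    object raw = Dts.Variables["User::MaxIdObject"].Value;

    // Convert it to Int64 so it can be referenced in a Conditional Split expression.
    Dts.Variables["User::MaxId"].Value = Convert.ToInt64(raw);

    Dts.TaskResult = (int)ScriptResults.Success;
}
```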
In the Conditional Split, compare the max ID to the current ID coming through the data flow:
- If the current ID is greater than the max ID, route the row to an INSERT.
- Otherwise, route it to an UPDATE.
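In the Conditional Split editor, the condition on the INSERT output might look like the following SSIS expression (the column name `CurrentID` is a placeholder for your own key column):

```
[CurrentID] > @[User::MaxId]
```

Rows matching the condition go to the INSERT path (an OLE DB Destination); everything else falls through to the default output for the UPDATE path.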