Channel: Yet Another Math Programming Consultant

Simultaneous equation models and data errors

My experience is that using (optimization) models is a great way to provide quality assurance on the data. Data collection can be very difficult and expensive. It can also be quite easy to have errors cropping up somewhere along the way. When using an optimization model, we will get hammered by such errors.

The models I work on are largely based on simultaneous equations. In my world, an optimization problem is a system of equations plus an objective. My argument here is that such models are very sensitive to data errors. Much more so than, say, statistical analyses or machine learning models. That may sound really bad. But we also can turn this around and say that simultaneous equation models can be fantastic tools to stress test data sets.

Here I try to demonstrate the issue with a small model. The idea is to introduce a small error in a somewhat large data matrix (large enough that a single wrong entry is difficult to detect) and see how this affects the solution of a linear model \(\color{darkblue}A\color{darkred}x=\color{darkblue}b\).

Experiment

  1. Generate random data \(\color{darkblue}A\) and \(\color{darkblue}b\).
  2. Solve \(\color{darkblue}A\color{darkred}x=\color{darkblue}b\).
  3. Introduce an error in \(\color{darkblue}A\).
  4. Re-solve and observe the effects on the solution.
  5. Tremble in fear.

Here I change just one number in \(\color{darkblue}A\): I shift the decimal point by one position in a single element (in the appendix code, \(\color{darkblue}a_{i3,i5}\) is divided by 10). For a \(50 \times 50\) matrix this gives the following changes in \(\color{darkred}x\):


----     61 PARAMETER result  collected results

       original   perturbed        diff       %diff   different    signflip

i1       -1.049      -1.234       0.185      17.624   different
i2       -0.413      -0.294       0.119      28.851   different
i3        0.034      -0.163       0.197     572.422   different    signflip
i4       -0.761      -0.653       0.108      14.165   different
i5       -0.569      -0.546       0.023       4.002
i6       -0.333      -0.283       0.050      15.110   different
i7        0.330       0.371       0.041      12.302   different
i8       -0.128      -0.081       0.047      36.661   different
i9        0.358       0.259       0.099      27.651   different
i10      -0.279      -0.413       0.134      48.130   different
i11      -0.305      -0.418       0.113      37.000   different
i12       0.665       0.568       0.097      14.627   different
i13      -0.413      -0.549       0.136      32.948   different
i14      -0.885      -0.989       0.105      11.833   different
i15      -0.552      -0.542       0.010       1.793
i16      -0.652      -0.691       0.039       5.982
i17      -0.244      -0.162       0.082      33.752   different
i18       0.596       0.452       0.144      24.186   different
i19      -0.143      -0.018       0.124      87.086   different
i20      -0.465      -0.466       0.002       0.378
i21      -0.080       0.185       0.265     331.702   different    signflip
i22       0.767       0.789       0.022       2.890
i23       0.337       0.287       0.049      14.672   different
i24      -0.199      -0.120       0.079      39.891   different
i25      -0.800      -0.810       0.010       1.225
i26      -0.567      -0.557       0.009       1.672
i27      -0.341      -0.367       0.026       7.583
i28      -0.780      -0.921       0.141      18.025   different
i29       0.259       0.299       0.040      15.386   different
i30       0.219       0.077       0.143      65.061   different
i31      -0.570      -0.847       0.276      48.410   different
i32       0.235       0.244       0.009       3.896
i33       0.081      -0.042       0.123     151.192   different    signflip
i34       0.099       0.039       0.060      60.201   different
i35      -0.106      -0.052       0.053      50.595   different
i36       0.213       0.246       0.032      15.176   different
i37       0.310       0.304       0.005       1.771
i38      -0.012       0.225       0.238    1927.458   different    signflip
i39      -0.752      -0.677       0.075       9.989
i40       0.135       0.019       0.117      86.159   different
i41       0.270       0.179       0.090      33.449   different
i42       0.308       0.491       0.183      59.361   different
i43      -0.416      -0.428       0.012       2.953
i44       0.093       0.419       0.326     349.219   different
i45      -0.033      -0.163       0.130     399.702   different
i46       0.324       0.151       0.173      53.330   different
i47      -0.041      -0.134       0.094     229.921   different
i48       0.140       0.190       0.050      35.987   different
i49       0.851       1.024       0.173      20.391   different
i50      -0.210      -0.109       0.101      47.879   different

The values \(\color{darkred}x_i\) that change by more than 10% are marked as "different": 38 of the 50 values. Notice that four of them even show a change in sign.
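
For reference, the reporting columns can be written out as formulas. This is a summary of the definitions in the GAMS code in the appendix; the symbols \(\tilde{x}_i\) (perturbed solution), \(d_i\) (the "diff" column) and \(r_i\) (the "%diff" column) are notation introduced here only for this summary.

\[
d_i = \left|{\color{darkred}x}_i - {\color{darkred}\tilde{x}}_i\right|, \qquad r_i = 100\cdot\frac{d_i}{\max\left(10^{-4},\,\left|{\color{darkred}x}_i\right|\right)}
\]

A row is marked "different" when \(r_i > 10\) and "signflip" when \({\color{darkred}x}_i\,{\color{darkred}\tilde{x}}_i < 0\). The floor of 0.0001 in the denominator guards against division by zero. Rows whose original value is close to zero, such as i38, show very large percentages even though their absolute change is comparable to that of other rows.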

We somehow have the intuition that a few isolated errors have limited effect. This is often not at all the case, as is demonstrated here. A database with, say, a 0.01% error rate sounds good. But it isn't. 
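
One way to see why a single wrong entry can spread through the whole solution is the Sherman-Morrison formula for rank-one updates. What follows is only a sketch: the symbols \(\delta\), \(p\), \(q\) and the unit vectors \(e_p, e_q\) are notation introduced here, and both the original and the perturbed matrix are assumed to be nonsingular. Changing element \(a_{p,q}\) by an amount \(\delta\) gives

\[
{\color{darkblue}\tilde{A}} = {\color{darkblue}A} + {\color{darkblue}\delta}\, e_p e_q^T
\qquad\Longrightarrow\qquad
{\color{darkred}\tilde{x}} = {\color{darkred}x} - \frac{{\color{darkblue}\delta}\,{\color{darkred}x}_q}{1 + {\color{darkblue}\delta}\,\left({\color{darkblue}A}^{-1}\right)_{q,p}}\; {\color{darkblue}A}^{-1} e_p
\]

where \({\color{darkred}x} = {\color{darkblue}A}^{-1}{\color{darkblue}b}\) and \({\color{darkred}\tilde{x}} = {\color{darkblue}\tilde{A}}^{-1}{\color{darkblue}b}\). In the experiment the changed element sits in row i3 and column i5, and dividing it by 10 corresponds to \(\delta = -0.9\,a_{i3,i5}\). The correction term is a multiple of a full column of \({\color{darkblue}A}^{-1}\). The inverse of a dense matrix is in general dense as well, so in general every component of \({\color{darkred}x}\) moves, not just the ones "near" the error.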

The problem sketched here is made much worse if data comes from different sources (which is usually the case). Many economists involved in modeling spend considerable time creating consistent and calibrated data sets. This is why. Another big concern is using very large data sets: the probability of encountering multiple issues quickly converges to 1, and the effort needed to detect and fix things increases substantially.  
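
A back-of-the-envelope calculation makes this concrete. Assume, purely for illustration, that each of \(n\) data entries is wrong independently with probability \(p\). Real errors are often correlated, so this is only a rough sketch.

\[
P(\text{at least one error}) = 1-(1-p)^n
\]

With \(p = 0.0001\) (the 0.01% error rate) and the \(n = 2500\) entries of a \(50\times 50\) matrix, this gives \(1 - 0.9999^{2500} \approx 0.22\). For a data set with \(n = 10^6\) entries the expected number of errors is \(np = 100\), and the probability of at least one error is indistinguishable from 1.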

Here I considered a numerical error. In practice, we also see many other errors. Missing data, suppressed data, duplicates, logical errors, data extraction errors, inconsistent use of units, undocumented exceptions, classification inconsistencies, truncation errors, software limits, differences in locale settings, and definitional problems are just a start. Even something as silly as NULL vs. 0 vs. NA can be a major headache.

I am also very afraid to use live databases. To make runs reproducible, we need to work with fixed data, not something that can change without us even knowing. This can easily lead to losing your sanity and subsequent involuntary commitment. My approach: always take a snapshot and work with that. If technically possible, make the data read-only. In other words, stale data is underappreciated.

The same story can be told for production planning models. Even if they tell me, "our data is very good", it is almost guaranteed that I will find data errors during the development of the model. 

Conclusion


The goal of this little experiment is to scare the *** out of you. Just one small error in the data can lead to total disaster. I think everyone knows this in the back of one's mind. However, to be confronted with this phenomenon this way is certainly revealing.

There are many textbooks on optimization. Admittedly, I have read only a tiny fraction of them. In the ones I have seen, this subject is hardly mentioned, if at all. That inattention is not warranted: these data problems are a major issue when doing real optimization. A possible reason is that authors tend to have a theoretical background in optimization rather than time spent in the trenches.

Appendix: GAMS model


$onText

  Experiment:
  show the difference in results for Ax=b when we change one element in
  A a little bit.

$offText

*--------------------------------------------------------
* data
*--------------------------------------------------------

set i /i1*i50/;
alias(i,j);

parameter A(i,j),b(i);
A(i,j) = uniform(-10,10);
b(i) = uniform(-10,10);

*--------------------------------------------------------
* solve Ax=b
*--------------------------------------------------------

variable x(i);

Equation
   lineq(i) 'linear equations'
;

lineq(i)..  sum(j, A(i,j)*x(j)) =e= b(i);

model m /all/;
solve m using cns;
display x.l;

parameter result(i,*) 'collected results';
result(j,'original') = x.l(j);

*--------------------------------------------------------
* introduce an error in A
*--------------------------------------------------------

* shift decimal point by one in one single number
a('i3','i5') = a('i3','i5')/10;

solve m using cns;
display x.l;
result(j,'perturbed') = x.l(j);
display result;

*--------------------------------------------------------
* report differences in solution x
*--------------------------------------------------------

acronym different,signflip;
result(j,'diff') = abs(result(j,'original')-result(j,'perturbed'));
result(j,'%diff') = 100*result(j,'diff')/max(0.0001,abs(result(j,'original')));
result(j,'different')$(result(j,'%diff')>10) = different;
result(j,'signflip')$(result(j,'original')*result(j,'perturbed') < 0) = signflip;
display result;

 

 

