Arrays - Benford's Law
Arrays help us to store and work with groups of data of the same type. The data is stored in consecutive memory spaces which can be accessed by using the name of the array and indexes or subscripts that indicate the position where the data is stored. Repetition structures provide us a simple way of accessing the data within an array. In today's laboratory experience, you will practice the use of counters and one dimensional arrays to implement a program in which you will use Benford’s Law to detect files with bogus data.
Objectives:
Practice using arrays of counters to determine data frequency in a file.
Detect the use of bogus data using Benford's Law and the leading digit frequency distribution.
Practice reading data from text files.
Pre-Lab
Before coming to the laboratory, you should have:
Learned how to extract the leading digit (first digit from left to right) of an integer read from a file.
Reviewed basic array and counter concepts.
Reviewed text file input in C++.
Studied the concepts and instructions for this laboratory session.
Taken the Pre-Lab quiz, available in Moodle.
As part of your new job as IT auditor you suspect that someone in the Chicago Transit Authority (CTA) has been tampering with the information systems and changing the data files that contain the bus route's daily totals. You are given five text files that contain daily totals for each of the CTA’s bus routes and you must determine if one or more of the files contain bogus data. In this laboratory experience, you will implement a program that will help you determine which of the file(s) contain bogus data using Benford's Law, a property that is observed in many real-life sources of data.
What is Benford’s Law? (adapted from the ISACA journal [1])
Benford’s Law, named for physicist Frank Benford, who worked on the theory in 1938, is the mathematical theory of leading digits. Specifically, in data sets, the leading digit(s) is (are) distributed in a specific, non uniform way. While one might think that the number 1 would appear as the first digit 11 percent of the time (i.e., one of nine possible numbers), it actually appears about 30 percent of the time (see Figure 1). The number 9, on the other hand, is the first digit less than 5 percent of the time. The theory covers the first digit, second digit, first two digits, last digit and other combinations of digits because the theory is based on a logarithm of probability of occurrence of digits.
Figure 1. Distribution of the leading digit in a real data set according to Benford’s Law. Taken from [1].
How to Use Benford's Law to Spot Bogus Data
In this laboratory experience you will use Benford's Law applied only to the leading digit. To do this, you need to determine the frequency of each leading digit in the numbers of the file. Suppose that you are given a file that contains the following integer numbers:
890 3412 234 143 112 178 112 842 5892 19
777 206 156 900 1138 438 158 978 238 192
As you read each number , you determine its leading digit (how to extract the leading digit is left as an exercise for you). You must also keep track of how many times the same leading digit appears in the data set. The easiest way to keep track of how many times you find a leading 1, a leading 2, ... a leading 9, is to use an array of counters. This array of counters is simply an array of integers where each element is incremented whenever you find a certain leading digit. For instance, for this exercise the array of counters can be an array of 10 integers, initialized to 0
.
Figure 2. Array of 10 integers initialized to 0
.
Every time a leading digit d
is found, the element with index d
is incremented. For example, after reading the numbers 890
, 3412
, 234
, 143
, and 112
, the array content would be:
Figure 3. Contents of the array after reading 890, 3412, 234, 143, 112 and counting their leading digits.
After reading through all the data, the content of each array element will be the number of times that leading digit appears in the data.
Figure 4. Contents of the array after reading all the data.
Frequency of Occurrence
The frequency of occurrence is defined as the ratio of times that a digit appears divided by the total number of data. For example, the frequency of leading digit 1
in the example would be computed as . Histograms are the preferred visualization of frequency distributions in a data set. In essence, a histogram is a bar chart where the -axis is the frequency and a vertical bar is drawn for each of the counted classifications (in our case, for each digit).
Figure 5. Histogram of the frequency of leading digits in the example data.
data.txt
contains the numbers (one below the other). What would be the output of the following program?
data.txt
only contains one number greater than 5, so the result of the program would be 1
nums.txt
contains the numbers (one below the other). What would be the output of the following program?
2 3 4 5 6
2 0 1 2 0
2 3 0 1 2
-1 0 1 2 3
nums.txt
and prints the result of the remainder when dividing by 3. Therefore the result is 2 0 1 2 0
, which are the remainders of 2 / 3, 3 / 3, 4 / 3, 5 / 3, and 6 / 3.
A
just before executing the instruction return 0
?
i
: 0, 1, 2, 3, 4, 5, 6. The instruction A[i%3]++
increases the content of the following elements of A: A[0]
, A[1]
, A[2]
, A[0]
, A[1]
, A[2]
, and once again A[0]
. Therefore, the array A
contains:
C
of size 10. After finishing counting, the contents of C
are {0, 1, 1, 0, 0, 0, 0, 0, 0, 0}
. What is the frequency of the occurrence of the digit 0?
C
) over the total of data values (10 values): .
Reading Data From Text Files in C++
This laboratory experience requires you to read data from a text file. You can skip to the next section if you feel that your file reading skills are competent. Otherwise, read on...
C++ provides functions to read and write data to/from files. In this laboratory experience, you will be using one of the most rudimentary file input/output schemes provided in C++ to read/write from text files. Text files consist exclusively of ASCII characters which represent data in any of the primitive types provided by C++. Typically, the values are separated by spaces. For instance, let's assume that the file nameAge.txt
contains some data about names and ages.
Tomas 34
Marta 55
Remigio 88
Andrea 43
To read a text file in C++, we need to have a sense of how it is organized and what type of data you would like to read. The example nameAge.txt
file contains four lines, each consisting of a string and an integer. Here is a simple program to read that file entirely while printing its content. Read the comments to understand the various parts.
#include <iostream>
// fstream is the header file that contains classes, functions and
// objects to deal with file input and output.
#include <fstream>
using namespace std;
int main(){
// We shall use these two variables to assign the values read
// from each line in the file.
string name;
int age;
// This is the object that will represent the file.
ifstream inFile;
// We call the open function to open the input file `nameAge.txt`
inFile.open("nameAge.txt");
// We check if the file was correctly opened
if (!inFile.is_open()) {
cout << "Error openning file nameAge.txt\n";
exit(1);
}
// While there is data in the file, read a string and an int.
// Notice how the `>>` symbol is used, similar to when using cin
while (inFile >> name >> age) {
cout << name << " : " << age << endl;
}
// Close the file.
inFile.close();
return 0;
}
The ifstream
object is used for reading a text file sequentially. It keeps track of the next position in the file that should be read. Each time that a data is read from the file (using inFile >> ____
) it advances its position so that the next inFile >> ___
reads the next data and so forth.
Notice the line inFile >> name >> age
. This instruction accomplishes several tasks:
- It reads a
string
and anint
from the file (if available) and assigns them to the variablesname
andage
. - If both data were read, the expression evaluates to
true
, thus entering the while block. - If both data could not be read, the expression evaluates to
false
thus ending the while block.
Here are some code snippets for common reading tasks. Observe that all of them:
- Create an
ifstream
object, call theopen
function and check if the file is opened correctly. - Create one or more variables to assign the values that are read from the file.
- Implement a loop which repeats until no more data is available in the file.
close
the file at the end.
Example 1: Read a file that consists only of integers, and accumulate their values into a sum.
ifstream inFile;
int n;
int accum = 0;
inFile.open("nums.txt");
if (!inFile.is_open()) {
cout << "Error opening file nums.txt\n";
exit(1);
}
while (inFile >> n) {
accum = accum + n;
}
cout << "Total: " << accum << endl;
inFile.close();
Example 2: Count the number of lines in a file that consists of names. Then choose the name at the center line.
ifstream inFile;
string name;
int ctr = 0;
inFile.open("names.txt");
if (!inFile.is_open()) {
cout << "Error opening file names.txt\n";
exit(1);
}
while (inFile >> name) {
ctr++;
}
cout << "Total number of lines: " << ctr << endl;
// These two commands "rewind" the file so that we can start
// reading again from the beginning.
inFile.clear();
inFile.seekg(0);
for (int i = 0; i <= ctr / 2; i++) {
inFile >> name;
}
cout << "The name at the position " << ctr / 2 << ": " << name << endl;
inFile.close();
Laboratory session:
Exercise 1 - Understand the Data Files and the Provided Code
Instructions
Load the project
BenfordsLaw
intoQtCreator
. There are two ways to do this:- Using the virtual machine: Double click the file
BenfordsLaw
.prolocated in the folder
/home/eip/labs/arrays-benfordslaw` of your virtual machine. - Downloading the project’s folder from
Bitbucket
: Use a terminal and write the commandgit clone http:/bitbucket.org/eip-uprrp/arrays-benfordslaw
to download the folderarrays-benfordslaw
fromBitbucket
. Double click the fileBenfordsLaw.pro
located in the folder that you downloaded to your computer.
- Using the virtual machine: Double click the file
The text files
cta-a.txt
,cta-b.txt
,cta-c.txt
,cta-d.txt
, andcta-e.txt
in thedata
directory contain either real or bogus data. Each line of the file specifies the bus route code and the number of users for that route on a certain day. Open the filecta-a.txt
to understand the data format. This will be important when reading the file sequentially using C++. Notice that some of the route codes contain characters.Open the
main.cpp
file. Study themain
function and make sure that you understand all of its parts. In essence, the providedmain
function creates a window that displays a histogram like the one shown in Figure 6.Figure 6. Result of running the provided program. A histogram is displayed using the data in the arguments
histoNames
andhistoValues
.
In the provided code, notice that the data in the arrays histoNames
and histoValues
were assigned directly in their declarations. For the provided files, your program should compute the frequency of occurrence of the leading digits and then display their histogram by using the histo
method.
Exercise 2 - Implement a Program to Detect Bogus Data in the Data Files
Instructions
Using the provided
main
function as inspiration, add the necessary code to implement a program that reads a text data file like the ones in thedata
folder and determines the frequency of occurrence of the leading digits of the second column of the files. Compute the frequency of occurrence as it was explained before Figure 5.Once your program has the frequency of occurrence of the leading digits, use the method
histo
to display the histogram. Run the program for each of the text files. Based on the leading digit frequency distribution for each file, decide if (according to Benford’s Law) the file contains real or bogus data.
Deliverables
Use "Deliverable 1" in Moodle to upload the
main.cpp
file with the modifications you made in Exercise 2. Remember to use good programming techniques, include the names of the programmers involved, and document your program.Use "Deliverable 2" in Moodle to upload a pdf file that contains screen shots of the histograms produced after analyzing each text file. Please caption each figure with the name of the text file and provide your decision as to whether the file contained real or bogus data.