I have been thinking today about the gcc introspector.
My thought is that I should optimize the time I spend on a job, not so much the execution time.
The human time spent, the skills needed to do a job, and the learning effort involved: these are the issues.
The program that results from the work is a separate issue. It has its own costs: the power it uses, the time it takes to execute, and the space it occupies while running.
There are also issues like downtime, programs breaking, and security.
So, here are my ideas:
First, the compiler should be able to read some source code, or enrich or replace its data from the Doxygen XML. That means that, given an introspection data dump, we should be able to read in the Doxygen output and add in whatever is missing from the compiler. We would also be able to use the Doxygen output as input to the GCC introspector directly.
Second, there is the human issue of types. The types of a program do not matter to the computer. Of course, when those types are used wrongly, the program can crash. But let's get back to the idea. Types are basically data formats, but they also carry meaning. We would like to be able to find similar types, compare types, and visualize types.
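As a rough illustration of what "finding similar types" could mean, here is a minimal sketch that compares two record types by the fields they share. The type descriptions and the choice of Jaccard similarity are my own assumptions for the example; they are not something the introspector defines.

```python
from typing import Dict

# Hypothetical type descriptions: field name -> field type.
Struct = Dict[str, str]


def similarity(a: Struct, b: Struct) -> float:
    """Jaccard similarity over (field name, field type) pairs."""
    fields_a = set(a.items())
    fields_b = set(b.items())
    if not fields_a and not fields_b:
        return 1.0  # two empty types are treated as identical
    return len(fields_a & fields_b) / len(fields_a | fields_b)


# Made-up example types, for illustration only.
point2d = {"x": "int", "y": "int"}
point3d = {"x": "int", "y": "int", "z": "int"}
color = {"r": "unsigned char", "g": "unsigned char", "b": "unsigned char"}

print(similarity(point2d, point3d))  # two shared fields out of three
print(similarity(point2d, color))    # no shared fields
```

A real comparison would probably also want to look at field order, sizes, and how the types are used, not only at names.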
Third, we would like to see the types and how they are related.
We can imagine an RDF datastore of a program as a graph of all the connections in the program. We would be able to see how a type is used and the code that follows from it.
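To make the graph idea concrete, here is a small sketch that keeps a program as subject/predicate/object triples and asks what depends on a given type. The node names and predicates are invented for the example, and plain Python tuples stand in for a real RDF store.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples, as an RDF store might hold them.
triples = [
    ("func:parse_header", "hasParameter", "type:Header"),
    ("func:parse_header", "calls", "func:read_bytes"),
    ("func:write_frame", "hasParameter", "type:Frame"),
    ("func:write_frame", "calls", "func:parse_header"),
    ("type:Frame", "hasField", "type:Header"),
]


def users_of(node, triples):
    """Return every subject that reaches `node` by following edges backwards."""
    incoming = defaultdict(set)
    for subject, _predicate, obj in triples:
        incoming[obj].add(subject)
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for subject in incoming[current]:
            if subject not in seen:
                seen.add(subject)
                stack.append(subject)
    return seen


print(sorted(users_of("type:Header", triples)))
```

The same walk in the other direction would show the code that follows from a type.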
What is important in this equation is the runtime path. We want to see what paths are followed in the program, that is, the code that is actually executed.
This turns into a big debugging exercise. The debugger can show us that, and so can the profiler, DTrace on Solaris, or print statements in the log file. All of those things are indications of what a program does.
But to truly understand a program we need to know the following:
1. The specification of the program that defines the inputs and outputs.
2. The test cases that cover the entire functionality of that program.
3. The audit that shows how the source code relates to the specification.
4. The audit that shows how the test executions exercise that code.
Now we want to get down to the level of instructions being executed on a machine. Let's say we have a virtual machine, and we can add in all kinds of data and annotate each instruction.
So we would have a record for each byte of the input data to the test case. I am thinking of a simple system that reads from stdin and writes to stdout, so we can trace each input that is read. We would assume for the moment that all reads of data come from the test environment. So at a given time we would have a read of some byte of information from a file at some position.
Now the contents of the file are only interesting in that we would like to trace how each byte is processed. For this, we would define the information needed to understand the type of a piece of data as the entire set of instructions executed on it, plus all the data those instructions need.
We would have a block of instructions, or a graph of instructions with a cycle wherever there is a loop. We would see the instructions executed, the registers used (and where that data comes from), the memory used, the cache pages accessed, and so on. This could be provided by the virtual machine.
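Here is a toy sketch of that idea: a made-up register machine that remembers, for every input byte, each instruction that touched a value derived from it. The instruction set (read, addi, add, write) is invented for the example and has nothing to do with a real virtual machine.

```python
from collections import defaultdict


def run(program, data):
    """Execute a toy register program; return (output, per-input-byte trace)."""
    regs = {}                  # register -> (value, input positions it derives from)
    trace = defaultdict(list)  # input position -> list of (pc, instruction)
    out = []
    pos = 0                    # next input position to read
    for pc, ins in enumerate(program):
        op = ins[0]
        if op == "read":       # load the next input byte into a register
            regs[ins[1]] = (data[pos], {pos})
            touched = {pos}
            pos += 1
        elif op == "addi":     # add a constant to a register
            value, origins = regs[ins[1]]
            regs[ins[1]] = ((value + ins[2]) % 256, origins)
            touched = origins
        elif op == "add":      # add the second register into the first
            (a, origins_a), (b, origins_b) = regs[ins[1]], regs[ins[2]]
            regs[ins[1]] = ((a + b) % 256, origins_a | origins_b)
            touched = origins_a | origins_b
        elif op == "write":    # emit a register as one output byte
            value, origins = regs[ins[1]]
            out.append(value)
            touched = origins
        else:
            raise ValueError(f"unknown instruction: {ins!r}")
        for origin in touched:
            trace[origin].append((pc, ins))
    return bytes(out), dict(trace)


program = [
    ("read", "r0"),
    ("read", "r1"),
    ("addi", "r0", 1),
    ("add", "r1", "r0"),
    ("write", "r1"),
]
output, trace = run(program, b"\x01\x02")
for position, steps in sorted(trace.items()):
    print(position, steps)
```

In this sense the list of steps recorded for an input byte is its "type": everything that was done to it.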
Let's imagine that we are running a version of QEMU or some similar tool with full debug information.
Now let's continue. I would like to define metadata as overhead: administrative data that should be minimised. Metadata is like a key to a lock that has been separated from the data for some reason, although the two belong together. Let's say that the universe contains the metadata, and we need to collect it to understand a given problem.
Now the domain-specific problem is not metadata. Let's say I am working on the problem of creating a video for YouTube from an MP3 and a JPG. That is the domain data. All of the data from outside that is processed in the program belongs to the runtime data. We can imagine a stream of data from the input files flowing to the output files. Then we have domain-specific information about the codecs, such as the parameters of the codec. That is also part of the domain, but removed by one level; for the movie, it could be considered metadata. Taken in full, the entire source code and all the processing of the program is the metadata. For example, if you want to know why there is a glitch in the movie at a certain point, you may also need to know what was going on in the program at that point. It might even be an environmental issue, like the power being shut off.
So, we want to be able to trace the entire data flow from inside to outside. For full understanding, we want to trace how the data gets into the program. For example, if an integer is being loaded into a register, where does it come from? Who is the person who added it to the source code, and in what revision? What was the change supposed to do? What was its specification?
Now we would like to model the input data. Let's say that in our example we have frames of data in a movie. We want to be able to replace a given frame of data in memory with a set of frames. We can abstract that data by removing it and reducing it to a description: we would say it is the nth frame of data from the audio input.
That is domain specific, and we would have to model the domain-specific types to say such things. We therefore need a specification of the program, a bug report, or some other type of input as to what its meaning is.
Now we can imagine the pipeline of the processing of a program. We have the flow of input to output using registers and instructions on the way.
We have the flow of values from code being loaded.
Next we would look at the optimizations of the compiler. Changes to the compiler flow into the instructions of the code, and different compiler switches flow into the instructions and registers used.
The code of the compiler in turn flows from the specification of the chips.
Sometimes we do not even have a public document for the chip or the language, so we would take the changes to the compiler as the public documentation.
This is, however, the core problem behind why the compiler is so cryptic: if the specification of the machine is secret, and the specification of the language as well, then why should the compiler be easy to understand? There is a definite conflict between the forces at play here. It is market economics meeting FLOSS.
So, for each byte of the output file, we can then have an entire trace back to all of its sources: source code, input files, environmental changes, and all.
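As a minimal sketch of what such a trace could look like, here every value carries the set of sources it was derived from, and that set is merged whenever values are combined. The file name, the source location, and the revision label are all hypothetical, made up for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Traced:
    """A byte value together with the set of sources it was derived from."""
    value: int
    sources: frozenset = field(default_factory=frozenset)

    def __add__(self, other: "Traced") -> "Traced":
        # Combining two values merges their provenance.
        return Traced((self.value + other.value) % 256,
                      self.sources | other.sources)


def from_input(name, data):
    """Label each input byte with the file name and position it came from."""
    return [Traced(b, frozenset({f"input:{name}@{i}"})) for i, b in enumerate(data)]


def constant(value, where):
    """Label a constant with the (hypothetical) source line and revision."""
    return Traced(value, frozenset({f"source:{where}"}))


# Hypothetical names: song.mp3, mixer.c line 42, revision abc123.
audio = from_input("song.mp3", b"\x10\x20")
gain = constant(3, "mixer.c:42@rev-abc123")
output = [sample + gain for sample in audio]

for i, byte in enumerate(output):
    print(i, byte.value, sorted(byte.sources))
```

Environmental events, like a power failure, could be added as one more kind of source label.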
For this to be processed efficiently we will have to come up with some real optimizations, but it is my basic vision of what the introspector is.
Mike