vendredi 18 août 2017

Understanding the basics behind a profiling report - Part 3: other profiling data

Hi!

Third and last part about the textual profiling report! (Here are the links to Part 1, giving general information about the profiling, and to Part 2, about the statistical data and the distribution of the execution time.)

Part 3 - Other useful data about the profiling 


So, this is how the last part looks when I profiled 1 to: 100000000 do: [ :i | Object new ]. As you can see, there are 4 parts, each dedicated to a specific aspect of the VM.

**Memory**
old +2,007,520 bytes
free +0 bytes

**GCs**
full 0 totalling 0ms (0.000% elapsed time)
scavenges 544 totalling 79ms (5.607% elapsed time), avg 0.145ms
tenures 10,893 (avg 0 GCs/tenure)
root table 0 overflows

**Compiled Code Compactions**
0 totalling 0ms (0.000% elapsed time)

**Events**
Process switches 6 (4 per second)
ioProcessEvents calls 69 (49 per second)
Interrupt checks 1170 (830 per second)
Event checks 1170 (830 per second)
Stack overflows 0 (0 per second)
Stack page divorces 0 (0 per second)

The first two parts are also present in the report provided when profiling with MessageTally, and this post and this one explain what they are about.
Let's have a quick look at each of these parts, especially the ones not appearing in the MessageTally report:

1 & 2. Memory and GC's


**Memory**
old +2,007,520 bytes
free +0 bytes

**GCs**
full 0 totalling 0ms (0.000% elapsed time)
scavenges 544 totalling 79ms (5.607% elapsed time), avg 0.145ms
tenures 10,893 (avg 0 GCs/tenure)
root table 0 overflows

I will not go into the details of these two parts, as they are already explained in the articles referred to at the beginning of this post.

3. Compiled Code Compactions


**Compiled Code Compactions**
0 totalling 0ms (0.000% elapsed time)

This part counts the compiled code compactions. As mentioned in the installation tutorial of the VMProfiler, this kind of compaction occurs when the Cog method zone is full. A high (and time-consuming) number of compactions may indicate that you need to increase the Cog method zone size.

4. Events


**Events**
Process switches 6 (4 per second)
ioProcessEvents calls 69 (49 per second)
Interrupt checks 1170 (830 per second)
Event checks 1170 (830 per second)
Stack overflows 0 (0 per second)
Stack page divorces 0 (0 per second)


This part lists the different kinds of events that can happen during the profiling.
Process switches is self-explanatory.
ioProcessEvents calls is the number of calls to the ioProcessEvents function, explained here.
Interrupt checks counts the number of interrupt checks (for instance, whether cmd+. was pressed).
Event checks counts the number of event checks (for instance, whether a mouse button was clicked).
Stack overflows happen when no stack page is available.
Stack page divorces can happen in various cases; they can be linked to stack overflows (but not always!).

Any of these counters can indicate that something is wrong, if its value doesn't match the behaviour you expect from your code.

... OK, nice, but what should I do now?


This profiler is a really nice tool, especially if you want to understand what's going on in the VM when you execute your code, or if you are contributing to the VM and want to understand precisely the impact of your modifications.

The example on which the article is based, while good enough to explain the content of a profiling report, doesn't show what to do when the profiling data is unusual. Indeed, nothing seems to go wrong in this example.
But what does "going wrong" mean? Typically, it is when you spot figures that look uncommon given the expected behaviour of your code. For instance:
  • a lot of full GC's,
  • a lot of compiled code compactions,
  • a lot of stack overflows,
  • a lot of stack pages divorces,
  • ... or of course, an (unexpected) large amount of time spent in some function(s).
If your program is slow, there are hardly any out-of-the-box solutions to make it faster. The solutions are usually completely dependent on your case, and I cannot provide generic ones.

However, there are a few things that you can try if you run into one of these cases. Don't forget to save your work :)
  • If the problem is about an unusual amount of GC's, you can try the solutions described in this article (in the GC tuning part). 
  • About the compiled code compactions, you can try to set a bigger machine code zone, using a script similar to this one (you'll have to restart your image afterwards):

Smalltalk vm parameterAt: 47 put: (Smalltalk vm parameterAt: 46) * 2

(Parameter 46 being the size of the machine code zone, parameter 47 being the desired size of the machine code zone.)
  • In general, you can check the different parameters by browsing SmalltalkImage>>vmParameterAt:, but a lot of them are read-only. 
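If you want to see what the parameters currently hold, you can dump them to the Transcript with the same SmalltalkImage>>vmParameterAt: API; a minimal sketch (the range 1 to 60 is an arbitrary choice of mine, and some indices may answer nil):

```smalltalk
"Dump the first 60 VM parameters with their indices, so the values
discussed in the report (eden size, code size...) can be compared
against the running image."
(1 to: 60) do: [ :i |
	Transcript
		show: i printString;
		show: ' -> ';
		show: (Smalltalk vm parameterAt: i) printString;
		cr ]
```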

I hope you enjoyed these posts about the report! The next post should deal with the evolution of the granularity of the report.

jeudi 17 août 2017

Understanding the basics behind a profiling report - Part 2: distribution of the execution time

Hi! This article directly follows the first one, and explains what's going on in the second part of the profiling report.

First, some things to remember :
1) When profiling the same code several times, you will not get exactly the same results or the exact same percentages. Also, depending on what you are doing in parallel in your image, some extra results may show up in the report (it is quite unusual, but it can happen).
2) If you want to profile code that runs for only a short amount of time, you'll need to execute it several times to get a significant number of samples (remember the profiler samples at ~1.3 kHz).
3) If you want to use the profiler to understand why your program is slow, and how you could improve its code, I suggest you first use MessageTally (in Pharo, right-click and "Profile it"; in Squeak, right-click and "spy on it"). You can read this chapter about profiling in the Deep Into Pharo book, and this post.
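Point 2 is worth a concrete illustration: a fast expression must be wrapped in a repetition loop so the sampler has enough time to work. A minimal sketch, reusing the expression profiled later in this post:

```smalltalk
"Profiling a single 'Object new' would yield almost no samples at
~1.3 kHz, so the allocation is repeated inside the profiled block."
MessageTally spyOn: [ 1 to: 100000000 do: [ :i | Object new ] ]
```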

Part 2 - Distribution of the execution time



So, let's have a look at this part of the report :

% of generated vm code (% of total) (samples) (cumulative)
25.55%    (24.04%) Behavior>>new (449) (25.55%)
24.36%    (22.91%) UndefinedObject>>DoIt    (428) (49.91%)
22.25%    (20.93%) Behavior>>basicNew (391) (72.17%)
19.07%    (17.93%) PIC basicNew (335) (91.24%)
  5.98%   (  5.62%) PIC initialize (105) (97.21%)
  2.79%   (  2.62%) ProtoObject>>initialize (49) (100.0%)

% of vanilla vm code (% of total) (samples) (cumulative)
34.57%    (  1.50%) scavengeReferentsOf (28) (34.57%)
23.46%    (  1.02%) doScavenge (19) (58.02%)
22.22%    (  0.96%) copyAndForward (18) (80.25%)
  9.88%   (  0.43%) shouldRemapObj (8) (90.12%)
  3.70%   (  0.16%) mapStackPages (3) (93.83%)
  2.47%   (  0.11%) scavengingGCTenuringIf (2) (96.30%)
  1.23%   (  0.05%) addToFreeListbytes (1) (97.53%)
  1.23%   (  0.05%) mapInterpreterOops (1) (98.77%)
  1.23%   (  0.05%) ioUTCMicroseconds (1) (100.0%)

30 samples in the rest    1.61% of total
% of rest (% of total) (samples) (cumulative)
100.0%    (  1.61%) Samples Not In Any Function (30) (100.0%)

It displays how the execution time is divided up between two types of code: the generated vm code and the vanilla vm code. In both parts, the results are displayed the same way: you get a list of functions and how the execution time is divided up between them.



1. Generated vm code 


First, the generated vm code part.
It deals with the code generated by the JIT compiler of the VM, that is to say all the functions, trampolines/enilopmarts and PICs present in the machine code zone (remember the "Code size" displayed at the very beginning of the report? It refers to the size of this very machine code zone).
You can tell the different types of functions apart as follows:

  • Class>>selector refers to a native code function, i.e. a method that has been optimized by the JIT and converted to native code (like "Behavior>>new")
  • PIC selector refers to a PIC, which can be either an open PIC or a closed PIC (like "PIC basicNew") (for more information about PICs and what they are, you can refer to this and this. This short article is very clear as well)
  • ce... refers to trampolines/enilopmarts (not pictured in this example; for instance, "ceTraceStoreTrampoline")

Profiling 1 to: 100000000 do: [ :i | Object new ] gave this result:

% of generated vm code (% of total) (samples) (cumulative)
25.55%    (24.04%) Behavior>>new (449) (25.55%)
24.36%    (22.91%) UndefinedObject>>DoIt    (428) (49.91%)
22.25%    (20.93%) Behavior>>basicNew (391) (72.17%)
19.07%    (17.93%) PIC basicNew (335) (91.24%)
  5.98%   (  5.62%) PIC initialize (105) (97.21%)
  2.79%   (  2.62%) ProtoObject>>initialize (49) (100.0%)


By looking at these results, we can say that 25.55% of the time spent in generated vm code was spent in the Behavior>>new function, corresponding to 24.04% of the total execution time (disregarding the generated/vanilla distinction). During the profiling, 449 samples referred to Behavior>>new, and as it is the first row of the list, its cumulative share of the generated vm code is also 25.55%.
We can roughly trace what happened in the VM when Object new was executed 100000000 times.
Indeed, Object new triggers Behavior>>new, and if you browse Behavior>>new, you'll find:

Behavior>>new

^ self basicNew initialize

That's the reason why you come across Behavior>>new, Behavior>>basicNew and ProtoObject>>initialize. As basicNew and initialize are sent frequently, their send sites are soon turned into PICs. The profiler can also tell whether a PIC is an open or a closed one, but that will be the topic of another post.
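To see a send site become polymorphic, here is a sketch of an expression whose single #printString send site receives objects of several classes (the exact thresholds at which Cog upgrades the site to a closed and then an open PIC are implementation details, so take the comment as a rough picture):

```smalltalk
"One #printString send site, many receiver classes: the JIT caches the
first class seen, extends the cache into a closed PIC as further
classes show up, and falls back to an open PIC when the closed PIC
overflows."
{ 1. 'a string'. $c. 3.14. #aSymbol. nil } do: [ :each |
	Transcript show: each printString; cr ]
```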

2. Vanilla vm code 

So this is it for the generated vm code part, let's have a look at the vanilla vm code part :

% of vanilla vm code (% of total) (samples) (cumulative)
34.57%    (  1.50%) scavengeReferentsOf (28) (34.57%)
23.46%    (  1.02%) doScavenge (19) (58.02%)
22.22%    (  0.96%) copyAndForward (18) (80.25%)
  9.88%   (  0.43%) shouldRemapObj (8) (90.12%)
  3.70%   (  0.16%) mapStackPages (3) (93.83%)
  2.47%   (  0.11%) scavengingGCTenuringIf (2) (96.30%)
  1.23%   (  0.05%) addToFreeListbytes (1) (97.53%)
  1.23%   (  0.05%) mapInterpreterOops (1) (98.77%)
  1.23%   (  0.05%) ioUTCMicroseconds (1) (100.0%)

I am not nearly as familiar with this part as with the generated vm code one. But as usual, when you want to know what these functions do, it is best to look directly in the VMMaker package!
Usually, at least a few of these functions relate to garbage collection. Here, this is indeed the case for the first four functions. So according to these results, 90.12% of the time spent in the vanilla vm code was spent scavenging (see the 90.12% in the cumulative column). I copied below the comments of these functions.

  • scavengeReferentsOf: inspects all the pointers in referrer.  If any are new objects, it has them moved to FutureSurvivorSpace, and answers truth. If there are no new referents, it answers falsity
  • doScavenge: the inner shell for scavenge, abstracted out so globalGarbageCollect can use it.
  • copyAndForward: copies a survivor object either to futureSurvivorSpace or, if it is to be promoted, to oldSpace.
  • shouldRemapObj: Answer if the obj should be scavenged (or simply followed)

If you want more global information about memory management, you can have a look directly at the third part of the report, in which you will find the total number of full GC's, scavenges, tenures...

3. Samples in the rest 

The last samples, those that could not be linked to a known symbol, are gathered in this part. Usually, they represent a non-significant number of samples.

30 samples in the rest    1.61% of total
% of rest (% of total) (samples) (cumulative)
100.0%    (  1.61%) Samples Not In Any Function (30) (100.0%)


The next article will be about the third part of the report, which gathers data about memory usage, among other things. It will also provide some insights about how to tune the VM according to the results.
Thanks for reading!

Understanding the basics behind a profiling report - Part 1: general data

Hi!

So in my last post about the VM profiler, I mostly described how to install the profiler in Pharo and roughly how to read the textual report. 
This time, I will go through a report in detail, try to explain how the profiling works, and refer to relevant articles. The profiling report can be divided into 3 parts, and each part will be the subject of a blog post. I won't go into the details of the VMProfiler code; if you're interested in that part or if you are already familiar with the VM profiler, I suggest you read this article from Eliot Miranda, which gives you a more precise insight into how the profiler works.

So, let's begin :


Part 1 - General data about the profiling 


Let's profile the following block: 1 to: 100000000 do: [ :i | Object new ]. What does the first part of the report say?

/Users/sophiekaleba/Documents/WIP-part1/Pharo/Pharo.app/Contents/MacOS/Pharo 2017-07-10 15:35:15
eden size: 3,801,936  stack pages: 50  code size: 1,048,576
1 to: 100000000 do: [ :i | Object new ]
gc prior.  clear prior.
1.409 seconds; sampling frequency 1326 hz
1838 samples in the VM (1868 samples in the entire program)  98.39% of total
1757 samples in generated vm code 95.59% of entire vm (94.06% of total)
81 samples in vanilla vm code   4.41% of entire vm (  4.34% of total)


/Users/sophiekaleba/Documents/WIP-part1/Pharo/Pharo.app/Contents/MacOS/Pharo 2017-07-10 15:35:15
That's the path of the VM used, and the date and time of the profiling. Fairly easy for now.

eden size: 3,801,936  stack pages: 50  code size: 1,048,576
Ok. A bit more complex for the uninitiated.
The eden size (in bytes) refers to the size of the "young space" in the virtual machine (Cog) memory. This is the space where newly allocated objects are stored. If you want to learn more about how the Cog memory is divided and why, I deeply recommend you read this article.
This data is relevant in this report because, in some cases, the size of the eden can cause an increase in execution time: for instance, if the eden is too small and the running program allocates a very large number of objects, it can trigger a lot of scavenges (basically, copying the objects that survive into a dedicated survivor space). A scavenge is performed each time a specific threshold is reached in the eden.
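If you suspect the eden is too small for your workload, you can read and resize it through the vmParameterAt: API used later in this series; a sketch, assuming (from the SmalltalkImage>>vmParameterAt: method comment) that parameter 44 is the current eden size and parameter 45 the desired one:

```smalltalk
"Read the current eden size, then request a doubled eden; the new size
only takes effect after saving and restarting the image. The parameter
indices 44/45 are assumptions to double-check in your own image."
Transcript show: (Smalltalk vm parameterAt: 44) printString; cr.
Smalltalk vm parameterAt: 45 put: (Smalltalk vm parameterAt: 44) * 2
```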

Stack pages. The VM represents the stack as a linked list of stack pages. By default, the number of stack pages is set to 50. Like the eden size, this piece of information is significant: in some specific cases, 50 pages are not enough, which can lead to an increase in execution time.

Code size. This one deals again with the Cog memory (you can refer to the same article). It refers to the size of the Cog method zone, i.e. the memory space containing trampolines/enilopmarts, Cog methods, CogFullBlocks, PICs, and a linked list of elements with young referents. 
This zone, when full, gets compacted, which can have an impact on execution time, especially if compactions occur frequently. It is possible to increase the code size to reduce the number of compactions and speed up execution.

 1 to: 100000000 do: [ :i | Object new ] 
The block you profiled.

gc prior.  clear prior.
gc prior. A GC is performed just before the profiling starts, to make sure we start from a fresh environment (so we know that if a GC is performed during the profiling, it is actually caused by the execution of the block).
clear prior. Besides the GC, some other operations are performed before the profiling starts: making sure that the buffer containing the samples is cleared, and initializing the profiling time and the data structure (a Bag) that will later hold the samples.

1.409 seconds; sampling frequency 1326 hz
Total time of the profiling, and the sampling frequency, usually between 1.3 and 1.4 kHz.
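These two figures are consistent with the sample count given on the next line; multiplying the duration by the frequency should approximate the total number of samples:

```smalltalk
"1.409 seconds sampled at 1326 Hz gives about 1868 samples, matching
the 1868 samples reported for the entire program."
1.409 * 1326  "about 1868.334"
```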

1838 samples in the VM (1868 samples in the entire program)  98.39% of total
Total number of samples gathered during the profiling. A sample is basically an instruction pointer, pointing at the currently executed function. So we can simplistically say that a sample refers to a function. I will be more specific about that part in another post (soon).
A distinction is made between the "samples in the VM" and the "samples in the entire program": indeed, some samples do not refer to functions in the VM, or are simply not identified. These account for the 30-sample difference you can notice above.
All the samples are then split between 3 categories:
  • the samples in generated vm code
  • the samples in vanilla vm code
  • the remaining samples

1757 samples in generated vm code 95.59% of entire vm (94.06% of total)
This is the total number of samples in generated vm code, that is to say all the samples referring to a function, trampoline, enilopmart... in the Cog method zone.

81 samples in vanilla vm code   4.41% of entire vm (  4.34% of total)
This is the total number of samples in vanilla vm code, that is to say all the samples referring to a function in the C code of the VM (typically, the interpreter and the garbage collector).

That's all for this part. If you spot any mistake or have any question, feel free to comment!