Friday, August 18, 2017

Understanding the basics behind a profiling report - Part 3: other profiling data

Hi!

Third and last part about the textual profiling report! (Here are the links to part 1, giving general information about the profiling, and to part 2, about the statistical data and the distribution of the execution time.)

Part 3 - Other useful data about the profiling 


So, this is how the last part looks when I profiled 1 to: 100000000 do: [ :i | Object new ]. As you can see, there are four parts, each dedicated to a specific aspect of the VM.

**Memory**
old +2,007,520 bytes
free +0 bytes

**GCs**
full 0 totalling 0ms (0.000% elapsed time)
scavenges 544 totalling 79ms (5.607% elapsed time), avg 0.145ms
tenures 10,893 (avg 0 GCs/tenure)
root table 0 overflows

**Compiled Code Compactions**
0 totalling 0ms (0.000% elapsed time)

**Events**
Process switches 6 (4 per second)
ioProcessEvents calls 69 (49 per second)
Interrupt checks 1170 (830 per second)
Event checks 1170 (830 per second)
Stack overflows 0 (0 per second)
Stack page divorces 0 (0 per second)

The first two parts are also available in the report provided when profiling with MessageTally, and this and this post explain what they are about.
Let's have a quick look at each of these parts, especially the ones not appearing in the MessageTally report:

1 & 2. Memory and GC's


**Memory**
old +2,007,520 bytes
free +0 bytes

**GCs**
full 0 totalling 0ms (0.000% elapsed time)
scavenges 544 totalling 79ms (5.607% elapsed time), avg 0.145ms
tenures 10,893 (avg 0 GCs/tenure)
root table 0 overflows

I will not go into the details of these parts as these are already explained in the articles I am referring to at the beginning of this post.

3. Compiled Code Compactions


**Compiled Code Compactions**
0 totalling 0ms (0.000% elapsed time)

This deals with the number of compiled code compactions. As already mentioned in the VMProfiler installation tutorial, this kind of compaction occurs when the Cog method zone is full. A high (and time-consuming) number of compactions may indicate that you need to increase the Cog method zone size.

4. Events


**Events**
Process switches 6 (4 per second)
ioProcessEvents calls 69 (49 per second)
Interrupt checks 1170 (830 per second)
Event checks 1170 (830 per second)
Stack overflows 0 (0 per second)
Stack page divorces 0 (0 per second)


This part lists the different kinds of events that can happen during the profiling.
Process switches is self-explanatory.
ioProcessEvents calls refers to the number of calls to the ioProcessEvents function, explained here.
Interrupt checks counts the number of interrupt checks (for instance, to see if cmd+. was pressed).
Event checks counts the number of event checks (for instance, to see if a mouse button was clicked).
Stack overflows happen when no stack page is available.
Stack page divorces can happen in various cases; they can be linked to stack overflows (but not always!).

All of these counters can indicate that something is wrong, if their behaviour doesn't match the one you expect.

... OK, nice, but what should I do now?


This profiler is a really nice tool, especially if you want to understand what's going on in the VM when you execute your code, or if you are contributing to the VM and want to understand precisely the impact of your modifications.

The example on which the article is based, while good enough to explain the content of a profiling report, doesn't show what to do when the profiling data are unusual. Indeed, nothing seems to go wrong in this example.
But what does "going wrong" mean? Typically, it is when you spot figures that look uncommon given the expected behaviour of your code. For instance:
  • a lot of full GC's,
  • a lot of compiled code compactions,
  • a lot of stack overflows,
  • a lot of stack page divorces,
  • ... or of course, an (unexpected) large amount of time spent in some function(s).
If your program is slow, there is hardly an out-of-the-box solution to make it faster. The right fix usually depends entirely on your case, and I cannot provide generic solutions.

However, there are a few things that you can try if you run into one of these cases. Don't forget to save your work :)
  • If the problem is about an unusual amount of GC's, you can try the solutions described in this article (in the GC tuning part). 
  • About the compiled code compactions, you can try to set a bigger machine code zone, using a script like the following (you'll have to restart your image afterwards):
Smalltalk vm parameterAt: 47 put: (Smalltalk vm parameterAt: 46) * 2
(parameter 46 being the current size of the machine code zone, parameter 47 being the desired size of the machine code zone)
  • In general, you can check the different parameters by browsing SmalltalkImage>>vmParameterAt:, but a lot of them are read-only. 

I hope you enjoyed these posts about the report! The next post should deal with the evolution of the granularity of the report.

Thursday, August 17, 2017

Understanding the basics behind a profiling report - Part 2: distribution of the execution time

Hi! This article directly follows this one, and explains what's going on in the second part of the profiling report.

First, some things to remember:
1) When profiling the same code several times, you will not get exactly the same results or the exact same percentages. Also, depending on what you are doing in parallel in your image, some extra results may show up in the report (it is quite unusual, but it can happen).
2) If you want to profile code that runs for only a short amount of time, you'll need to execute it more than once to get a significant number of samples (remember the profiler is cadenced at ~1.3 kHz).
3) If you want to use the profiler to understand why your program is slow, and how you could improve its code, I suggest you first use MessageTally (in Pharo, right-click and "Profile it"; in Squeak, right-click and "spy on it"). You can read this chapter about profiling in the Deep Into Pharo manual, and this post.
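For instance, a minimal MessageTally sketch (profiling the same block as in these posts):

MessageTally spyOn: [ 1 to: 100000000 do: [ :i | Object new ] ]

It profiles on the image side and shows in which Smalltalk methods the time is spent, which is often all you need before diving into the VM-level report.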

Part 2 - Distribution of the execution time



So, let's have a look at this part of the report:

% of generated vm code (% of total) (samples) (cumulative)
25.55%    (24.04%) Behavior>>new (449) (25.55%)
24.36%    (22.91%) UndefinedObject>>DoIt    (428) (49.91%)
22.25%    (20.93%) Behavior>>basicNew (391) (72.17%)
19.07%    (17.93%) PIC basicNew (335) (91.24%)
  5.98%   (  5.62%) PIC initialize (105) (97.21%)
  2.79%   (  2.62%) ProtoObject>>initialize (49) (100.0%)

% of vanilla vm code (% of total) (samples) (cumulative)
34.57%    (  1.50%) scavengeReferentsOf (28) (34.57%)
23.46%    (  1.02%) doScavenge (19) (58.02%)
22.22%    (  0.96%) copyAndForward (18) (80.25%)
  9.88%   (  0.43%) shouldRemapObj (8) (90.12%)
  3.70%   (  0.16%) mapStackPages (3) (93.83%)
  2.47%   (  0.11%) scavengingGCTenuringIf (2) (96.30%)
  1.23%   (  0.05%) addToFreeListbytes (1) (97.53%)
  1.23%   (  0.05%) mapInterpreterOops (1) (98.77%)
  1.23%   (  0.05%) ioUTCMicroseconds (1) (100.0%)

30 samples in the rest    1.61% of total
% of rest (% of total) (samples) (cumulative)
100.0%    (  1.61%) Samples Not In Any Function (30) (100.0%)

It displays how the execution time is divided up between two types of code: the generated vm code and the vanilla vm code. In both parts, the results are displayed the same way: you get a list of functions and how the execution time is divided up among these functions.



1. Generated vm code 


First, the generated vm code part.
It deals with the code generated by the JIT compiler of the VM, that is to say all the functions, trampolines/enilopmarts and PICs present in the machine code zone (remember the "Code size" displayed at the very beginning of the report? It refers to the size of the machine code zone I'm talking about right now).
You can distinguish the different types of functions as follows:

  • Class>>selector refers to a native code function, i.e. a method that has been optimized by the JIT and converted to native code (like the "Behavior>>new" one)
  • PIC selector refers to a PIC, which can be either an open PIC or a closed PIC (like "PIC basicNew") (for more information about PICs and what they are, you can refer to this and this; this short article is very clear as well)
  • ce... refers to trampolines/enilopmarts (not pictured in this example, but "ceTraceStoreTrampoline", for instance)

The profiling of 1 to: 100000000 do: [ :i | Object new ] (the same run as in part 1) gave this result:

% of generated vm code (% of total) (samples) (cumulative)
25.55%    (24.04%) Behavior>>new (449) (25.55%)
24.36%    (22.91%) UndefinedObject>>DoIt    (428) (49.91%)
22.25%    (20.93%) Behavior>>basicNew (391) (72.17%)
19.07%    (17.93%) PIC basicNew (335) (91.24%)
  5.98%   (  5.62%) PIC initialize (105) (97.21%)
  2.79%   (  2.62%) ProtoObject>>initialize (49) (100.0%)


By looking at these results, we can say that 25.55% of the time spent in generated vm code was spent in the Behavior>>new function, corresponding to 24.04% of the total execution time (disregarding the generated/vanilla distinction). During the profiling, 449 samples referred to Behavior>>new, which amounts to a cumulative 25.55% of the generated vm code.
We can roughly trace what happened in the VM when Object new was executed 100000000 times.
Indeed, Object new resolves to Behavior>>new, and if you browse Behavior>>new, you'll find:

Behavior>>new

^ self basicNew initialize

That's the reason why you come across Behavior>>new, Behavior>>basicNew and ProtoObject>>initialize. As basicNew and initialize are sent very frequently, their send sites are soon turned into PICs. The profiler is meant to detail whether a PIC is an open or a closed one, but that will be the topic of another post.
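As a toy illustration (not taken from the profiled example), a single send site that sees receivers of several different classes is the typical candidate for a closed PIC:

#( 1 'one' $o 1.0 ) do: [ :each | each printString ]

Here #printString is sent from one site to a SmallInteger, a String, a Character and a Float, so the JIT would typically turn that send site into a closed PIC covering those receiver classes.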

2. Vanilla vm code 

So that's it for the generated vm code part; let's have a look at the vanilla vm code part:

% of vanilla vm code (% of total) (samples) (cumulative)
34.57%    (  1.50%) scavengeReferentsOf (28) (34.57%)
23.46%    (  1.02%) doScavenge (19) (58.02%)
22.22%    (  0.96%) copyAndForward (18) (80.25%)
  9.88%   (  0.43%) shouldRemapObj (8) (90.12%)
  3.70%   (  0.16%) mapStackPages (3) (93.83%)
  2.47%   (  0.11%) scavengingGCTenuringIf (2) (96.30%)
  1.23%   (  0.05%) addToFreeListbytes (1) (97.53%)
  1.23%   (  0.05%) mapInterpreterOops (1) (98.77%)
  1.23%   (  0.05%) ioUTCMicroseconds (1) (100.0%)

I am definitely not as proficient with this part as with the generated vm code. But as usual, when you want to know what these functions do, it is best to have a look directly in the VMMaker package!
Usually, at least a few of these functions will refer to garbage collection. Here, this is indeed the case for the first four functions. So according to these results, we can see that 90.12% of the time spent in the vanilla vm code was spent scavenging (see the 90.12% in the cumulative column). I copied the comments of these functions below.

  • scavengeReferentsOf: inspects all the pointers in referrer.  If any are new objects, it has them moved to FutureSurvivorSpace, and answers truth. If there are no new referents, it answers falsity
  • doScavenge: the inner shell for scavenge, abstracted out so globalGarbageCollect can use it.
  • copyAndForward: copies a survivor object either to futureSurvivorSpace or, if it is to be promoted, to oldSpace.
  • shouldRemapObj: Answer if the obj should be scavenged (or simply followed)

If you want more global information about memory management, you can have a look directly at the third part of the report, in which you will find the total number of full GC's, scavenges, tenures performed...
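If you want to see the two levels of collection from the image side, you can also trigger them by hand (a sketch; both messages exist in Squeak and Pharo, and on Spur VMs garbageCollectMost amounts to a scavenge):

Smalltalk garbageCollectMost.	"quick collection of the young objects (a scavenge)"
Smalltalk garbageCollect.	"full garbage collection of the whole heap"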

3. Samples in the rest 

The remaining samples, which could not be linked to a known symbol, are gathered in this part. Usually, they represent an insignificant number of samples.

30 samples in the rest    1.61% of total
% of rest (% of total) (samples) (cumulative)
100.0%    (  1.61%) Samples Not In Any Function (30) (100.0%)


The next article will be about the third part of the report, which gathers data about memory usage, among other things. It will also provide some insight into how to tune the VM according to the results.
Thanks for reading!

Understanding the basics behind a profiling report - Part 1: general data

Hi!

So in my last post about the VM profiler, I mostly described how to install the profiler in Pharo and roughly how to read the textual report. 
This time, I will go through a report in detail, try to explain how the profiling works, and refer to relevant articles. The profiling report can be divided into 3 parts, and each part will be the subject of a blog post. I won't go into the details of the VMProfiler code; if you're interested in that part, or if you are already familiar with the VM profiler, I suggest you read this article from Eliot Miranda, which gives a more precise insight into how the profiler works.

So, let's begin:


Part 1 - General data about the profiling 


Let's profile the following block: 1 to: 100000000 do: [ :i | Object new ]. What does the first part of the report say?

/Users/sophiekaleba/Documents/WIP-part1/Pharo/Pharo.app/Contents/MacOS/Pharo 2017-07-10 15:35:15
eden size: 3,801,936  stack pages: 50  code size: 1,048,576
1 to: 100000000 do: [ :i | Object new ]
gc prior.  clear prior.
1.409 seconds; sampling frequency 1326 hz
1838 samples in the VM (1868 samples in the entire program)  98.39% of total
1757 samples in generated vm code 95.59% of entire vm (94.06% of total)
81 samples in vanilla vm code   4.41% of entire vm (  4.34% of total)


/Users/sophiekaleba/Documents/WIP-part1/Pharo/Pharo.app/Contents/MacOS/Pharo 2017-07-10 15:35:15
That's the path of the VM used, and the date and time of the profiling. Fairly easy for now.

eden size: 3,801,936  stack pages: 50  code size: 1,048,576
Ok. A bit more complex for the uninitiated.
The eden size (in bytes) refers to the size of the "young space" in the virtual machine (Cog) memory. This is the space where newly allocated objects are stored. If you want to learn more about how the Cog memory is divided and why, I strongly recommend you read this article.
This data is relevant in this report because, in some cases, the size of the eden can cause an increase in execution time: for instance, if the eden is too small and the running program allocates a very large number of objects, it can trigger a lot of scavenges (basically, moving the surviving objects from the eden to a survivor area). A scavenge is performed once a specific threshold is reached in the eden.

Stack pages. The VM representation of the stack is a linked list of stack pages. By default, the number of stack pages is set to 50. Like the eden size, this piece of information is significant, as in some specific cases, 50 pages are not enough and could lead to an increase in execution time.

Code size. This one deals again with the Cog memory (you can refer to the same article). It refers to the size of the Cog method zone, i.e. the memory space containing trampolines/enilopmarts, Cog methods, CogFullBlocks, PICs, and a linked list of elements with young referents. 
This zone, when full, is compacted, which can have an impact on execution time, especially if compactions occur frequently. It is possible to modify the code size to reduce the number of compactions and speed up execution.
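A possible way to do so (a sketch using VM parameter 46, the current size of the machine code zone, and parameter 47, the desired size, as described in the post about the third part of the report; the new size is only taken into account after an image restart):

Smalltalk vm parameterAt: 47 put: (Smalltalk vm parameterAt: 46) * 2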

 1 to: 100000000 do: [ :i | Object new ] 
The block you profiled.

gc prior.  clear prior.
gc prior. A GC is performed just before the profiling starts, to make sure we start from a fresh environment (so we know that if a GC is performed during the profiling, it is actually caused by the execution of the block).
clear prior. Besides the GC, some other operations are performed before the profiling starts. These include making sure that the buffer containing the samples is cleared, and initializing the profiling time and the data structure (a Bag) that will later hold the samples.

1.409 seconds; sampling frequency 1326 hz
Total time of the profiling, and the sampling frequency, usually between 1.3 and 1.4 kHz.

1838 samples in the VM (1868 samples in the entire program)  98.39% of total
Total number of samples gathered during the profiling. A sample is basically an instruction pointer, pointing at the currently executed function. So we can simplistically say that a sample refers to a function. I will be more specific about that part in another post (soon).
A difference is made between the "samples in the VM" and the "samples in the entire program": indeed, some samples do not refer to functions in the VM, or are simply not identified. These account for the 30-sample difference you can notice above.
All the samples are then split into 3 categories:
  • the samples in generated vm code
  • the samples in vanilla vm code
  • the remaining samples

1757 samples in generated vm code 95.59% of entire vm (94.06% of total)
This is the total number of samples in generated vm code, that is to say all the samples referring to a function, trampoline, enilopmart... in the Cog method zone.

81 samples in vanilla vm code   4.41% of entire vm (  4.34% of total)
This is the total number of samples in vanilla vm code, that is to say all the samples referring to a function in the C code of the VM (typically, the interpreter and the garbage collector).

That's all for this part. If you spot any mistake or have any question, feel free to comment!

Wednesday, May 31, 2017

VM profiling in Pharo

The Squeak VMProfiler is now available headless in Pharo.
It allows you to profile an expression and get detailed statistics about where the time was spent on the VM side (JIT, interpreter, GC).

It is still in development, so feel free to use it and report any bugs. Suggestions and comments are welcome!

Loading the profiler


First, you need to load the project in a Pharo 6 image.
Make sure you are using one of the latest VMs, as the profiler uses plugins that are not in the older ones. You can download the latest VM + image using this script:


wget -O- get.pharo.org/60+vmLatest | bash


Then, you can either load the VMProfiler from the catalog browser (World Menu > Tools > Catalog browser) and install the stable version:

Or use the script below :
Gofer new
	url: 'http://www.smalltalkhub.com/mc/Pharo/MetaRepoForPharo60/main';
	package: 'ConfigurationOfVMProfiler';
	load.

((Smalltalk at: #ConfigurationOfVMProfiler) project version: #stable) load

Using the profiler


Profiling an expression


When the configuration is loaded, a workspace should open:

String streamContents: [ :s | VMProfiler headlessSpyOn: [ 1 to: 10000000 do: [ :i | Object new ] ] reportOn: s ]

It basically shows you how to use the profiler:
  • you spy on the expression you want to profile (here, the expression is [ 1 to: 10000000 do: [ :i | Object new ] ])
  • the profiling report is written on a stream (here, s) and returned as a string
  • you have to inspect that string to actually read the report
Try inspecting the example from the workspace: you will see the profiling report displayed on screen.
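If you prefer to keep the report around, you can also write it to a file instead of building a string (a sketch using Pharo's FileReference API; the file name is arbitrary, and reportOn: only needs a writable character stream):

'vmprofile.txt' asFileReference writeStreamDo: [ :s |
	VMProfiler headlessSpyOn: [ 1 to: 10000000 do: [ :i | Object new ] ] reportOn: s ]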

If you want to test it with your own expression, then just type:


String streamContents: [ :s | VMProfiler headlessSpyOn: [ myExpression ] reportOn: s ]

and inspect it. As the profiler is cadenced at ~1.3 kHz, make sure your expression runs for quite a long time to actually get a large number of samples, otherwise the results may not be significant.
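For instance, a short-running expression can be wrapped in a loop (the repeat count below is arbitrary, and myExpression is still a placeholder) so that enough samples are collected:

String streamContents: [ :s |
	VMProfiler headlessSpyOn: [ 1000 timesRepeat: [ myExpression ] ] reportOn: s ]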


Analyzing the report


The report is composed of different parts:

1- General information about the profiling
  • VM used and its path 
  • Memory-related data (size of the eden, number of stack pages, code size) 
  • Duration of the profiling, the sampling frequency 
  • Total number of samples (a sample = an instruction pointer) gathered during the profiling

2- The profiling report in itself

This part provides the computed profiling statistics, based on the number of samples and their locations.
It is subdivided into 2 parts:
  • the time spent in the methods generated by the JIT compiler.
For instance, the Behavior>>new method is the one in which the most time is spent. Indeed, out of 195 samples, 61 refer to this method.
  • the time spent in the C code of the VM (interpreter and GC)
For instance, the copyAndForward method is the one in which the most time is spent. Indeed, out of 17 samples, 6 refer to this method.


3- Statistics about the memory


This part deals with the state of the memory after the profiling. It helps detect potential memory leaks.


4- Statistics about the Garbage collection(s)

This part lists the operations performed by the garbage collector during the profiling. It can help identify where to tune the GC, especially when a large (and potentially abnormal) amount of time is spent in one of these operations.
 
  • full: the number of full GCs performed and their total execution time.
  • scavenges: the number of scavenges (eden to survivor area) performed and their cumulated execution time.
  • tenures: the number of tenures (promotions to old space) performed.
  • root table: the number of root table (old objects pointing at young objects) overflows.

5- Statistics about Compiled Code Compactions

This part deals with the number of compiled code compactions and their cumulated execution time. This kind of compaction occurs when the Cog method zone is full. A high (and time-consuming) number of compactions may indicate that you need to increase the Cog method zone size.


6- Statistics about Events

Like the previous parts, this part lists other events that can occur during the profiling and can help you identify if something is wrong.
For instance, the size or the number of stack pages can be modified if a high number of stack overflows is observed.

 

Next steps


As mentioned in my GSoC proposal, there are still plenty of things to do!

First, in the next few weeks, the profiling should become more detailed: the in-development JIT and its speculative inlining will require mapping the profiling samples to bytecode ranges to actually get significant profiles.

Then, I will add a UI to the Pharo VMProfiler.

Sunday, May 14, 2017

Accepted as a Google Summer of Code 2017 student !


It turns out I will be spending my summer coding for the Pharo Consortium this year, which is pretty good news!

I will work on improving the VMProfiler; here is a small abstract of my proposal:

Improving the VM Profiler

The current VM profiler is a sampling profiler cadenced at ~1.3 kHz, tracking down where the time is spent in the C code of the VM (for the interpreter and the GC) and in the machine code zone (for the code generated by the JIT). The VM profiler cannot track down precisely where the time is spent when executing the code generated by the JIT. It can track down in which methods the time is spent, but it cannot track down in which part of those methods the time is spent.

Goals

  • Port the VM Profiler to Pharo (it only works in Squeak for the moment)
By making sure the UI stays independent and by patching all the uses of the library OSProcess to make it work in both Squeak and Pharo
  • Get a more detailed profiling
By using the JIT API to map machine code instruction pointers to bytecode program counters, so as to show, in the end, in which bytecode range of a method the time is spent instead of only in which methods the time is spent.

Expect to find here some detailed posts related to profiling and the things I will work on!