
Software 3D Rendering in MAME
=============================

.. contents:: :local:


Background
----------

Starting in the late 1980s, many arcade games began incorporating hardware-rendered
3D graphics into their video. These 3D graphics are typically rendered from low-level
primitives into a frame buffer (usually double- or triple-buffered), then perhaps
combined with traditional tilemaps or sprites, before being presented to the player.

When it comes to emulating 3D games, there are two general approaches. The first
approach is to leverage modern 3D hardware by mapping the low-level primitives onto
modern equivalents. For a cross-platform emulator like MAME, this requires having an
API that is flexible enough to describe the primitives and all their associated
behaviors with high accuracy. It also requires the emulator to be able to read back
from the rendered frame buffer (since many games do this) and combine it with other
elements, in a way that is properly synchronized with background rendering.

The alternative approach is to render the low-level primitives directly in software.
This has the advantage of being able to achieve pretty much any behavior exhibited by
the original hardware, but at the cost of speed. In MAME, since all emulation happens
on one thread, this is particularly painful. However, just as with the 3D hardware
approach, in theory a software-based approach could be spun off to other threads to
handle the work, as long as mechanisms were present to synchronize when necessary,
for example, when reading from or writing to the frame buffer directly.

For the time being, MAME has opted for the second approach, leveraging a templated
helper class called **poly_manager** to handle common situations.


Concepts
--------

At its core, **poly_manager** is a mechanism to support multi-threaded rendering of
low-level 3D primitives. Callers provide **poly_manager** with a set of *vertices* for a
primitive plus a *render callback*. **poly_manager** breaks the primitive into
clipped scanline *extents* and distributes the work among a pool of *worker
threads*. The render callback is then called on the worker thread for each extent,
where game-specific logic can do whatever needs to happen to render the data.

One key responsibility that **poly_manager** takes care of is ensuring order. Given a
pool of threads and a number of work items to complete, it is important that—at least
within a given scanline—all work is performed serially in order. The basic approach is
to assign each extent to a *bucket* based on the Y coordinate. **poly_manager** then ensures
that only one worker thread at a time is responsible for processing work in a given bucket.
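
The bucket idea can be sketched in isolation. The code below is only an illustration: ``queued_extent`` and ``bucket_table`` are hypothetical names, and the modulo mapping from scanline to bucket is an assumption made for the sketch, not a description of how **poly_manager** actually chooses buckets.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a queued span; not a poly_manager type.
struct queued_extent
{
    int y;          // scanline the span belongs to
    int sequence;   // submission order, kept for illustration only
};

// Hypothetical bucket table: one ordered list of work per bucket.
class bucket_table
{
public:
    explicit bucket_table(std::size_t bucket_count) : m_buckets(bucket_count)
    {
        assert(bucket_count > 0);
    }

    // Map a scanline to a bucket (assumed mapping). Every span on the same
    // scanline lands in the same bucket, so a single worker that drains the
    // bucket sees those spans in submission order.
    std::size_t bucket_for(int y) const
    {
        assert(y >= 0);
        return static_cast<std::size_t>(y) % m_buckets.size();
    }

    // Append a span to the bucket that owns its scanline.
    void submit(queued_extent const &extent)
    {
        m_buckets[bucket_for(extent.y)].push_back(extent);
    }

    std::vector<queued_extent> const &bucket(std::size_t index) const
    {
        return m_buckets[index];
    }

private:
    std::vector<std::vector<queued_extent>> m_buckets;
};
```

Because each bucket is an ordered list that only one worker drains at a time, two spans on the same scanline can never be rendered out of order, while spans in different buckets remain free to run in parallel.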

Vertices in **poly_manager** consist of simple 2D X and Y *coordinates*, plus zero or
more additional *iterated parameters*. These iterated parameters can be anything: intensity
values for lighting; RGB(A) colors for Gouraud shading; normalized U, V coordinates for
texture mapping; 1/Z values for Z buffering; etc. Iterated parameters, regardless of what
they represent, are interpolated linearly across the primitive in screen space and provided
as part of the extent to the render callback.


ObjectType
~~~~~~~~~~

When creating a **poly_manager** class, you must provide it a special type that you define,
known as **ObjectType**.

Because rendering happens asynchronously on worker threads, the idea is that the
**ObjectType** class will hold a snapshot of all the relevant data needed for rendering.
This allows the main thread to proceed—potentially modifying some of the relevant state—while
rendering happens elsewhere.

In theory, we could allocate a new **ObjectType** class for each primitive rendered;
however, that would be rather inefficient. It is quite common to set up the rendering
state and then render several primitives using the same state.

For this reason, **poly_manager** maintains an internal array of **ObjectType** objects and
keeps a copy of the last **ObjectType** used. Before submitting a new primitive, callers
can check whether the rendering state has changed. If it has, they can ask **poly_manager** to allocate
a new **ObjectType** class and fill it in. When the primitive is submitted for rendering, the
most recently allocated **ObjectType** instance is implicitly captured and provided to the
render callbacks.
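
As a rough illustration of this "allocate only when the state changes" pattern, here is a self-contained sketch. ``render_state`` and ``state_cache`` are hypothetical stand-ins for an **ObjectType** and for the array that **poly_manager** owns; the real class exposes this through ``object_data().next()`` rather than the ``acquire()`` helper assumed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical render state snapshot, playing the role of ObjectType.
struct render_state
{
    std::uint32_t color = 0;
    std::uint32_t texture_base = 0;

    bool operator==(render_state const &other) const
    {
        return color == other.color && texture_base == other.texture_base;
    }
};

// Hypothetical stand-in for the object array owned by poly_manager.
class state_cache
{
public:
    // Return the snapshot to use for the next primitive, allocating a new
    // one only when the requested state differs from the last one used.
    render_state const &acquire(render_state const &wanted)
    {
        if (m_objects.empty() || !(m_objects.back() == wanted))
            m_objects.push_back(wanted);
        return m_objects.back();
    }

    std::size_t allocated() const { return m_objects.size(); }

private:
    // std::deque keeps references to earlier snapshots valid when a new
    // snapshot is appended, which matters because queued primitives still
    // point at the older ones.
    std::deque<render_state> m_objects;
};
```

Submitting several primitives with the same state reuses a single snapshot, while a state change costs exactly one new allocation.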

For more complex scenarios, where data might change even more infrequently, there is a
**poly_array** template, which can be used to manage data in a similar way. In fact,
internally **poly_manager** uses the **poly_array** class to manage its **ObjectType**
allocations. More information on the **poly_array** class is provided later.


Primitives
~~~~~~~~~~

**poly_manager** supports several different types of primitives:

* The most commonly-used primitive in **poly_manager** is the *triangle*, which has the
  nice property that iterated parameters have constant deltas across the full surface.
  Arbitrary-length *triangle fans* and *triangle strips* are also supported.

* In addition to triangles, **poly_manager** also supports *polygons* with an arbitrary
  number of vertices. The list of vertices is expected to be in either clockwise or
  anticlockwise order. **poly_manager** will walk the edges to compute deltas across
  each extent.

* As a special case, **poly_manager** supports a *tile* primitive, which is a simple quad
  defined by two vertices, a top-left vertex and a bottom-right vertex. Like triangles,
  tiles have constant iterated parameter deltas across their surface.

* Finally, **poly_manager** supports a fully custom mechanism where the caller provides
  a list of extents that are more or less fed directly to the worker threads.
  This is useful if emulating a system that has unusual primitives or requires highly
  specific behaviors for its edges.


Synchronization
~~~~~~~~~~~~~~~

One of the key requirements of providing an asynchronous rendering mechanism is
synchronization. Synchronization in **poly_manager** is super simple: just
call the ``wait()`` function.

There are several common reasons for issuing a wait:

* At display time, the pixel data must be copied to the screen. If any primitives were
  queued which touch the portion of the display that is going to be shown, you need to
  wait for rendering to be complete before copying. Note that this wait may not be
  strictly necessary in some situations (for example, a triple-buffered system).

* If the emulated system has a mechanism to read back from the framebuffer after
  rendering, then a wait must be issued prior to the read in order to ensure that
  asynchronous rendering is complete.

* If the emulated system modifies any state that is not cached in the **ObjectType**
  or elsewhere (for example, texture memory), then a wait must be issued to ensure
  that pending primitives which might consume that state have finished their work.

* If the emulated system can use a previous render target as, say, the texture source
  for a new primitive, then submitting the second primitive must wait until the first
  completes. **poly_manager** provides no internal mechanism to help detect this, so it
  is on the caller to determine when or if this is necessary.

Because the wait operation knows after it is done that all rendering is complete,
**poly_manager** also takes this opportunity to reclaim all memory allocated for its
internal structures, as well as memory allocated for **ObjectType** structures. Thus it is
important that you don’t hang onto any **ObjectType** pointers after a wait is called.
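
The read-back case can be illustrated with a toy model. ``deferred_renderer`` below is entirely hypothetical: it queues work and only executes it inside its own ``wait()``, which is enough to show why an unsynchronized read returns stale data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical toy renderer: work is queued and only runs inside wait().
class deferred_renderer
{
public:
    explicit deferred_renderer(std::size_t pixels) : m_framebuffer(pixels, 0) { }

    // Queue a pixel write; nothing touches the frame buffer yet.
    void queue_fill(std::size_t index, std::uint32_t color)
    {
        assert(index < m_framebuffer.size());
        m_pending.push_back([this, index, color]() { m_framebuffer[index] = color; });
    }

    // Run everything that is outstanding, then drop the queue.
    void wait()
    {
        for (auto const &work : m_pending)
            work();
        m_pending.clear();
    }

    // Emulated read-back register: synchronize first, then read.
    std::uint32_t read_pixel(std::size_t index)
    {
        wait();
        return m_framebuffer[index];
    }

    // Raw peek with no synchronization, to show the stale value.
    std::uint32_t peek_unsynchronized(std::size_t index) const
    {
        return m_framebuffer[index];
    }

    std::size_t pending() const { return m_pending.size(); }

private:
    std::vector<std::uint32_t> m_framebuffer;
    std::vector<std::function<void()>> m_pending;
};
```

In a real driver the equivalent of ``read_pixel()`` would be the handler for the emulated read-back register, and the call to ``wait()`` would be the one inherited from **poly_manager**.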


The poly_manager class
----------------------

In most applications, **poly_manager** is not used directly, but rather serves as
the base class for a more complete rendering class. The **poly_manager** class
itself is a template::

    template<typename BaseType, class ObjectType, int MaxParams, u8 Flags = 0>
    class poly_manager;

and the template parameters are:

* **BaseType** is the type used internally for coordinates and iterated parameters, and
  should generally be either ``float`` or ``double``. In theory, a fixed-point integral
  type could also be used, though the math logic has not been designed for that, so you
  may encounter problems.

* **ObjectType** is the user-defined per-object data structure described above.
  Internally, **poly_manager** will manage a **poly_array** of these, and a pointer to
  the most-recently allocated one at the time a primitive is submitted will be implicitly
  passed to the render callback for each corresponding extent.

* **MaxParams** is the maximum number of iterated parameters that may be specified in a
  vertex. Iterated parameters are generic and treated identically, so the mapping of
  parameter indices is completely up to the contract between the caller and the render
  callback. It is permitted for **MaxParams** to be 0.

* **Flags** is zero or more of the following flags:

  - POLY_FLAG_NO_WORK_QUEUE — specify this flag to disable asynchronous rendering; this
    can be useful for debugging. When this option is enabled, all primitives are queued
    and then processed in order on the calling thread when ``wait()`` is called on the
    **poly_manager** class.

  - POLY_FLAG_NO_CLIPPING — specify this if you want **poly_manager** to skip its
    internal clipping. Use this if your render callbacks do their own clipping, or if
    the caller always handles clipping prior to submitting primitives.


Types & Constants
~~~~~~~~~~~~~~~~~

vertex_t
++++++++

Within the **poly_manager** class, you’ll find a **vertex_t** type that describes a
single vertex. All primitive drawing methods accept 2 or more of these **vertex_t**
objects. The **vertex_t** includes the X and Y coordinates along with an array of
iterated parameter values at that vertex::

    struct vertex_t
    {
        vertex_t() { }
        vertex_t(BaseType _x, BaseType _y) { x = _x; y = _y; }

        BaseType x, y;                          // X, Y coordinates
        std::array<BaseType, MaxParams> p;      // iterated parameters
    };

Note that **vertex_t** itself is defined in terms of the **BaseType** and **MaxParams**
template values of the owning **poly_manager** class.

All of **poly_manager**’s primitives operate in screen space, where (0,0) represents the
top-left corner of the top-left pixel, and (0.5,0.5) represents the center of that pixel.
Left and top pixel values are inclusive, while right and bottom pixel values are exclusive.

Thus, a *tile* rendered from (2,2)-(4,3) will completely cover 2 pixels: (2,2) and (3,2).

When calling a primitive drawing method, the iterated parameter array **p** need not be
completely filled out. The number of valid iterated parameter values is specified as a
template parameter to the primitive drawing methods, so only that many parameters need to
actually be populated in the **vertex_t** structures that are passed in.
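
The coverage rule described above can be written down directly. This is a sketch of the stated convention only, with hypothetical names (``pixel_range``, ``covered_pixels``); it is not taken from the **poly_manager** source, which may round differently in edge cases.

```cpp
#include <cassert>
#include <cmath>

// Half-open range of covered pixel indices along one axis.
struct pixel_range
{
    int start;  // first covered pixel (inclusive)
    int stop;   // one past the last covered pixel (exclusive)
};

// A pixel i is covered when its center (i + 0.5) lies in [low, high).
inline pixel_range covered_pixels(float low, float high)
{
    assert(low <= high);
    int const start = static_cast<int>(std::ceil(low - 0.5f));
    int const stop = static_cast<int>(std::ceil(high - 0.5f));
    return pixel_range{ start, stop };
}
```

Applied to the tile from (2,2) to (4,3), the X axis yields pixels 2 and 3 and the Y axis yields only row 2, matching the two pixels listed above. A span such as 2.6 to 3.4 contains no pixel center and so produces an empty range.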


extent_t
++++++++

**poly_manager** breaks primitives into extents, which are contiguous horizontal spans
contained within a single scanline. These extents are then distributed to worker threads,
who will call the render callback with information on how to render each extent. The
**extent_t** type describes one such extent, providing the bounding X coordinates along with
an array of iterated parameter start values and deltas across the span::

    struct extent_t
    {
        struct param_t
        {
            BaseType start;                     // parameter value at start
            BaseType dpdx;                      // dp/dx relative to start
        };
        int16_t startx, stopx;                  // starting (inclusive)/ending (exclusive) endpoints
        std::array<param_t, MaxParams> param;   // array of parameter start/deltas
        void *userdata;                         // custom per-span data
    };

For each iterated parameter, the **start** value contains the value at the left side of
the span. The **dpdx** value contains the change of the parameter’s value per X coordinate.

There is also a **userdata** field in the **extent_t** structure, which is not normally used,
except when performing custom rendering.
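
To make the meaning of **start** and **dpdx** concrete, the following sketch evaluates a parameter at an arbitrary pixel of a span. ``sketch_extent`` is a simplified stand-in that fixes the base type to ``float`` and the parameter count to 2 so that it compiles on its own; ``param_at`` is a hypothetical helper, not part of **poly_manager**.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Simplified stand-in mirroring the layout shown above.
struct sketch_extent
{
    struct param_t
    {
        float start;    // parameter value at startx
        float dpdx;     // change in value per one-pixel step in X
    };
    std::int16_t startx, stopx;     // inclusive start, exclusive stop
    std::array<param_t, 2> param;
};

// Value of one iterated parameter at pixel x within the span.
inline float param_at(sketch_extent const &extent, int index, int x)
{
    assert(x >= extent.startx && x < extent.stopx);
    return extent.param[index].start + extent.param[index].dpdx * float(x - extent.startx);
}
```

Render callbacks normally avoid the multiply by keeping a running value and adding **dpdx** once per pixel, which produces the same sequence of values up to floating-point rounding.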


render_delegate
+++++++++++++++

When rendering a primitive, in addition to the vertices, you must also provide a
**render_delegate** callback of the form::

  void render(int32_t y, extent_t const &extent, ObjectType const &object, int threadid)

This callback is responsible for the actual rendering. It will be called at a later time,
likely on a different thread, for each extent. The parameters passed are:

* **y** is the Y coordinate (scanline) of the current extent.

* **extent** is a reference to an **extent_t** structure, described above, which specifies for
  this extent the start/stop X values along with the start/delta values for each iterated
  parameter.

* **object** is a reference to the most recently allocated **ObjectType** at the time the
  primitive was submitted for rendering; in theory it should contain most if not all of the
  necessary data to perform rendering.

* **threadid** is a unique ID indicating the index of the thread you’re running on; this value
  is useful if you are keeping any kind of statistics and don’t want to add contention over
  shared values. In this situation, you can allocate **WORK_MAX_THREADS** instances of your
  data and update the instance for the **threadid** you are passed. When you want to display
  the statistics, the main thread can accumulate and reset the data from all threads when it’s
  safe to do so (e.g., after a wait).
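
The per-thread statistics pattern mentioned for **threadid** might look like the sketch below. ``SKETCH_MAX_THREADS`` is a hypothetical constant standing in for **WORK_MAX_THREADS** (its value here is an assumption), and ``pixel_stats`` is an illustrative helper rather than anything **poly_manager** provides.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical upper bound on worker threads; stands in for WORK_MAX_THREADS.
constexpr int SKETCH_MAX_THREADS = 16;

// One counter block per thread, aligned to reduce false sharing.
struct alignas(64) thread_stats
{
    std::uint64_t pixels = 0;
};

class pixel_stats
{
public:
    // Called from a render callback: only touches the caller's own slot,
    // so no locking or atomics are needed.
    void add_pixels(int threadid, std::uint32_t count)
    {
        assert(threadid >= 0 && threadid < SKETCH_MAX_THREADS);
        m_stats[threadid].pixels += count;
    }

    // Called from the main thread once rendering is idle (after a wait).
    std::uint64_t collect_and_reset()
    {
        std::uint64_t total = 0;
        for (thread_stats &stats : m_stats)
        {
            total += stats.pixels;
            stats.pixels = 0;
        }
        return total;
    }

private:
    std::array<thread_stats, SKETCH_MAX_THREADS> m_stats{};
};
```

A render callback would call something like ``add_pixels(threadid, extent.stopx - extent.startx)``, and the main thread would call ``collect_and_reset()`` only after ``wait()`` has returned.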


Methods
~~~~~~~

poly_manager
++++++++++++
::

    poly_manager(running_machine &machine);

The **poly_manager** constructor takes just one parameter, a reference to the
**running_machine**. This grants **poly_manager** access to the work queues needed for
multithreaded running.

wait
++++
::

    void wait(char const *debug_reason = "general");

Calling ``wait()`` stalls the calling thread until all outstanding rendering is complete:

* **debug_reason** is an optional parameter specifying the reason for the wait. It is
  useful if the compile-time constant **TRACK_POLY_WAITS** is enabled, as it will print a
  summary of wait times and reasons at the end of execution.

**Return value:** none.

object_data
+++++++++++
::

    objectdata_array &object_data();

This method just returns a reference to the internally-maintained **poly_array** of the
**ObjectType** you specified when creating **poly_manager**. For most applications, the
only interesting thing to do with this object is call the ``next()`` method to allocate
a new object to fill out.

**Return value:** reference to a **poly_array** of **ObjectType**.

register_poly_array
+++++++++++++++++++
::

    void register_poly_array(poly_array_base &array);

For advanced applications, you may choose to create your own **poly_array** objects to
manage large chunks of infrequently-changed data, such as palettes. After each ``wait()``,
**poly_manager** resets all the **poly_array** objects it knows about in order to reclaim all
outstanding allocated memory. By registering your **poly_array** objects here, you can ensure
that your arrays will also be reset after a ``wait()`` call.

**Return value:** none.

render_tile
+++++++++++
::

    template<int ParamCount>
    uint32_t render_tile(rectangle const &cliprect, render_delegate callback,
                         vertex_t const &v1, vertex_t const &v2);

This method enqueues a single *tile* primitive for rendering:

* **ParamCount** is the number of live values in the iterated parameter array within each
  **vertex_t** provided; it must be no greater than the **MaxParams** value specified in the
  **poly_manager** template instantiation.

* **cliprect** is a reference to a clipping rectangle. All pixels and parameter values are
  clipped to stay within these bounds before being added to the work queues for rendering,
  unless **POLY_FLAG_NO_CLIPPING** was specified as a flag parameter to **poly_manager**.

* **callback** is the render callback delegate that will be called to render each extent.

* **v1** contains the coordinates and iterated parameters for the top-left corner of the tile.

* **v2** contains the coordinates and iterated parameters for the bottom-right corner of the tile.

**Return value:** the total number of clipped pixels represented by the enqueued extents.

render_triangle
+++++++++++++++
::

    template<int ParamCount>
    uint32_t render_triangle(rectangle const &cliprect, render_delegate callback,
                             vertex_t const &v1, vertex_t const &v2, vertex_t const &v3);

This method enqueues a single *triangle* primitive for rendering:

* **ParamCount** is the number of live values in the iterated parameter array within each
  **vertex_t** provided; it must be no greater than the **MaxParams** value specified in the
  **poly_manager** template instantiation.

* **cliprect** is a reference to a clipping rectangle. All pixels and parameter values are
  clipped to stay within these bounds before being added to the work queues for rendering,
  unless **POLY_FLAG_NO_CLIPPING** was specified as a flag parameter to **poly_manager**.

* **callback** is the render callback delegate that will be called to render each extent.

* **v1**, **v2**, **v3** contain the coordinates and iterated parameters for each vertex
  of the triangle.

**Return value:** the total number of clipped pixels represented by the enqueued extents.

render_triangle_fan
+++++++++++++++++++
::

    template<int ParamCount>
    uint32_t render_triangle_fan(rectangle const &cliprect, render_delegate callback,
                                 int numverts, vertex_t const *v);

This method enqueues one or more *triangle* primitives for rendering, specified in fan order:

* **ParamCount** is the number of live values in the iterated parameter array within each
  **vertex_t** provided; it must be no greater than the **MaxParams** value specified in the
  **poly_manager** template instantiation.

* **cliprect** is a reference to a clipping rectangle. All pixels and parameter values are
  clipped to stay within these bounds before being added to the work queues for rendering,
  unless **POLY_FLAG_NO_CLIPPING** was specified as a flag parameter to **poly_manager**.

* **callback** is the render callback delegate that will be called to render each extent.

* **numverts** is the total number of vertices provided; it must be at least 3.

* **v** is a pointer to an array of **vertex_t** objects containing the coordinates and iterated
  parameters for all the triangles, in fan order. This means that the first vertex is fixed.
  So if 5 vertices are provided, indicating 3 triangles, the vertices used will be:
  (0,1,2) (0,2,3) (0,3,4)

**Return value:** the total number of clipped pixels represented by the enqueued extents.
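
The fan ordering described above can be expressed as a small index generator. ``fan_triangles`` is a hypothetical helper written for illustration; it only reproduces the vertex ordering listed in this section and does no rendering.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Expand a fan of numverts vertices into index triples (0, i, i + 1).
inline std::vector<std::array<int, 3>> fan_triangles(int numverts)
{
    assert(numverts >= 3);
    std::vector<std::array<int, 3>> result;
    for (int i = 1; i + 1 < numverts; i++)
        result.push_back(std::array<int, 3>{ { 0, i, i + 1 } });
    return result;
}
```

For 5 vertices this yields the three triples (0,1,2), (0,2,3) and (0,3,4); in general a fan of N vertices produces N - 2 triangles.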

render_triangle_strip
+++++++++++++++++++++
::

    template<int ParamCount>
    uint32_t render_triangle_strip(rectangle const &cliprect, render_delegate callback,
                                   int numverts, vertex_t const *v);

This method enqueues one or more *triangle* primitives for rendering, specified in strip order:

* **ParamCount** is the number of live values in the iterated parameter array within each
  **vertex_t** provided; it must be no greater than the **MaxParams** value specified in the
  **poly_manager** template instantiation.

* **cliprect** is a reference to a clipping rectangle. All pixels and parameter values are
  clipped to stay within these bounds before being added to the work queues for rendering,
  unless **POLY_FLAG_NO_CLIPPING** was specified as a flag parameter to **poly_manager**.

* **callback** is the render callback delegate that will be called to render each extent.

* **numverts** is the total number of vertices provided; it must be at least 3.

* **v** is a pointer to an array of **vertex_t** objects containing the coordinates and iterated
  parameters for all the triangles, in strip order.
  So if 5 vertices are provided, indicating 3 triangles, the vertices used will be:
  (0,1,2) (1,2,3) (2,3,4)

**Return value:** the total number of clipped pixels represented by the enqueued extents.
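
The strip ordering can be sketched the same way. ``strip_triangles`` is a hypothetical helper that follows the vertex ordering listed in this section exactly; it makes no attempt to alternate winding between consecutive triangles.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Expand a strip of numverts vertices into index triples (i, i + 1, i + 2).
inline std::vector<std::array<int, 3>> strip_triangles(int numverts)
{
    assert(numverts >= 3);
    std::vector<std::array<int, 3>> result;
    for (int i = 0; i + 2 < numverts; i++)
        result.push_back(std::array<int, 3>{ { i, i + 1, i + 2 } });
    return result;
}
```

For 5 vertices this yields (0,1,2), (1,2,3) and (2,3,4). Each triangle after the first reuses the previous two vertices, so a strip of N vertices produces N - 2 triangles.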

render_polygon
++++++++++++++
::

    template<int NumVerts, int ParamCount>
    uint32_t render_polygon(rectangle const &cliprect, render_delegate callback, vertex_t const *v);

This method enqueues a single *polygon* primitive for rendering:

* **NumVerts** is the number of vertices in the polygon.

* **ParamCount** is the number of live values in the iterated parameter array within each
  **vertex_t** provided; it must be no greater than the **MaxParams** value specified in the
  **poly_manager** template instantiation.

* **cliprect** is a reference to a clipping rectangle. All pixels and parameter values are
  clipped to stay within these bounds before being added to the work queues for rendering,
  unless **POLY_FLAG_NO_CLIPPING** was specified as a flag parameter to **poly_manager**.

* **callback** is the render callback delegate that will be called to render each extent.

* **v** is a pointer to an array of **vertex_t** objects containing the coordinates and iterated
  parameters for the polygon. Vertices are assumed to be in either clockwise or anticlockwise
  order.

**Return value:** the total number of clipped pixels represented by the enqueued extents.

render_extents
++++++++++++++
::

    template<int ParamCount>
    uint32_t render_extents(rectangle const &cliprect, render_delegate callback,
                            int startscanline, int numscanlines, extent_t const *extents);

This method enqueues custom extents directly:

* **ParamCount** is the number of live values in the iterated parameter array within each
  **extent_t** provided; it must be no greater than the **MaxParams** value specified in the
  **poly_manager** template instantiation.

* **cliprect** is a reference to a clipping rectangle. All pixels and parameter values are
  clipped to stay within these bounds before being added to the work queues for rendering,
  unless **POLY_FLAG_NO_CLIPPING** was specified as a flag parameter to **poly_manager**.

* **callback** is the render callback delegate that will be called to render each extent.

* **startscanline** is the Y coordinate of the first extent provided.

* **numscanlines** is the number of extents provided.

* **extents** is a pointer to an array of **extent_t** objects containing the start/stop
  X coordinates and iterated parameters. The **userdata** field of the source extents is
  copied to the target as well (this field is otherwise unused for all other types of
  rendering).

**Return value:** the total number of clipped pixels represented by the enqueued extents.

zclip_if_less
+++++++++++++
::

    template<int ParamCount>
    int zclip_if_less(int numverts, vertex_t const *v, vertex_t *outv, BaseType clipval);

This method is a helper method to clip a polygon against a provided Z value. It assumes
that the first iterated parameter in **vertex_t** represents the Z coordinate. If any edge
crosses the Z plane represented by **clipval** that edge is clipped.

* **ParamCount** is the number of live values in the iterated parameter array within each
  **vertex_t** provided; it must be no greater than the **MaxParams** value specified in the
  **poly_manager** template instantiation.

* **numverts** is the number of vertices in the input array.

* **v** is a pointer to the input array of **vertex_t** objects.

* **outv** is a pointer to the output array of **vertex_t** objects. **v** and **outv**
  cannot overlap or point to the same memory.

* **clipval** is the value to compare parameter 0 against for clipping.

**Return value:** the number of output vertices written to **outv**.
Note that by design it is possible for this method to produce more vertices than the
input array, so callers should ensure there is enough room in the output buffer to
accommodate this.
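
To show the kind of clipping this helper performs, here is a standalone sketch of clipping a polygon against a single plane, using parameter 0 as Z. It is written from the description above rather than copied from **poly_manager**; ``sketch_vertex``, ``lerp_vertex`` and ``clip_if_less`` are hypothetical names, and the base type and parameter count are fixed for simplicity.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Simplified stand-in for vertex_t: float coordinates and two parameters,
// with p[0] holding Z as the text above assumes.
struct sketch_vertex
{
    float x, y;
    std::array<float, 2> p;
};

// Linear blend between two vertices: frac = 0 gives a, frac = 1 gives b.
inline sketch_vertex lerp_vertex(sketch_vertex const &a, sketch_vertex const &b, float frac)
{
    sketch_vertex out;
    out.x = a.x + frac * (b.x - a.x);
    out.y = a.y + frac * (b.y - a.y);
    for (std::size_t i = 0; i < out.p.size(); i++)
        out.p[i] = a.p[i] + frac * (b.p[i] - a.p[i]);
    return out;
}

// Keep the part of the polygon where p[0] >= clipval and return the number
// of vertices written to outv. A buffer of 2 * numverts entries is always
// large enough for this sketch.
inline int clip_if_less(int numverts, sketch_vertex const *v, sketch_vertex *outv, float clipval)
{
    assert(numverts >= 3 && v != outv);
    sketch_vertex *next = outv;
    int prev = numverts - 1;
    for (int cur = 0; cur < numverts; prev = cur++)
    {
        bool const prev_clipped = v[prev].p[0] < clipval;
        bool const cur_clipped = v[cur].p[0] < clipval;

        // the edge crosses the plane: emit the intersection point
        if (prev_clipped != cur_clipped)
        {
            float const frac = (clipval - v[prev].p[0]) / (v[cur].p[0] - v[prev].p[0]);
            *next++ = lerp_vertex(v[prev], v[cur], frac);
        }

        // vertices on the visible side pass through unchanged
        if (!cur_clipped)
            *next++ = v[cur];
    }
    return int(next - outv);
}
```

Clipping a triangle with one vertex behind the plane produces a quad: the hidden vertex is replaced by the two points where its edges cross the plane, which is why the output can be larger than the input.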


Example Renderer
----------------

Here is a complete example of how to create a software 3D renderer using **poly_manager**.
Our example renderer will only handle flat and Gouraud-shaded triangles with depth (Z)
buffering.


Types
~~~~~

The first thing we need to define is our *externally-visible* vertex format, which is distinct
from the internal **vertex_t** that **poly_manager** will define. In theory you could
use **vertex_t** directly, but the generic nature of **poly_manager**’s iterated parameters
makes it awkward::

    struct example_vertex
    {
        float x, y, z;      // X,Y,Z coordinates
        rgb_t color;        // color at this vertex
    };

Next we define the **ObjectType** needed by **poly_manager**. For our simple case, we
define an **example_object_data** struct that consists of pointers to our rendering buffers,
plus a couple of fixed values that are consumed in some cases. More complex renderers would
typically have many more object-wide parameters defined here::

    struct example_object_data
    {
        bitmap_rgb32 *dest;    // pointer to the rendering bitmap
        bitmap_ind16 *depth;   // pointer to the depth bitmap
        rgb_t color;           // overall color (for clearing and flat shaded case)
        uint16_t depthval;     // fixed depth value (for clearing)
    };

Now it’s time to define our renderer class, which we derive from **poly_manager**. As
template parameters we specify ``float`` as the base type for our data, since that will
be enough accuracy for this example, and we also provide our **example_object_data** as
the **ObjectType** class, plus the maximum number of iterated parameters our renderer
will ever need (4 in this case)::

    class example_renderer : public poly_manager<float, example_object_data, 4>
    {
    public:
        example_renderer(running_machine &machine, uint32_t width, uint32_t height);

        bitmap_rgb32 *swap_buffers();

        void clear_buffers(rgb_t color, uint16_t depthval);
        void draw_triangle(example_vertex const *verts);

    private:
        static uint16_t ooz_to_depthval(float ooz);

        void draw_triangle_flat(example_vertex const *verts);
        void draw_triangle_gouraud(example_vertex const *verts);

        void render_clear(int32_t y, extent_t const &extent, example_object_data const &object, int threadid);
        void render_flat(int32_t y, extent_t const &extent, example_object_data const &object, int threadid);
        void render_gouraud(int32_t y, extent_t const &extent, example_object_data const &object, int threadid);

        int m_draw_buffer;
        bitmap_rgb32 m_display[2];
        bitmap_ind16 m_depth;
    };


Constructor
~~~~~~~~~~~

The constructor for our example renderer just initializes **poly_manager** and allocates
the rendering and depth buffers::

    example_renderer::example_renderer(running_machine &machine, uint32_t width, uint32_t height) :
        poly_manager(machine),
        m_draw_buffer(0)
    {
        // allocate two display buffers and a depth buffer
        m_display[0].allocate(width, height);
        m_display[1].allocate(width, height);
        m_depth.allocate(width, height);
    }


swap_buffers
~~~~~~~~~~~~

The first interesting method in our renderer is ``swap_buffers()``, which returns a pointer to
the buffer we’ve been drawing to, and sets up the other buffer as the new drawing target. The
idea is that the display update handler will call this method to get ahold of the bitmap to
display to the user::

    bitmap_rgb32 *example_renderer::swap_buffers()
    {
        // wait for any rendering to complete before returning the buffer
        wait("swap_buffers");

        // return the current draw buffer and then switch to the other
        // for future drawing
        bitmap_rgb32 *result = &m_display[m_draw_buffer];
        m_draw_buffer ^= 1;
        return result;
    }

The most important thing to note here is the call to **poly_manager**’s ``wait()``, which
will block the current thread until all rendering is complete. This is important because
otherwise the caller may receive a bitmap that is still being drawn to, leading to torn
or corrupt visuals.


clear_buffers
~~~~~~~~~~~~~

One of the most common operations to perform when doing 3D rendering is to initialize or
clear the display and depth buffers to a known value. The method below leverages
the *tile* primitive to render a rectangle over the screen by passing in (0,0) and (width,height)
for the two vertices.

Because the color and depth values to clear the buffer to are constant, they are stored in
a freshly-allocated **example_object_data** object, along with a pointer to the buffers in
question. The ``render_tile()`` call is made with a ``<0>`` suffix indicating that there are
no iterated parameters to worry about::

    void example_renderer::clear_buffers(rgb_t color, uint16_t depthval)
    {
        // allocate object data and populate it with information needed
        example_object_data &object = object_data().next();
        object.dest = &m_display[m_draw_buffer];
        object.depth = &m_depth;
        object.color = color;
        object.depthval = depthval;

        // top,left coordinate is always (0,0)
        vertex_t topleft;
        topleft.x = 0;
        topleft.y = 0;

        // bottom,right coordinate is (width,height)
        vertex_t botright;
        botright.x = m_display[0].width();
        botright.y = m_display[0].height();

        // render as a tile with 0 iterated parameters
        render_tile<0>(m_display[0].cliprect(),
                       render_delegate(&example_renderer::render_clear, this),
                       topleft, botright);
    }

The render callback provided to ``render_tile()`` is also defined (privately) in our class,
and handles a single span. Note how the rendering parameters are extracted from the
**example_object_data** struct provided::

    void example_renderer::render_clear(int32_t y, extent_t const &extent, example_object_data const &object, int threadid)
    {
        // get pointers to the start of the depth buffer and destination scanlines
        uint16_t *depth = &object.depth->pix(y);
        uint32_t *dest = &object.dest->pix(y);

        // loop over the full extent and just store the constant values from the object
        for (int x = extent.startx; x < extent.stopx; x++)
        {
            dest[x] = object.color;
            depth[x] = object.depthval;
        }
    }

Another important point to make is that the X coordinates provided by the extent struct are
inclusive of startx but exclusive of stopx. Clipping is performed ahead of time so that
the render callback can focus on laying down pixels as quickly as possible with minimal
overhead.


draw_triangle
~~~~~~~~~~~~~

Next up, we have our actual triangle rendering function, which will draw a single triangle
given an array of three vertices provided in the external **example_vertex** format::

    void example_renderer::draw_triangle(example_vertex const *verts)
    {
        // flat shaded case
        if (verts[0].color == verts[1].color && verts[0].color == verts[2].color)
            draw_triangle_flat(verts);
        else
            draw_triangle_gouraud(verts);
    }

Because it is simpler and faster to render a flat shaded triangle, the code checks to see
if the colors are the same on all three vertices. If they are, we call through to a special
flat-shaded case; otherwise, we process it as a full Gouraud-shaded triangle.

This is a common technique to optimize rendering performance: identify special cases that
reduce the per-pixel work, and route them to separate render callbacks that are optimized
for that special case.


draw_triangle_flat
~~~~~~~~~~~~~~~~~~

Here’s the setup code for rendering a flat-shaded triangle::

    void example_renderer::draw_triangle_flat(example_vertex const *verts)
    {
        // allocate object data and populate it with information needed
        example_object_data &object = object_data().next();
        object.dest = &m_display[m_draw_buffer];
        object.depth = &m_depth;

        // in this case the color is constant and specified in the object data
        object.color = verts[0].color;

        // copy X, Y, and 1/Z into poly_manager vertices
        vertex_t v[3];
        for (int vertnum = 0; vertnum < 3; vertnum++)
        {
            v[vertnum].x = verts[vertnum].x;
            v[vertnum].y = verts[vertnum].y;
            v[vertnum].p[0] = 1.0f / verts[vertnum].z;
        }

        // render the triangle with 1 iterated parameter (1/Z)
        render_triangle<1>(m_display[0].cliprect(),
                            render_delegate(&example_renderer::render_flat, this),
                            v[0], v[1], v[2]);
    }

First, we put the fixed color into the **example_object_data** directly, and then fill
out three **vertex_t** objects with the X and Y coordinates in the usual spot, and 1/Z
as our one and only iterated parameter. (We use 1/Z here because iterated parameters are
interpolated linearly in screen space. Z does not vary linearly in screen space, but 1/Z
does, as a consequence of perspective projection.)
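
To make the 1/Z point concrete, the following sketch compares the two interpolation
choices against the exact answer for a single line segment. It assumes a simple
projection of the form ``xs = x / z``; all function names here are hypothetical and
exist only for this illustration::

    // linear blend between a and b, with f running from 0 to 1
    inline float mix(float a, float b, float f)
    {
        return a + (b - a) * f;
    }

    // exact view-space Z of the point on the segment (x0,z0)-(x1,z1) that
    // projects to screen position xs, assuming the projection xs = x / z
    inline float true_z_at(float x0, float z0, float x1, float z1, float xs)
    {
        // solve (x0 + t * (x1 - x0)) / (z0 + t * (z1 - z0)) == xs for t
        float const t = (xs * z0 - x0) / ((x1 - x0) - xs * (z1 - z0));
        return z0 + t * (z1 - z0);
    }

    // Z recovered by interpolating 1/Z linearly across the screen
    inline float z_via_ooz(float z0, float z1, float f)
    {
        return 1.0f / mix(1.0f / z0, 1.0f / z1, f);
    }

    // Z obtained by interpolating Z itself linearly across the screen
    inline float z_via_z(float z0, float z1, float f)
    {
        return mix(z0, z1, f);
    }

For endpoints (x=0, z=1) and (x=4, z=4), the screen positions are 0 and 1. Halfway
between them on screen, the exact depth is about 1.6, which is what interpolating 1/Z
recovers, whereas interpolating Z directly gives 2.5.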

Our flat-shaded case then calls ``render_triangle()``, specifying ``<1>`` as the number of
iterated parameters to interpolate, and pointing to a special-case flat render callback::

    void example_renderer::render_flat(int32_t y, extent_t const &extent, example_object_data const &object, int threadid)
    {
        // get pointers to the start of the depth buffer and destination scanlines
        uint16_t *depth = &object.depth->pix(y);
        uint32_t *dest = &object.dest->pix(y);

        // get the starting 1/Z value and the delta per X
        float ooz = extent.param[0].start;
        float doozdx = extent.param[0].dpdx;

        // iterate over the extent
        for (int x = extent.startx; x < extent.stopx; x++)
        {
            // convert the 1/Z value into an integral depth value
            uint16_t depthval = ooz_to_depthval(ooz);

            // if closer than the current pixel, copy the color and depth value
            if (depthval < depth[x])
            {
                dest[x] = object.color;
                depth[x] = depthval;
            }

            // regardless, update the 1/Z value for the next pixel
            ooz += doozdx;
        }
    }

This render callback is a bit more involved than the clearing case.

First, we have an iterated parameter (1/Z) to deal with, whose starting and X-delta
values we extract from the extent before the start of the inner loop.

Second, we perform depth buffer testing, using ``ooz_to_depthval()`` as a helper
to transform the floating-point 1/Z value into a 16-bit integer. We compare this value against
the current depth buffer value, and only store the pixel/depth value if it’s less.

At the end of each iteration, we advance the 1/Z value by the X-delta in preparation for the
next pixel.


draw_triangle_gouraud
~~~~~~~~~~~~~~~~~~~~~

Finally we get to the code for the full-on Gouraud-shaded case::

    void example_renderer::draw_triangle_gouraud(example_vertex const *verts)
    {
        // allocate object data and populate it with information needed
        example_object_data &object = object_data().next();
        object.dest = &m_display[m_draw_buffer];
        object.depth = &m_depth;

        // copy X, Y, 1/Z, and R,G,B into poly_manager vertices
        vertex_t v[3];
        for (int vertnum = 0; vertnum < 3; vertnum++)
        {
            v[vertnum].x = verts[vertnum].x;
            v[vertnum].y = verts[vertnum].y;
            v[vertnum].p[0] = 1.0f / verts[vertnum].z;
            v[vertnum].p[1] = verts[vertnum].color.r();
            v[vertnum].p[2] = verts[vertnum].color.g();
            v[vertnum].p[3] = verts[vertnum].color.b();
        }

        // render the triangle with 4 iterated parameters (1/Z, R, G, B)
        render_triangle<4>(m_display[0].cliprect(),
                            render_delegate(&example_renderer::render_gouraud, this),
                            v[0], v[1], v[2]);
    }

Here we have 4 iterated parameters: the 1/Z depth value, plus red, green, and blue,
stored as floating point values. We call ``render_triangle()`` with ``<4>`` as the
number of iterated parameters to process, and point to the full Gouraud render callback::

    void example_renderer::render_gouraud(int32_t y, extent_t const &extent, example_object_data const &object, int threadid)
    {
        // get pointers to the start of the depth buffer and destination scanlines
        uint16_t *depth = &object.depth->pix(y);
        uint32_t *dest = &object.dest->pix(y);

        // get the starting 1/Z value and the delta per X
        float ooz = extent.param[0].start;
        float doozdx = extent.param[0].dpdx;

        // get the starting R,G,B values and the delta per X as 8.24 fixed-point values;
        // convert via int64_t so that negative deltas wrap instead of being undefined
        uint32_t r = uint32_t(int64_t(extent.param[1].start * float(1 << 24)));
        uint32_t drdx = uint32_t(int64_t(extent.param[1].dpdx * float(1 << 24)));
        uint32_t g = uint32_t(int64_t(extent.param[2].start * float(1 << 24)));
        uint32_t dgdx = uint32_t(int64_t(extent.param[2].dpdx * float(1 << 24)));
        uint32_t b = uint32_t(int64_t(extent.param[3].start * float(1 << 24)));
        uint32_t dbdx = uint32_t(int64_t(extent.param[3].dpdx * float(1 << 24)));

        // iterate over the extent
        for (int x = extent.startx; x < extent.stopx; x++)
        {
            // convert the 1/Z value into an integral depth value
            uint16_t depthval = ooz_to_depthval(ooz);

            // if closer than the current pixel, assemble the color
            if (depthval < depth[x])
            {
                dest[x] = rgb_t(r >> 24, g >> 24, b >> 24);
                depth[x] = depthval;
            }

            // regardless, update the 1/Z and R,G,B values for the next pixel
            ooz += doozdx;
            r += drdx;
            g += dgdx;
            b += dbdx;
        }
    }

This follows the same pattern as the flat-shaded callback, except we have 4 iterated parameters
to step through.

Note that even though the iterated parameters are of ``float`` type, we convert the
color values to fixed-point integers when iterating over them. This saves us doing 3
float-to-int conversions each pixel. The original RGB values were 0-255, so interpolation
can only produce values in the 0-255 range. Thus we can use 24 bits of a 32-bit integer as
the fraction, which is plenty accurate for this case.
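
The fixed-point stepping can be tried in isolation with a sketch like the one below.
Both helpers are hypothetical and not part of **poly_manager**. One choice made here is
an assumption of the sketch: the conversion goes through ``int64_t`` so that negative
deltas have well-defined behavior before being wrapped into 32 bits::

    #include <cstdint>

    // convert a value in the 0-255 range (or a per-pixel delta) to 8.24 fixed
    // point; negative inputs wrap modulo 2^32, which is what makes repeated
    // unsigned addition of a negative delta behave like subtraction
    inline uint32_t to_fixed_8_24(float value)
    {
        return uint32_t(int64_t(value * float(1 << 24)));
    }

    // step one color channel across `steps` pixels using integer adds only,
    // then return the integer part of the result
    inline uint8_t step_channel(float start, float dpdx, int steps)
    {
        uint32_t value = to_fixed_8_24(start);
        uint32_t const delta = to_fixed_8_24(dpdx);
        for (int i = 0; i < steps; i++)
            value += delta;
        return uint8_t(value >> 24);
    }

For example, starting at 200 with a delta of -1.5 per pixel, ten steps land on 185, and
the inner loop never touches a floating-point value.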


Advanced Topic: the poly_array class
------------------------------------

**poly_array** is a template class that is used to manage a dynamically-sized vector of
objects whose lifetime starts at allocation and ends when ``reset()`` is called. The
**poly_manager** class uses several **poly_array** objects internally, including one for
allocated **ObjectType** data, one for the primitives rendered, and one for holding all
allocated extents.

**poly_array** has the additional property that, after a reset, it retains a copy of the
most recently allocated object. This ensures that callers can always call ``last()`` and
get a valid object, even immediately after a reset.

The **poly_array** class requires two template parameters::

    template<class ArrayType, int TrackingCount>
    class poly_array;

These parameters are:

* **ArrayType** is the type of object you wish to allocate and manage.

* **TrackingCount** is the number of objects you wish to preserve after a reset. Typically
  this value is either 0 (you don’t care to track any objects) or 1 (you only need one
  object); however, if you are using **poly_array** to manage a shared collection of
  objects across several independent consumers, it can be higher. See below for an example
  where this might be handy.

Note that objects allocated by **poly_array** are owned by **poly_array** and will be
automatically freed when the **poly_array** itself is destroyed.

**poly_array** is optimized for use in high-frequency multi-threaded systems. Therefore,
one added feature of the class is that it rounds the allocation size of **ArrayType** up
to the next cache line boundary, on the assumption that neighboring entries could be
accessed by different cores simultaneously. Keeping each **ArrayType** object in its
own cache line avoids the performance impact of false sharing.

Currently, **poly_array** has no mechanism to determine cache line size at runtime, so
it presumes that 64 bytes is a typical cache line size, which is true for most x64 and ARM
chips as of 2021. This value can be altered by changing the **CACHE_LINE_SHIFT** constant
defined at the top of the class.
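
The rounding itself can be sketched as follows. This is an illustration rather than the
class's actual code: the constant name is taken from the text above, its value of 6 is
inferred from the stated 64-byte line size, and ``round_to_cache_line()`` is a
hypothetical helper::

    #include <cstddef>

    // assumed for illustration: 1 << 6 == 64 bytes per cache line
    constexpr int CACHE_LINE_SHIFT = 6;
    constexpr std::size_t CACHE_LINE_SIZE = std::size_t(1) << CACHE_LINE_SHIFT;

    // round a size in bytes up to the next multiple of the cache line size
    constexpr std::size_t round_to_cache_line(std::size_t size)
    {
        return (size + CACHE_LINE_SIZE - 1) & ~(CACHE_LINE_SIZE - 1);
    }

Under these assumptions a 100-byte object would occupy 128 bytes per entry, while an
object that is already exactly 64 bytes would be left unchanged.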

Objects allocated by **poly_array** are created in 64k chunks. At construction time, one
chunk’s worth of objects is allocated up front. The chunk size is controlled by the
**CHUNK_GRANULARITY** constant defined at the top of the class.

As more objects are allocated, if **poly_array** runs out of space, it will dynamically
allocate more. This will produce discontiguous chunks of objects until the next ``reset()``
call, at which point **poly_array** will reallocate all the objects into a contiguous
vector once again.

For the case where **poly_array** is used to manage a shared pool of objects, it can be
configured to retain multiple most recently allocated items by using a **TrackingCount**
greater than 1. For example, if **poly_array** is managing objects for two texture units,
then it can set **TrackingCount** equal to 2, and pass the index of the texture unit in
calls to ``next()`` and ``last()``. After a reset, **poly_array** will remember the most
recently allocated object for each of the units independently.
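
The tracking behavior can be modeled with a deliberately simplified toy class. The
``toy_array`` below is hypothetical and ignores chunking, cache-line padding, and
placement new; it only mirrors the ``next()``/``last()``/``reset()`` semantics described
above so that the effect of **TrackingCount** can be observed::

    #include <array>
    #include <cstddef>
    #include <deque>

    template<class ArrayType, int TrackingCount>
    class toy_array
    {
    public:
        toy_array() { m_last_index.fill(-1); }

        // allocate a new object and remember it as the latest for tracking_index
        ArrayType &next(int tracking_index = 0)
        {
            m_items.emplace_back();
            m_last_index[tracking_index] = int(m_items.size()) - 1;
            return m_items.back();
        }

        // return the latest object for tracking_index, or its saved copy if
        // nothing has been allocated for that index since the last reset
        ArrayType &last(int tracking_index = 0)
        {
            int const index = m_last_index[tracking_index];
            return (index < 0) ? m_saved[tracking_index] : m_items[index];
        }

        // number of objects allocated since the last reset
        std::size_t count() const { return m_items.size(); }

        // discard all objects, keeping a copy of the latest one per tracking index
        void reset()
        {
            for (int t = 0; t < TrackingCount; t++)
                if (m_last_index[t] >= 0)
                {
                    m_saved[t] = m_items[m_last_index[t]];
                    m_last_index[t] = -1;
                }
            m_items.clear();
        }

    private:
        std::deque<ArrayType> m_items;                  // live objects
        std::array<ArrayType, TrackingCount> m_saved{}; // copies kept across reset
        std::array<int, TrackingCount> m_last_index;    // -1 means "use saved copy"
    };

With ``toy_array<int, 2>``, allocating values for tracking indexes 0 and 1 and then
calling ``reset()`` leaves ``count()`` at zero, yet ``last(0)`` and ``last(1)`` still
return the values most recently allocated for each index.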


Methods
~~~~~~~

poly_array
++++++++++
::

    poly_array();

The **poly_array** constructor requires no parameters and simply pre-allocates one
chunk of objects in preparation for future allocations.

count
+++++
::

    u32 count() const;

**Return value:** the number of objects currently allocated.

max
+++
::

    u32 max() const;

**Return value:** the maximum number of objects ever allocated at one time.

itemsize
++++++++
::

    size_t itemsize() const;

**Return value:** the size of an object, rounded up to the nearest cache line boundary.

allocated
+++++++++
::

    u32 allocated() const;

**Return value:** the number of objects that fit within what’s currently been allocated.

byindex
+++++++
::

    ArrayType &byindex(u32 index);

Returns a reference to an object in the array by index. Equivalent to [**index**] on a
normal array:

* **index** is the index of the item you wish to reference.

**Return value:** a reference to the object in question. Since a reference is returned,
it is your responsibility to ensure that **index** is less than ``count()`` as there
is no mechanism to return an invalid result.

contiguous
++++++++++
::

    ArrayType *contiguous(u32 index, u32 count, u32 &chunk);

Returns a pointer to the base of a contiguous section of **count** items starting at
**index**. Because **poly_array** dynamically resizes, it may not be possible to access
all **count** objects contiguously, so the number of actually contiguous items is
returned in **chunk**:

* **index** is the index of the first item you wish to access contiguously.

* **count** is the number of items you wish to access contiguously.

* **chunk** is a reference to a variable that will be set to the actual number of
  contiguous items available starting at **index**. If **chunk** is less than **count**,
  then the caller should process the **chunk** items returned, then call ``contiguous()``
  again at (**index** + **chunk**) to access the rest.

**Return value:** a pointer to the first item in the contiguous chunk. No range checking
is performed, so it is your responsibility to ensure that **index** + **count** is less
than or equal to ``count()``.
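
The calling pattern described above can be sketched as a loop. To keep the example
self-contained it runs against ``chunked_ints``, a hypothetical stand-in that stores
integers in fixed-size blocks and exposes a ``contiguous()`` method with the same shape;
it is not the real **poly_array**, and ``sum_range()`` is likewise invented for
illustration::

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    using u32 = uint32_t;

    // hypothetical stand-in: holds the values 0..total-1 in separate blocks
    class chunked_ints
    {
    public:
        chunked_ints(u32 total, u32 per_block) : m_per_block(per_block)
        {
            for (u32 base = 0; base < total; base += per_block)
            {
                m_blocks.emplace_back(std::min(per_block, total - base));
                for (u32 i = 0; i < m_blocks.back().size(); i++)
                    m_blocks.back()[i] = int(base + i);
            }
        }

        // same shape as the method documented above; no range checking
        int *contiguous(u32 index, u32 count, u32 &chunk)
        {
            u32 const offset = index % m_per_block;
            chunk = std::min(count, m_per_block - offset);
            return &m_blocks[index / m_per_block][offset];
        }

    private:
        u32 m_per_block;
        std::vector<std::vector<int>> m_blocks;
    };

    // process count items starting at index, one contiguous run at a time
    inline long sum_range(chunked_ints &array, u32 index, u32 count)
    {
        long total = 0;
        while (count != 0)
        {
            u32 chunk;
            int *items = array.contiguous(index, count, chunk);
            for (u32 i = 0; i < chunk; i++)
                total += items[i];

            // advance past the run just processed and ask for the rest
            index += chunk;
            count -= chunk;
        }
        return total;
    }

The loop never assumes more than **chunk** items are adjacent in memory, so it keeps
working whether the requested range sits in one block or spans several.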

indexof
+++++++
::

    int indexof(ArrayType &item) const;

Returns the index within the array of the given item:

* **item** is a reference to an item in the array.

**Return value:** the index of the item. It should always be the case that::

    array.indexof(array.byindex(index)) == index

reset
+++++
::

    void reset();

Resets the **poly_array** by semantically deallocating all objects. If previous allocations
created a discontiguous array, a fresh vector is allocated at this time so that future
allocations up to the same level will remain contiguous.

Note that the **ArrayType** destructor is *not* called on objects as they are deallocated.

**Return value:** none.

next
++++
::

    ArrayType &next(int tracking_index = 0);

Allocates a new object and returns a reference to it. If there is not enough space for
a new object in the current array, a new discontiguous array is created to hold it:

* **tracking_index** is the tracking index you wish to assign the new item to. In the
  common case this is 0, but could be non-zero if using a **TrackingCount** greater than 1.

**Return value:** a reference to the object. Note that the placement new operator is
called on this object, so the default **ArrayType** constructor will be invoked here.
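
The practical consequence of constructing with placement new, combined with ``reset()``
not running destructors, can be seen in a standalone sketch. The ``tracked`` type and
``construct_in()`` helper are hypothetical and simply count constructor and destructor
calls on a reused slot of raw storage::

    #include <new>

    // hypothetical type that counts how often it is constructed and destroyed
    struct tracked
    {
        tracked() { constructed++; }
        ~tracked() { destroyed++; }

        static inline int constructed = 0;
        static inline int destroyed = 0;
    };

    // default-construct a tracked object in caller-provided raw storage
    inline tracked &construct_in(void *storage)
    {
        return *new (storage) tracked;
    }

Constructing twice in the same storage runs the constructor twice and the destructor
not at all. It therefore seems safest to keep **ArrayType** free of members that rely
on their destructor to release resources.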

last
++++
::

    ArrayType &last(int tracking_index = 0) const;

Returns a reference to the last object allocated:

* **tracking_index** is the tracking index whose object you want. In the
  common case this is 0, but could be non-zero if using a **TrackingCount** greater than 1.
  **poly_array** remembers the most recently allocated object independently for each
  **tracking_index**.

**Return value:** a reference to the last allocated object.