Christopher Browne cbbrowne
Fri Oct 15 23:30:31 PDT 2004
An issue we recently ran into was that our network guys would like to
get a _clear_ picture as to how much bandwidth Slony-I is using.  It
has a potential impact on costs, as we have some bandwidth providers
that charge extra when usage exceeds what was planned for.

I'm sure many would be interested in getting a clearer picture as to
how much of their bandwidth between destinations is getting "eaten up"
by replication.

Here are the beginnings of "a thought" on how we might get some
statistics (at this point incomplete, but hopefully still useful, at
least provisionally) out of the slon processes.

A Method To Estimate Bandwidth Usage

 A "first order" approximation of how much bandwidth is used might be
 obtained by recording the sizes of all the queries that get submitted.

 This would likely reflect only about half of the bandwidth usage,
 since for every instance of "insert into table x values (a, b, c)"
 that goes to a subscribing node, the contents of that query were
 first retrieved from the provider node.

 That being said, knowing that "reality" is on the order of 2x the
 size of the queries submitted is still a useful thing to know.

A Mechanism To Determine Query Sizes

 Where slon submits queries, it virtually always does so via the
 following idiom:

  res = PQexec(dbconn, dstring_data(&query1));

 dstring_data(x) is actually a macro that references a field in the
 query structure.

 One might capture the size of the query by modifying dstring_data(x)
 to surreptitiously use strlen() to measure the size of the query, and
 then accumulate that value in a running counter.
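 To make that concrete, here is a minimal, self-contained sketch (not
 the actual Slony-I source) of a counting variant of the macro.  The
 SlonDString layout, the field name "data", and the counter name are
 all assumptions for illustration:

  #include <assert.h>
  #include <stdio.h>
  #include <string.h>

  /* Assumed shape: dstring_data(x) just hands out this field. */
  typedef struct {
      char *data;               /* buffer holding the query text */
  } SlonDString;

  static size_t total_query_bytes = 0;

  /* Counting variant: measure the string, then yield the buffer.
   * Uses the comma operator so it still works as an expression. */
  #define dstring_data_counted(x) \
      (total_query_bytes += strlen((x)->data), (x)->data)

  int main(void)
  {
      SlonDString query1 = { "insert into table x values (1, 2, 3);" };
      const char *sql = dstring_data_counted(&query1);
      printf("submitted %zu bytes so far\n", total_query_bytes);
      assert(total_query_bytes == strlen(sql));
      return 0;
  }

 One caveat with doing this inside the macro itself: dstring_data(x)
 may also be used in places other than query submission, which would
 inflate the count.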

 Alternatively, we might create a wrapper function and replace usage
 of PQexec() with that function.  That is probably less scary :-).
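 A sketch of that wrapper approach follows.  PGconn, PGresult, and
 PQexec are stubbed here so the example is self-contained; in slon
 itself they would come from libpq-fe.h, and the wrapper's name is an
 assumption:

  #include <assert.h>
  #include <stddef.h>
  #include <string.h>

  typedef struct PGconn   PGconn;    /* stub for libpq's PGconn   */
  typedef struct PGresult PGresult;  /* stub for libpq's PGresult */

  static PGresult *PQexec(PGconn *conn, const char *query)
  {
      (void) conn; (void) query;
      return NULL;          /* stub: real libpq would run the query */
  }

  static size_t total_query_bytes = 0;

  /* Replace "res = PQexec(dbconn, sql)" with
   * "res = slon_exec(dbconn, sql)" at the call sites. */
  static PGresult *slon_exec(PGconn *conn, const char *query)
  {
      total_query_bytes += strlen(query);
      return PQexec(conn, query);
  }

  int main(void)
  {
      slon_exec(NULL, "select 1;");
      slon_exec(NULL, "insert into t values (42);");
      assert(total_query_bytes == strlen("select 1;")
                                + strlen("insert into t values (42);"));
      return 0;
  }

 The wrapper keeps the counting at the submission boundary only, so it
 avoids the risk of counting dstring_data() uses that never reach the
 network.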

A Mechanism To Report Usage

 There then needs to be a way to report the amount of query data that
 has passed through dstring_data(x)'s "mouth."

 I would suggest that reporting it in the slon logs each time the
 cleanup cycle runs might be a reasonably satisfactory interval.
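 The reporting hook itself could be as small as the following sketch.
 The function name, the counter name, and the use of plain printf()
 in place of slon's logging are all assumptions; the "~2x" figure is
 the estimate argued for above:

  #include <assert.h>
  #include <stddef.h>
  #include <stdio.h>

  /* Running total maintained wherever queries are submitted
   * (hypothetical name, not in the slon source). */
  static size_t total_query_bytes = 0;

  /* Hypothetical hook invoked once per cleanup cycle: report the
   * interval's total plus the doubled estimate, then reset. */
  static void report_query_bytes(void)
  {
      printf("cleanup: %zu query bytes submitted "
             "(~%zu bytes estimated total traffic)\n",
             total_query_bytes, 2 * total_query_bytes);
      total_query_bytes = 0;
  }

  int main(void)
  {
      total_query_bytes = 1500;        /* pretend an interval passed */
      report_query_bytes();
      assert(total_query_bytes == 0);  /* counter reset after report */
      return 0;
  }

 Resetting per interval makes the numbers line up with per-cycle
 bandwidth; keeping a cumulative total instead would be an equally
 reasonable design choice.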

A Method That Is Incomplete

 The above methodology is conspicuously incomplete in three ways that
 I can see:

  - It does not make any attempt to measure the amount of bandwidth
    consumed by results returned by queries.

    The size of the result set would be pretty messy to rummage
    through as a binary data structure.

    That being said, the results that one would expect to be of
    material size are those coming from queries of sl_log tables on
    the provider node that then lead to queries being submitted to the
    subscriber node.  And we would expect those to be of roughly equal
    size, implying that it might be right to multiply the sizes of
    queries by 2 for heavily updated databases.

  - It does not make any attempt to separate statistics for "requests"
    going to the provider from those for updates going to the
    subscriber.

  - It uses "size of queries" as a metric for "bandwidth used."

    That is necessarily only an approximation.

 Nonetheless, if we can get the "low hanging fruit" of being able to
 easily provide _some_ numbers, that may suffice to allow the creation
 of estimates that are, if not exact, at least based on a rational
 process.  It is better to have some numbers than to have no
 numbers...
-- 
"cbbrowne","@","ca.afilias.info"
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 673-4124 (land)

