Christopher Browne cbbrowne at ca.afilias.info
Fri Dec 3 09:26:05 PST 2010
This collects ideas from meetings held in November/December 2010 for the
next major version of Slony.

It is quite likely that many of these features ''won't'' get
implemented, based on a "more work than they're worth" evaluation.

The point of the exercise has been to capture a goodly set of ideas to
make it possible to effectively evaluate what might be worth
implementing.

== slon extensions ==

=== Slon monitoring ===

* per Chris Browne

* slon records in a queryable form what it's working on

* Requires writing (+COMMIT) at the start of the event loop

* [As noted by Yehuda] This ought to also be useful with slonik, to
  allow it to indicate what it is currently working on

Debate took place over various mechanisms.  The only one without
dramatically unacceptable properties was to add a table to the Slony-I
schema.

; SNMP
: Existing standard for network-based monitoring
: Downside - requires extra infrastructure including libraries and possibly additional tooling to use it
; [http://www.spread.org/ Spread]
: Fast
: Downside - requires extra infrastructure that's not particularly "standard"
; NOTIFY/LISTEN
: Built-in to Postgres - no new infrastructure needed
: In Postgres 9.0+ can carry payload
: Only becomes visible upon COMMIT
: Results are not stored; listeners must pre-declare their interest
; SQL table
: Built-in to Postgres - no new infrastructure needed
: Only becomes visible upon COMMIT
: Another table to manage and clean up

==== Monitoring Requirements ====

Crucial facts that we want to know about:
# Is replication behind?
# What are components (e.g. - slon, slonik) doing? <BR> Note that slon has numerous threads
# Recent abnormal states for events (e.g. - error messages)
# Are any non-SYNC events outstanding?
# Backlog volume?
# What is the cluster's configuration?

===== Replication Behind? =====

The existing view sl_status captures this from a what-is-confirmed
perspective.  That is not perfect, but it is not obvious that enhancing
it is a high priority.
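
For illustration, lag can already be pulled from that view; a minimal
query, assuming a cluster named '''mycluster''' (hence schema
'''_mycluster''') and arbitrary example thresholds:
<pre>
-- Per subscriber, how many events and how much time it is behind.
select st_origin, st_received, st_lag_num_events, st_lag_time
  from _mycluster.sl_status
 where st_lag_num_events > 3
    or st_lag_time > interval '15 seconds';
</pre>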

===== What are components doing? =====

Nothing relevant is captured in a usable fashion.

The likely approach is to add a table where each thread records ''what
am I doing?'', with each new entry replacing the thread's previous one.

This table would contain a tuple for:
# Each remote worker thread
# Cleanup thread
# Each remote listener thread
# Local SYNC thread

It would track things such as:
# Time processing started
# What thread/process am I?
# What node am I for?
# What am I doing?  <BR> Possibly in several pieces, to cover the following sorts of facts:
## Event ID
## Event type <BR> Though this could be pulled from sl_event, given node and event ID
## Additional event activity <BR> Again, could be pulled from sl_event, given node and event ID

Note that the contents of this table should be quite tiny; a tuple per slon thread on a node.
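
For concreteness, a minimal sketch of such a table; the names here are
hypothetical, not a settled catalog definition:
<pre>
-- Hypothetical monitoring table; one tuple per slon thread on a node.
create table sl_components (
    co_actor     text not null primary key,  -- e.g. 'local_sync', 'remote worker node 3'
    co_pid       integer not null,           -- process ID of the slon
    co_node      integer not null,           -- node this thread works on behalf of
    co_starttime timestamptz not null,       -- when current activity started
    co_event     bigint,                     -- event ID being worked on, if any
    co_eventtype text                        -- e.g. 'SYNC'; could instead be joined from sl_event
);
</pre>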

This also needs to capture what '''slonik''' is doing, which seems more troublesome:
# It is possible to have multiple slonik instances acting concurrently - multiple concurrent events!
# There is no natural "event loop" such that slonik activities would be expected to clean themselves up over time

====== Suggested slon implementation ======

Two approaches emerged for establishing connections to capture this monitoring data:
# Each thread opens its own DB connection <BR> Unacceptable: leads to an ''enormous'' increase in the number of DB connections, most of them sitting mostly idle
# Establish a "monitoring thread"
## A message queue allows other threads to stow entries (complete with timestamps) that the monitoring thread periodically flushes to the database
## It is plausible that this thread could be merged into the existing local SYNC thread which isn't terribly busy
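
Under that approach, each flush from the monitoring thread could be a
simple overwrite of the thread's row, continuing the hypothetical
sl_components sketch above:
<pre>
-- Flush one queued entry, overwriting the thread's previous state.
-- The timestamp comes from the queued entry, not from now(), since
-- the flush happens some time after the work began.
update sl_components
   set co_pid = 12345, co_node = 3,
       co_starttime = '2010-12-03 09:26:05-08',
       co_event = 5000012345, co_eventtype = 'SYNC'
 where co_actor = 'remote worker node 3';
</pre>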

===== Recent abnormal states for events  =====

This captures messages about the most recent problem that occurred, storing:
# Time of abnormality
# Event ID
# Node ID
# Description / Error Message
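
Again purely as a sketch, with hypothetical names:
<pre>
-- Hypothetical table of recent abnormal states.
create table sl_event_problems (
    ep_timestamp timestamptz not null default now(),  -- time of abnormality
    ep_origin    integer not null,                    -- node ID the event came from
    ep_seqno     bigint not null,                     -- event ID
    ep_message   text not null                        -- description / error message
);
</pre>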

===== non-SYNC events outstanding? =====

This information is already captured, and may be revealed by running a query that asks, on the source node, for all events that:
# Are not SYNC events
# Have not been confirmed by the subscriber
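
For example (actual catalog names, though the cluster schema name
'''_mycluster''' and subscriber node ID 2 are placeholders):
<pre>
-- On the source node: all non-SYNC events not yet confirmed by node 2.
select ev_origin, ev_seqno, ev_type
  from _mycluster.sl_event e
 where ev_type <> 'SYNC'
   and not exists
       (select 1 from _mycluster.sl_confirm c
         where c.con_origin = e.ev_origin
           and c.con_received = 2
           and c.con_seqno >= e.ev_seqno);
</pre>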

===== Backlog volume =====

[http://www.slony.info/bugzilla/show_bug.cgi?id=166 Bug #166]

This seems troublesome; calculating the number of sl_log_* tuples
involved in a particular SYNC requires running the same complex query
that the remote_worker thread uses to determine which tuples are to be
applied.

This query is complex to generate and fairly expensive to run.

Note that [http://www.slony.info/bugzilla/show_bug.cgi?id=167 Bug #167] is changing this query.
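
Short of running that full query, only a rough approximation is readily
available, e.g. counting the log tuples for an origin that have not yet
been trimmed.  This ignores the transaction-visibility filtering the
real query performs, and does not isolate a single SYNC:
<pre>
-- Rough backlog indicator only; assumes cluster schema "_mycluster"
-- and origin node 1, and ignores sl_log_2.
select count(*) from _mycluster.sl_log_1 where log_origin = 1;
</pre>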

===== Cluster configuration =====

There is an existing tool that does some analysis of cluster
configuration; see
[http://git.postgresql.org/gitweb?p=slony1-engine.git;a=blob;f=tools/test_slony_state.pl;h=fdc9dcc060229f39a1e1ac8608e33d63054658bf;hb=refs/heads/master test_slony_state.pl]

It is desirable to have something that generates diagrams of the relationships between nodes, capturing:
# Nodes
# Subscription Sets, and the paths they take
# Paths between nodes
# Listen paths
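
The raw inputs for such diagrams are already queryable from the Slony-I
catalogs (cluster schema name again assumed to be '''_mycluster'''):
<pre>
-- Nodes, subscriptions (and their providers), paths, and listen paths.
select no_id, no_active, no_comment from _mycluster.sl_node;
select sub_set, sub_provider, sub_receiver, sub_forward, sub_active
  from _mycluster.sl_subscribe;
select pa_server, pa_client from _mycluster.sl_path;
select li_origin, li_provider, li_receiver from _mycluster.sl_listen;
</pre>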

It would be nice for the Subscription Set diagram to indicate replication state/lag for each node, including:
# Event Number
# Events Behind Parent
# Time Behind Parent
# Events Behind Origin
# Time Behind Origin

=== Faster Replication - COPY Protocol ===
* Use COPY + Triggers on sl_log_*
* per Jan Wieck
* New encoding of tuple information
* Triggers do direct heap updates
* Eliminates overhead of parsing each statement
* COPY implicitly introduces streaming
** Eliminates need for current memory management logic for processing large tuples
* Should reduce amount of work done by slons to parse sl_log_* cursors, generating I/U/D streams
** Requires separating out log shipping into a separate daemon, as there are users known to depend on being able to parse log shipping data as INSERT/UPDATE/DELETE statements

=== SYNC pipelining ===
* per Jan Wieck
* open 2 connections to source DB, start pulling new data while the previous request is pushing I/U/D requests to the subscriber
* Might be unnecessary if using COPY+triggers to stream data
* Maximum value comes if the time required to get to the point of '''%f seconds delay for first row''' is equal to the remainder of the time required to process the SYNC
** If query to start up the cursor on sl_log_* takes most of the time, then there's not much gained by starting the next one early
** If query to process sl_log_* data takes most of the time, again, there's not much gained by starting the next one early

=== Compress sequences of DELETE requests ===
* per Chris Browne
* Note that TRUNCATE is already replicated
* Two ways to interpret it...
# log trigger compresses them upon receipt, essentially merging subsequent requests into a single sl_log_* tuple <BR> This changes the '''INSERT-only''' understanding of sl_log_*
# slon remote worker thread compresses them when processing a SYNC
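
To illustrate the intent, using a hypothetical table '''foo''':
<pre>
-- Instead of replaying a run of single-row deletes...
delete from foo where id = 1;
delete from foo where id = 2;
delete from foo where id = 3;
-- ...replay one merged statement on the subscriber:
delete from foo where id in (1, 2, 3);
</pre>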

Overall, this seems a pretty dubious feature.

=== SNMP ===
* per Chris Browne
* Some minimal SNMP functionality was added in 2005
* Untouched since; requires the --with-netsnmp option to ./configure
* Should we improve it, or drop it?
* [http://lists.slony.info/pipermail/slony1-general/2010-November/011301.html list thread]

* In discussions of 2010-11-30, some debate over whether this is worth augmenting to support some of the monitoring requirements discussed elsewhere
** In favour, SNMP exists and is a standard protocol for network-based monitoring
** Contrary, several things were observed:
*** Unknown whether the present code works
*** For us to use SNMP within Slony-I for monitoring draws in a dependency on SNMP libraries and possibly other tooling.  <BR> We sure ''wouldn't'' want to mandate installing and configuring something like Nagios as a component of Slony-I.
*** In contrast, consider that we already have '''libpq''' used everywhere; for us to capture some monitoring information in a table requires no additional components.

=== Deal better with sl_log backlog ===
* When replication is backlogged, the sl_log tables can grow to be huge, which slows down replication.
* Have slony create new sl_log_x tables as needed.
* [http://lists.slony.info/pipermail/slony1-general/2010-November/011343.html See mailing list discussion]
* Note that Vivek Khera has noticed this happens less with SSDs than with spinning disks, so evidently disk performance has a material effect on this.

=== Shunned Node ===
* Per [http://lists.slony.info/pipermail/slony1-general/2010-November/011353.html discussion on list]
* Allow configuring that some nodes should be ignored for the purpose of confirmations.
* This allows the cleanup thread to trim out sl_log_(1|2) data.
* FAILOVER may already support the notion that a "shunned" node might be lost.
* An interesting extension: SUBSCRIBE SET could automatically mark the "child" node as shunned until such time as the subscription has completed and caught up.
** This only works if only one set is involved; if there are multiple subscriptions, shunning only works out well for the first one.
** Counterpoint: it will work just fine for multiple subscriptions.  The point here is to free the master, and forwarders other than the data provider of the node that is busy subscribing, from keeping copies of the log.

=== Health Checks ===

* At time of node start up, check to see if the nodes providing my subscriptions believe my node exists
** If they do, all is well
** If they don't, then presumably I'm a failed node
** If a connection cannot be established, then warn of this (probably with a pretty quick timeout) but continue, for now...
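
The check itself could be a trivial query run against each provider
(cluster schema name and node ID are placeholders):
<pre>
-- Asked of each provider at startup: "do you believe node 2 exists?"
select exists (select 1 from _mycluster.sl_node where no_id = 2)
    as node_known;
</pre>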

=== Use application_name ===

* per Chris Browne
* If on a version that supports [http://www.postgresql.org/docs/9.0/static/runtime-config-logging.html#GUC-APPLICATION-NAME GUC application_name], connections should capture "this is slon"
* application_name is available on PG 9.0+
* The value we use should likely include several (possibly all) of the following:
** slon
** ID of the node being managed by slon
** ID of the secondary node being queried
** perhaps the cluster name?
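
For illustration, with a hypothetical composite value covering the
suggestions above (the format is only a suggestion):
<pre>
-- Detect support first (the GUC only exists on PG 9.0+):
select count(*) from pg_settings where name = 'application_name';
-- Then label the connection; a secondary node ID might also be appended:
set application_name = 'slon:mycluster:node2';
</pre>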

== Bugs Known ==
===  #53 - high RAM usage with high table # ===
* [http://bugs.slony.info/bugzilla/show_bug.cgi?id=53 #53]
* Likely involves replacing array with hash table
* Possible BSD-licensed hash table implementations
** [http://www.cl.cam.ac.uk/~cwc22/hashtable/ C Hash Table] <BR> See also [https://github.com/ryantenney/chashtable ryantenney / chashtable @ GitHub] <BR> Note: [http://xen.1045712.n5.nabble.com/Legal-concerns-added-hashtable-implementation-td2485274.html Used in Xen]
** [http://uthash.sourceforge.net/ UTHash]
** [http://burtleburtle.net/bob/c/lookup3.c lookup3] - Public Domain

===  #80 - slon daemon restarts itself in a loop after failover ===
* [http://bugs.slony.info/bugzilla/show_bug.cgi?id=80 #80]
===  #81 - duplicate key sl_nodelock-pkey and duplicate slon(8) processes not detected ===
* [http://bugs.slony.info/bugzilla/show_bug.cgi?id=81 #81]
===  #111 - UNSUBSCRIBE SET cancels outstanding SUBSCRIBE SET ===
* [http://bugs.slony.info/bugzilla/show_bug.cgi?id=111 #111]
===  #126 - client side KEEPALIVE on connections ===
* [http://bugs.slony.info/bugzilla/show_bug.cgi?id=126 #126]
===  #137 - EXECUTE SCRIPT not applied in right order ===
* [http://bugs.slony.info/bugzilla/show_bug.cgi?id=137 #137]
* Shift DDL from sl_event to sl_log_(1|2)
* Allow DBA to specify which locks EXECUTE SCRIPT requires in the slonik script

===  #152 - DDL noisy - might be handled well by "Other Health Criteria" ideas ===
* [http://bugs.slony.info/bugzilla/show_bug.cgi?id=152 #152]
===  #163 - Change to use TIMESTAMPTZ in Slony-defined tables ===
* [http://bugs.slony.info/bugzilla/show_bug.cgi?id=163 #163]
=== #166 : Size of SYNC ===
* [http://bugs.slony.info/bugzilla/show_bug.cgi?id=166 Bug #166]
* Useful to ask how much work is involved in particular SYNCs
* Allows evaluating questions like:
** Is this SYNC enormous?
** Is slon busy processing a huge SYNC?

Alternatively, if SYNCs are small, a long-running SYNC suggests something is broken.

If we add a stored function that evaluates this, it should allow a significant simplification of the C logic in src/slon/remote_worker.c, which presently does the following:
# Pull outstanding transaction lists for two events
# Generate a potentially enormous WHERE clause indicating the set of transactions that need to be excluded when pulling tuples from sl_log_(1|2)
# Compress that WHERE clause in order that the query not be so long that the Postgres parser falls over

Instead, we imagine that a function may do the following:
# Create a temp table to capture transaction ID values
# Truncate that table, if it already exists
# Pull the transaction list described above, and stow it in the temp table

The WHERE clause would be altered to draw from the temp table instead of writing values explicitly into the WHERE clause.  This would eliminate a pretty substantial body of code from src/slon/remote_worker.c, including:
# The function compress_actionseq (about 280 lines of code)
# Portions of sync_helper()
# Portions of sync_event()
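
A hedged sketch of the temp-table approach; names are illustrative, and
Slony's log_xid is really of type xxid, simplified here to bigint:
<pre>
-- Created once per connection; reused (and truncated) per SYNC group.
create temp table sl_action_exclude (xid bigint not null)
    on commit preserve rows;

truncate sl_action_exclude;
insert into sl_action_exclude values (1001), (1007), (1042);  -- example xids

-- The log-pulling query then filters against the temp table rather
-- than an explicit, potentially enormous, literal list:
select log_actionseq, log_cmdtype, log_cmddata
  from sl_log_1
 where log_origin = 1
   and log_xid not in (select xid from sl_action_exclude);
</pre>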

Open questions:
* Is this a material savings?
* Is there a problem with repeatedly running TRUNCATE against the temp table, once for each set of SYNC events that is processed?
* Would we want an index on the temp table?

== slonik extensions ==

=== ABORT ===
* Initially [http://lists.slony.info/pipermail/slony1-general/2010-November/011295.html mentioned] by Vivek Khera
* Fail if replication is behind
** event count - "within 3 events is OK"
** time interval - "behind by no more than 15 seconds is OK"
* Other Health Criteria
** per Chris Browne
** Is node there?
** Is replication set there?
** Is subscription active?
** Run SQL, fail if ERROR

=== Implicit WAIT FOR EVENT ===
* [http://lists.slony.info/pipermail/slony1-general/2010-November/011288.html proposal]
* Initially mentioned by Chris Browne
* Should track (per Yehuda) some status information so that if it's waiting for a sustained period of time, this isn't a mystery to the user or to people monitoring things.
* Slonik commands that require inter-node coordination should check to see if all relevant slon processes are running
** notable exceptions include STORE NODE, STORE PATH
** Slonik should warn or error out if the slon isn't running

==== Controlling Implicit WAIT FOR EVENT ====

To support both legacy slonik scripts and new ones, the following features are suggested:
* slonik should have a command line option that deactivates "auto-wait"
* It may be desirable for users to control wait behavior inside scripts, hence we should add two slonik commands:
** activate auto wait
** deactivate auto wait

=== DATE ===
* [http://lists.slony.info/pipermail/slony1-general/2010-November/011328.html suggested] by Stuart Bishop
* Basically, the notion of being able to get timestamps during a Slonik script

=== TRY BLOCK ===
* per a conversation between Chris and Stuart
* Perhaps we should forbid running non-transactional commands within a TRY block
* Non-transactional commands
** WAIT FOR EVENT
** FAILOVER
** EXECUTE SCRIPT
** STORE NODE
* Perhaps TRY should be eliminated?

=== Specified Preamble ===
* per Steve Singer
* -p option draws in preamble automagically
* See [http://lists.slony.info/pipermail/slony1-general/2010-November/011325.html]
=== Bulk Adding Tables ===
* per Steve Singer
* set add table (... tables='public.*', ..);
* set add table (.. tables='billing.(!password)', ...);
* See [http://lists.slony.info/pipermail/slony1-general/2010-November/011325.html]

=== fewer defaults for SET ADD TABLE ===
* per Steve Singer
* Automatically determine an ID
* Automatically determine the origin   
* See [http://lists.slony.info/pipermail/slony1-general/2010-November/011325.html]
=== automate logging ===
* per Chris Browne
* -v option leads to all requests being logged
* might it be valuable to allow specifying syslog parameters?
* Verbose logging to syslog eliminates need for DATE command
* Existing ECHO command allows simulating this manually; DATE would complete that.

=== application_name ===
* per Chris Browne
* On PG 9.0+ (which can be determined by whether the GUC '''application_name''' exists), the application name can be captured
* For slonik, it is likely apropos to set GUC '''application_name''' to '''slonik'''
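
That amounts to no more than the following, run once per connection
after confirming the GUC exists:
<pre>
set application_name = 'slonik';
</pre>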

=== Advisory Locks ===
Provide a method so that applications can detect advisory locks.
Per Stuart Bishop [http://lists.slony.info/pipermail/slony1-general/2010-August/011076.html]

=== New FAILOVER ===

General proposal: FAILOVER should become a much more sophisticated command, allowing:
* Dropping nodes considered dead
* Doing several failovers of sets as one request

Thus, something like:
<pre>
  failover (dead nodes=(1,2,4),
            set id=1, backup node=3,
            set id=2, backup node=5,
            set id=3, backup node=3);
</pre>

* FAILOVER should check various preconditions and abort if any are not satisfied:
** There must be paths to support communications to let the new masters catch up
** Slons must be running for the nodes that are needed to let the masters catch up
** No node may be left hosting a subscription that cannot be kept


=== Interactive Mode ===
* Run commands interactively, line by line

== General Features ==
=== Commit timestamps ===
* per Jan Wieck
* Requires PostgreSQL extension
* Eliminates need for periodic generation of SYNC events
* Simplifies queries for searching for sl_log_* data
* Enables carryover of commit times to subscribers
=== DDL Triggers ===
* per Jan Wieck
* Requires PostgreSQL extension
* Enables automatic generation of DDL_SCRIPT events
* Discussed as... [[DDL_Triggers]]

=== pull lexer from postgres ===
* From TODO, probably per Peter Eisentraut
* Note that this may be a fair bit more complex than the code in src/parsestatements

=== Report on the size of pending SYNC requests ===
* [http://bugs.slony.info/bugzilla/show_bug.cgi?id=166 Bug 166]
* Could be implemented as an extension to test_slony_state.pl or inside of slonik

=== Log Shipping ===
* Afilias doesn't use it, and wouldn't mind shrinking the code base by removing it.
* There appear to be some users of it that would be injured by its removal, so that seems like a no-go.
* An attractive idea: Split slon into two forms:
# The "usual" one which manages nodes that are databases amenable to forwarding, failover, and such
# Log-shipping-only nodes
* Note that there is a lot of log-shipping functionality strewn throughout remote_worker.c.  If split out, this might make the log shipping node code simple, and '''regular slon''' simpler, too.

=== Common Admin Conninfo ===

Several known tools need administrative conninfo (connection information):
# slon processes likely use this connection to access the node that the slon manages
# slonik requires connections potentially to all nodes, and presently uses '''ADMIN CONNINFO''' preamble material to configure this.
# Administrative tools such as the configuration analysis tool, '''test_slony_state.pl'''

It is suggested that we create a common "properties file" to capture this in a form that might be reused by various common tools.
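
No format has been settled; purely as a strawman, such a file might
look like this (keys and values are illustrative only):
<pre>
# Hypothetical cluster properties file
cluster = mycluster
node.1.conninfo = host=db1 dbname=app user=slony
node.2.conninfo = host=db2 dbname=app user=slony
</pre>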
-- 
let name="cbbrowne" and tld="afilias.info" in String.concat "@" [name;tld];;
Christopher Browne
"Bother,"  said Pooh,  "Eeyore, ready  two photon  torpedoes  and lock
phasers on the Heffalump, Piglet, meet me in transporter room three"

