Thu Jun 13 09:48:40 PDT 2019
- Previous message: [Slony1-general] slony backup during pg_dump
- Next message: [Slony1-general] slony backup during pg_dump
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Thu, Jun 13, 2019 at 7:59 AM Christopher Browne <cbbrowne at afilias.info> wrote:

> On Wed, Jun 12, 2019 at 11:58 PM Tory M Blue <tmblue at gmail.com> wrote:
>
>> First: postgres 9.x
>> CentOS 7.2
>> Replication flows:
>> primary -> secondary -> 3 query slaves
>>
>> Okay, here it is.
>>
>> My dump started:
>>
>> Start export of schema clsdb for user postgres at Wed Jun 12 19:00:48 PDT 2019
>>
>> So if I'm reading this correctly, it's 20:45 but something is showing
>> that the last _received_event_ts was at 19:38, over an hour ago?
>> sl_lag_time of 1 hour?
>>
>> Primary:
>>
>> select * from _cls.sl_status limit 3;
>>
>>  st_origin | st_received | st_last_event |       st_last_event_ts        | st_last_received |      st_last_received_ts      |   st_last_received_event_ts   | st_lag_num_events |   st_lag_time
>> -----------+-------------+---------------+-------------------------------+------------------+-------------------------------+-------------------------------+-------------------+-----------------
>>         11 |          14 |    5036030946 | 2019-06-12 20:45:29.465785-07 |       5036030138 | 2019-06-12 19:38:11.065976-07 | 2019-06-12 19:38:03.311267-07 |               808 | 01:07:33.039694
>>         11 |          15 |    5036030946 | 2019-06-12 20:45:29.465785-07 |       5036030138 | 2019-06-12 19:38:12.202539-07 | 2019-06-12 19:38:03.311267-07 |               808 | 01:07:33.039694
>>         11 |          12 |    5036030946 | 2019-06-12 20:45:29.465785-07 |       5036030138 | 2019-06-12 19:38:08.342637-07 | 2019-06-12 19:38:03.311267-07 |               808 | 01:07:33.039694
>> (3 rows)
>>
>> I have 2.4 million rows in the sl_log_1 table, which looks like it
>> started to hockey-stick right around 19:30.
>
> You're reading it pretty well right.
>
> It is mighty curious that it waited 37-ish minutes for replication to
> fall off.
>
> It would be interesting (and it's too late for today) to capture the lag
> between those times. Thus, something like...
>
> #!/bin/sh
> PGDATABASE=postgresql://cbbrowne@localhost:7096/shared-sample
> psql -d $PGDATABASE -c "create table if not exists lag_by_time (as_at timestamp default now(), replication_lag interval);"
> for i in $(seq 180); do
>     psql -d $PGDATABASE -c "insert into lag_by_time (replication_lag) select min(st_lag_time) from _cls.sl_status;"
>     sleep 60
> done
>
> and launch that at about 19:00. At whatever point replication_lag heads
> towards the Stanley Cup, that's the time at which Something Happened.
>
> It's probably interesting to look at pg_stat_activity to see what
> long-running queries there might be.
>
> I'm somewhat suspicious that, despite what you were intending, perhaps
> the pg_dump is nevertheless capturing the schema _cls; it would be
> "sorta normal" if, at the point at which pg_dump starts dumping _cls
> material, replication starts lagging.
>
> In particular, anything in pg_stat_activity that started at 19:38 ought
> to be looked at with particular suspicion as a candidate for being the
> cause.
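The pg_stat_activity check Chris suggests above could be run in the same
psql style as his script. A minimal sketch, assuming PostgreSQL 9.2 or
later (where pg_stat_activity exposes "pid", "state", and "query"; on
9.0/9.1 the columns are "procpid" and "current_query" instead):

    # Sketch only, not from the thread: list non-idle backends with their
    # runtimes, to spot whatever started around 19:38.
    # Assumes 9.2+ pg_stat_activity column names (pid, state, query).
    psql -d $PGDATABASE -c "
      select pid, usename, state, query_start,
             now() - query_start as runtime,
             left(query, 60) as query_head
      from pg_stat_activity
      where state <> 'idle'
      order by query_start;"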
>> On the secondary (which replicates to my query DBs, where the dump is
>> run from, not from the primary), it includes these lines:
>>
>> Secondary:
>>
>> select * from _cls.sl_status limit 3;
>>
>>  st_origin | st_received | st_last_event |       st_last_event_ts        | st_last_received |      st_last_received_ts      |   st_last_received_event_ts   | st_lag_num_events |   st_lag_time
>> -----------+-------------+---------------+-------------------------------+------------------+-------------------------------+-------------------------------+-------------------+-----------------
>>         12 |          11 |    5008650405 | 2019-06-12 20:45:35.365757-07 |       5008650000 | 2019-06-12 19:38:03.815508-07 | 2019-06-12 19:38:03.338486-07 |               405 | 01:07:35.106539
>>         12 |          13 |    5008650405 | 2019-06-12 20:45:35.365757-07 |       5008650403 | 2019-06-12 20:45:17.701455-07 | 2019-06-12 20:45:15.355233-07 |                 2 | 00:00:23.089792
>>         12 |          14 |    5008650405 | 2019-06-12 20:45:35.365757-07 |       5008650404 | 2019-06-12 20:45:26.642094-07 | 2019-06-12 20:45:25.360167-07 |                 1 | 00:00:13.084858
>> (3 rows)
>
> May I presume that the time you ran this query was around 20:45 or
> 20:46? That looks unsurprising/consistent with the first query; events
> going through node 11, which appears to be the master, are held back to
> 19:38. And the few events coming through from other nodes are lightly
> laggy, consistent with those nodes not generating SYNC events very
> often.

You assume correctly. I ran the queries back to back, and yes, node 11 is
the primary/master and node 12 is the standby, with nodes 13-15 just
query nodes holding only a subset of the data; they are not available for
failover/switchover, only nodes 11-12 are.

I saw nothing in the locks when I looked.

We did notice that in our second dump, while we are using a -n <schema>,
we are not including a -N _slonschema. Now, that should work! But we are
going to add the -N _slonschema as well as the -n <schema>, just to see.

Will poke it some more this eve!

Thanks again,
Tory
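For reference, the pg_dump combination Tory describes would look roughly
like the sketch below, where "clsdb" (application schema) and "_cls"
(Slony-I cluster schema) are the names mentioned in the thread, and the
database name and output file are made-up placeholders:

    # Hypothetical invocation; only the schema names come from the thread.
    # -n restricts the dump to the named schema; -N explicitly excludes one.
    pg_dump -n clsdb -N _cls -f /tmp/clsdb_schema.dump mydatabase

Per the pg_dump documentation, -N only excludes schemas that would
otherwise be selected, so with -n clsdb alone the _cls schema is already
left out of the dump; adding -N _cls is harmless but redundant, which
matches the "that should work" expectation above.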