Glyn Astill glynastill at yahoo.co.uk
Fri Oct 3 05:35:49 PDT 2014
Hi All,

I'm looking at a slony setup using 2.1.4, with 4 nodes in the following configuration:

    Node 1 --> Node 2
    Node 1 --> Node 3 --> Node 4

Node 1 is the origin of all sets, and node 3 is a provider of all sets to node 4. What I'm looking to do is fail over to node 2 when both nodes 1 and 3 have gone down.
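
For illustration only, the topology described above can be modelled as a small provider map to see which surviving nodes lose their direct provider once nodes 1 and 3 are down. This is a sketch in Python and not part of Slony; the names `PROVIDERS` and `orphaned_nodes` are hypothetical.

```python
# Hypothetical model of the subscription tree described above:
# each receiver maps to the node it currently receives data from.
PROVIDERS = {2: 1, 3: 1, 4: 3}


def orphaned_nodes(providers, failed):
    """Return the surviving receivers whose direct provider has failed."""
    return sorted(
        node
        for node, provider in providers.items()
        if node not in failed and provider in failed
    )


if __name__ == "__main__":
    # With nodes 1 and 3 down, both survivors have lost their provider.
    print(orphaned_nodes(PROVIDERS, failed={1, 3}))
```

In this model both node 2 and node 4 are left without a provider, which is why the script further down first tries to resubscribe node 4 to node 2 before failing over.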

Is this possible? 

I'm seeing the same issues in both a live environment, which I've not yet had a chance to move to 2.2, and my test environment. For the test environment the slonik script is:

    CLUSTER NAME = test_replication;

    NODE 1 ADMIN CONNINFO = 'dbname=TEST host=localhost port=5432 user=slony';
    NODE 2 ADMIN CONNINFO = 'dbname=TEST host=localhost port=5433 user=slony';
    NODE 3 ADMIN CONNINFO = 'dbname=TEST host=localhost port=5434 user=slony';
    NODE 4 ADMIN CONNINFO = 'dbname=TEST host=localhost port=5435 user=slony';

    SUBSCRIBE SET (ID = 1, PROVIDER = 2, RECEIVER = 4, FORWARD = YES);
    WAIT FOR EVENT (ORIGIN = 2, CONFIRMED = 4, WAIT ON = 2);
    SUBSCRIBE SET (ID = 2, PROVIDER = 2, RECEIVER = 4, FORWARD = YES);
    WAIT FOR EVENT (ORIGIN = 2, CONFIRMED = 4, WAIT ON = 2);
    SUBSCRIBE SET (ID = 3, PROVIDER = 2, RECEIVER = 4, FORWARD = YES);
    WAIT FOR EVENT (ORIGIN = 2, CONFIRMED = 4, WAIT ON = 2);

    DROP NODE (ID = 3, EVENT NODE = 2);

    FAILOVER (
        ID = 1, BACKUP NODE = 2
    );

    DROP NODE (ID = 1, EVENT NODE = 2);

slonik is failing at the first subscribe set line as follows:

    $ slonik test.scr
    test.scr:8: could not connect to server: Connection refused
        Is the server running on host "localhost" (127.0.0.1) and accepting
        TCP/IP connections on port 5432?
    test.scr:8: could not connect to server: Connection refused
        Is the server running on host "localhost" (127.0.0.1) and accepting
        TCP/IP connections on port 5434?
    test.scr:8: could not connect to server: Connection refused
        Is the server running on host "localhost" (127.0.0.1) and accepting
        TCP/IP connections on port 5432?
    Segmentation fault

I get the same behaviour until I bring node 1 back up; then the script almost succeeds, except for an error stating that a record in sl_event already exists:

    $ slonik ~/test.scr
    ~/test.scr:8: could not connect to server: Connection refused
        Is the server running on host "localhost" (127.0.0.1) and accepting
        TCP/IP connections on port 5434?
    waiting for events  (1,5000000172) only at (1,5000000162) to be confirmed on node 4
    executing failedNode() on 2
    ~/test.scr:17: NOTICE:  failedNode: set 1 has no other direct receivers - move now
    ~/test.scr:17: NOTICE:  failedNode: set 2 has no other direct receivers - move now
    ~/test.scr:17: NOTICE:  failedNode: set 3 has no other direct receivers - move now
    ~/test.scr:17: NOTICE:  failedNode: set 1 has other direct receivers - change providers only
    ~/test.scr:17: NOTICE:  failedNode: set 2 has other direct receivers - change providers only
    ~/test.scr:17: NOTICE:  failedNode: set 3 has other direct receivers - change providers only
    NOTICE: executing "_test_replication".failedNode2 on node 2
    ~/test.scr:17: waiting for event (1,5000000175).  node 4 only on event 5000000162
    NOTICE: executing "_test_replication".failedNode2 on node 2
    ~/test.scr:17: PGRES_FATAL_ERROR lock table "_test_replication".sl_event_lock, "_test_replication".sl_config_lock;select "_test_replication".failedNode2(1,2,2,'5000000174','5000000176');  - ERROR:  duplicate key value violates unique constraint "sl_event-pkey"
    DETAIL:  Key (ev_origin, ev_seqno)=(1, 5000000176) already exists.
    CONTEXT:  SQL statement "insert into "_test_replication".sl_event
                (ev_origin, ev_seqno, ev_timestamp,
                ev_snapshot,
                ev_type, ev_data1, ev_data2, ev_data3)
                values
                (p_failed_node, p_ev_seqfake, CURRENT_TIMESTAMP,
                v_row.ev_snapshot,
                'FAILOVER_SET', p_failed_node::text, p_backup_node::text,
                p_set_id::text)"
    PL/pgSQL function _test_replication.failednode2(integer,integer,integer,bigint,bigint) line 14 at SQL statement
    NOTICE: executing "_test_replication".failedNode2 on node 2
    ~/test.scr:17: waiting for event (1,5000000177).  node 4 only on event 5000000175
    ~/test.scr:21: begin transaction; - 

After this, sl_set on node 4 still has node 1 as the origin for one of the sets (is this possibly because I'm not waiting properly, or waiting on the wrong node?):

    TEST=# table _test_replication.sl_set;
     set_id | set_origin | set_locked |    set_comment
    --------+------------+------------+-------------------
          2 |          1 |            | Replication set 2
          1 |          2 |            | Replication set 1
          3 |          2 |            | Replication set 3
    (3 rows)
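
As a rough aid for checking a listing like the one above, the following Python sketch scans psql output from sl_set and reports which sets still name a given node as their origin. The helper name `sets_with_origin` is hypothetical, and the sample text is simply the listing above pasted in.

```python
# Sample psql output, copied from the sl_set listing above.
SL_SET_OUTPUT = """\
 set_id | set_origin | set_locked |    set_comment
--------+------------+------------+-------------------
      2 |          1 |            | Replication set 2
      1 |          2 |            | Replication set 1
      3 |          2 |            | Replication set 3
(3 rows)
"""


def sets_with_origin(psql_output, origin):
    """Return the set_id values whose set_origin equals `origin`."""
    set_ids = []
    for line in psql_output.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Data rows have numeric set_id and set_origin columns;
        # the header, separator and row-count lines are skipped.
        if len(cells) >= 2 and cells[0].isdigit() and cells[1].isdigit():
            if int(cells[1]) == origin:
                set_ids.append(int(cells[0]))
    return set_ids


if __name__ == "__main__":
    # Sets still recorded with the failed node 1 as origin.
    print(sets_with_origin(SL_SET_OUTPUT, origin=1))
```

For the listing above this reports set 2 as the only set still recorded with node 1 as its origin.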


I had attached the slon logs, but my mail to the list bounced. If they would provide any better insight, I can send them.


Any help would be greatly appreciated.

Thanks
Glyn