Tue Jan 11 20:11:21 PST 2005
James Black wrote:
> Hello, all,
>
> We're having a very similar problem; our replication is happening in a
> timely fashion, but the load on the machine providing the set is
> uncomfortably high. A 'top -c' reveals the slony
> process more or less pegging one of the CPUs, running the "FETCH 100
> ..." query.
>
> We have a lot of rows in sl_log_1, on the order of 400,000. We don't
> have anything sticking around in sl_confirm, or extra nodes that
> aren't being serviced, or anything else particularly suspicious.
I'm still suspicious of sl_confirm; try the query below; I'll bet it'll
show that something is behind.
> The cleanupThread vacuums vary in length from .3 to 1.2 seconds;
> however, our "delay for first rows" times have grown from 1.7 seconds
> at daemon startup to around 6.1 seconds. The slon daemons have been
> running for about 20 hours.
>
> I wonder if we're just running into a limitation of the system: that
> we're seeing traffic high enough that the standard 10-minute cleanup
> latency is too long. Is this a modifiable value? Is that even a
> fruitful avenue for investigation? Our slony daemons are set with
> -g24 and -s1000; are those reasonable values? Is there something I'm
> missing?
The values you're describing are all relatively reasonable, and I'd
think it likely that they are "red herrings."
Let me be more specific about the sl_confirm thing.
Try the following query:
slondb=# select con_origin, con_received, max(con_seqno),
max(con_timestamp), now() - max(con_timestamp) as age from sl_confirm
group by con_origin, con_received order by age;
 con_origin | con_received |   max   |            max             |        age
------------+--------------+---------+----------------------------+------------------
          1 |           78 | 4497949 | 2005-01-11 19:56:10.254644 | -00:00:54.476242
        501 |           78 |   25767 | 2005-01-11 19:55:51.760295 | -00:00:35.981893
          3 |           78 |   21059 | 2005-01-11 19:55:39.394631 | -00:00:23.616229
          4 |           78 |   19826 | 2005-01-11 19:55:31.051068 | -00:00:15.272666
          1 |            4 | 4497949 | 2005-01-11 19:55:15.373074 |  00:00:00.405328
          1 |            3 | 4497949 | 2005-01-11 19:55:15.367026 |  00:00:00.411376
          1 |          501 | 4497948 | 2005-01-11 19:55:14.397912 |   00:00:01.38049
        501 |            3 |   25767 | 2005-01-11 19:54:56.871924 |  00:00:18.906478
        501 |            4 |   25767 | 2005-01-11 19:54:56.865216 |  00:00:18.913186
        501 |            1 |   25767 | 2005-01-11 19:54:56.81173  |  00:00:18.966672
          3 |          501 |   21059 | 2005-01-11 19:54:44.558    |  00:00:31.220402
          3 |            4 |   21059 | 2005-01-11 19:54:44.503998 |  00:00:31.274404
          3 |            1 |   21059 | 2005-01-11 19:54:44.450835 |  00:00:31.327567
          4 |          501 |   19826 | 2005-01-11 19:54:36.458941 |  00:00:39.319461
          4 |            3 |   19826 | 2005-01-11 19:54:36.439382 |   00:00:39.33902
          4 |            1 |   19826 | 2005-01-11 19:54:36.262963 |  00:00:39.515439
         78 |          501 |    7229 | 2005-01-11 19:54:22.726457 |  00:00:53.051945
         78 |            4 |    7229 | 2005-01-11 19:54:22.663651 |  00:00:53.114751
         78 |            3 |    7229 | 2005-01-11 19:54:22.65302  |  00:00:53.125382
         78 |            1 |    7229 | 2005-01-11 19:54:22.645781 |  00:00:53.132621
(20 rows)
On that system, the node that is "least recently confirmed" is node 78,
where some of the confirmations were as much as a minute old.
The system is essentially free to delete all sl_log_1 entries that are
more than 10:00 + 00:53, or about 11 minutes, old.
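If you want to boil that down to the single slowest receiver per origin,
something along these lines ought to do it (just a sketch; run it in the
same database so sl_confirm resolves to your cluster's schema):

  select con_origin, now() - min(last_confirm) as worst_lag
    from (select con_origin, con_received,
                 max(con_timestamp) as last_confirm
            from sl_confirm
           group by con_origin, con_received) as per_receiver
   group by con_origin
   order by worst_lag desc;

The origin at the top is the one whose sl_log_1 entries have to be
retained the longest.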
And analyzing sl_log_1 shows that it's being cleaned out in a pretty
timely fashion.
INFO: analyzing "_slondb.sl_log_1"
INFO: "sl_log_1": 243 pages, 3000 rows sampled, 8748 estimated total rows
ANALYZE
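That's what an "analyze verbose" reports; you can run the same check
against your own origin node (substituting your cluster's schema for
_slondb):

  slondb=# analyze verbose _slondb.sl_log_1;

If the estimated row count stays up around 400,000 across several
cleanup cycles, the old entries aren't becoming deletable.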
All of the things you describe about your system are symptoms of
"retaining sl_log_1 entries." If I shut off node 78 for a few hours,
I'd see exactly what you describe.
If no node has been shut off, I'd almost be ready to bet lunch that the
problem is that you don't have a sufficiently full sl_listen "network."
If sl_log_1 isn't getting cleaned out on the master node, that means
that one or more nodes aren't successfully passing back confirmations.
Quoting from: http://linuxdatabases.info/info/listenpaths.html
Three queries to test that out at least in part:
You may find nodes that haven't got enough listen paths via the query:

  select li_origin, count(*)
    from sl_listen
   group by li_origin
  having count(*) < (select count(*) - 1 from sl_node);

Search for specific missing paths:

  select *
    from (select n1.no_id as origin, n2.no_id as receiver
            from sl_node n1, sl_node n2
           where n1.no_id != n2.no_id) as foo
   where not exists (select 1 from sl_listen
                      where li_origin = origin
                        and li_receiver = receiver);

Each subscriber node must have a direct listen path; find missing ones
as follows:

  select *
    from sl_subscribe
   where not exists (select 1 from sl_listen
                      where li_origin = sub_provider
                        and li_receiver = sub_receiver
                        and li_provider = sub_provider);
If any of these queries returns any rows on any node, then you've got
some sort of problem with sl_listen that needs to be fixed.
Supposing this turns out to be the case, and you add a STORE LISTEN or
two, you can expect to immediately see a whole _horde_ of entries go
into sl_confirm. It may take a couple of cleanup thread cycles for
everything to settle down.
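Once the listen paths are in place, re-running the sl_confirm query from
above should show every (origin, received) pair staying recent, and the
backlog on the origin should start draining. A quick number to watch
(again substituting your cluster's schema for _slondb) is the raw row
count:

  select count(*) from _slondb.sl_log_1;

That count should drop back down within a couple of cleanup intervals
once the stale entries become deletable.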