OpenSM (again)
Roland Fehrenbacher
2005-04-11 16:28:20 UTC
Hi,

I got gen2 opensm running fine now (there was a problem with a wrong
include file), and managed to get IP running on a network of
currently 40 machines (final size will be 144). Performance is pretty
impressive (initial tests with a simple netpipe): I got a latency of
18 microseconds and a maximum throughput of approx. 400 MB/s at a packet
size of approx. 1 MB, which then levels off at about 340 MB/s for larger
packets.

One problem and two questions:

Problem: When I reboot all 40 nodes (apart from the one opensm is
running on), the network is non-functional (no pings go through, even
though ports show status "Active") for quite a while (more than 10
minutes) after all the nodes have come up. It then recovers without
intervention. Is this normal? Single node reboots don't affect the
network operation. osm Log file is appended.

Question 1: Can I run opensm in a master slave configuration? I noticed
that there is a priority commandline option, but am not sure how to
apply this.

Question 2: I plan to run the gen1/Mellanox IBGD drivers on the
compute nodes (need fast MPI), and gen2 on the control/storage nodes
(need only IP) with gen2 opensm running on the control nodes. Is there
any reason why this should not work reliably?

Roland
Hal Rosenstock
2005-04-11 20:23:17 UTC
Post by Roland Fehrenbacher
Hi,
I got gen2 opensm running fine now (there was a problem with a wrong
include file), and managed to get IP running on a network of
currently 40 machines (final size will be 144). Performance is pretty
impressive (initial tests with a simple netpipe): I got a latency of
18 microseconds and a maximum throughput of approx. 400 MB/s at a packet
size of approx. 1 MB, which then levels off at about 340 MB/s for larger
packets.
That's all good to hear :-)
Post by Roland Fehrenbacher
Problem: When I reboot all the 40 nodes (apart from the one the opensm
is running), the network is non-functional (no pings go through, even
though ports show status "Active") for quite a while (more than 10
minutes) after all the nodes have come up. It then recovers without
intervention. Is this normal? Single node reboots don't affect the
network operation. osm Log file is appended.
Can you describe your topology ? Is it the following: the SM is
connected to a switch/or switches with the 40 nodes connected off these
switches ?

I'll respond to the log (and these questions) in a separate email
response.
Post by Roland Fehrenbacher
Question 1: Can I run opensm in a master slave configuration?
Yes. Others are doing this.
Post by Roland Fehrenbacher
I noticed
that there is a priority commandline option, but am not sure how to
apply this.
SM election goes by highest priority first, with the lowest GUID breaking
ties. So if you don't care which SM is the master, then you don't need to
do anything. If you want a specific order (and it is not in GUID order),
then you need to specify priority.
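
The election rule above can be sketched in a few lines of Python. This is only an illustration of "highest priority wins, lowest GUID breaks ties", not OpenSM's actual code; the SMCandidate type, the elect_master helper, the server names and the GUID values are all assumptions made up for the example.

```python
from typing import NamedTuple


class SMCandidate(NamedTuple):
    name: str       # hypothetical label, for illustration only
    priority: int   # higher value is preferred (0 and 15 are used in the example)
    port_guid: int  # lower value is preferred when priorities are equal


def elect_master(candidates):
    """Pick the candidate with the highest priority; on a tie, the lowest GUID."""
    # Negating the priority lets min() prefer high priority, then low GUID.
    return min(candidates, key=lambda c: (-c.priority, c.port_guid))


if __name__ == "__main__":
    sms = [
        SMCandidate("server-a", 0, 0x0002C902004013C2),
        SMCandidate("server-b", 15, 0x0002C9020040133A),
    ]
    print(elect_master(sms).name)
```

With equal priorities the comparison falls through to the GUID, which is why no priority setting is needed when the GUID order is already acceptable.
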
Post by Roland Fehrenbacher
Question 2: I plan to run the gen1/Mellanox IBGD drivers on the
compute nodes (need fast MPI), and gen2 on the control/storage nodes
(need only IP) with gen2 opensm running on the control nodes. Is there
any reason why this should not work reliably?
So basically this appears to be an interop question:
1. Will gen2 OpenSM support IBGD nodes ?
2. Will gen2 IPoIB interoperate with IBGD IPoIB ?
I haven't done this but know of no reasons this should not work. Perhaps
others can add to this.

-- Hal
________________________________________________________________________
Hal Rosenstock
2005-04-11 20:52:26 UTC
Post by Hal Rosenstock
Post by Roland Fehrenbacher
Problem: When I reboot all the 40 nodes (apart from the one the opensm
is running), the network is non-functional (no pings go through, even
though ports show status "Active") for quite a while (more than 10
minutes) after all the nodes have come up. It then recovers without
intervention. Is this normal? Single node reboots don't affect the
network operation. osm Log file is appended.
Can you describe your topology ? Is it the following: the SM is
connected to a switch/or switches with the 40 nodes connected off these
switches ?
What is the mix of those 40 nodes in terms of OpenIB (gen2) and gen1 ?
Is there no difference in the behavior of gen2 and gen1 in terms of the
above symptoms ?

-- Hal
Roland Fehrenbacher
2005-04-12 16:47:51 UTC
Post by Roland Fehrenbacher
Problem: When I reboot all the 40 nodes (apart from the one the opensm
is running), the network is non-functional (no pings go through, even
though ports show status "Active") for quite a while (more than 10
minutes) after all the nodes have come up. It then recovers without
intervention. Is this normal? Single node reboots don't affect the
network operation. osm Log file is appended.
Can you describe your topology ? Is it the following: the SM is
connected to a switch/or switches with the 40 nodes connected
off these switches ?
Hal> What is the mix of those 40 nodes in terms of OpenIB (gen2)
Hal> and gen1 ? Is there no difference in the behavior of gen2
Hal> and gen1 in terms of the above symptoms ?

So far all nodes are gen2.

Roland
Roland Fehrenbacher
2005-04-12 16:46:59 UTC
Post by Roland Fehrenbacher
Problem: When I reboot all the 40 nodes (apart from the one the
opensm is running), the network is non-functional (no pings go
through, even though ports show status "Active") for quite a
while (more than 10 minutes) after all the nodes have come
up. It then recovers without intervention. Is this normal?
Single node reboots don't affect the network operation. osm Log
file is appended.
Hal> Can you describe your topology ? Is it the following: the SM
Hal> is connected to a switch/or switches with the 40 nodes
Hal> connected off these switches ?

Yes, the 40 nodes are connected to a single 144 port switch.

Hal> I'll respond to the log (and these questions) in a separate
Hal> email response.
Post by Roland Fehrenbacher
Question 1: Can I run opensm in a master slave configuration?
Hal> Yes. Others are doing this.
Post by Roland Fehrenbacher
I noticed that there is a priority commandline option, but am
not sure how to apply this.
Hal> SM election occurs per high priority low GUID. So if you
Hal> don't care which SM is the master then you don't need to do
Hal> anything. If you want a specific order (and it is not in GUID
Hal> order) then you need to specify priority.

Ok. I tried this, specifying priority 0 on one server and priority 15
on another one. I assume the priority 15 one will be the master.
If I first start the priority 0 opensm, and then the priority 15 one,
things look normal. Log excerpts:

priority 0 server

Apr 12 18:41:06 [4000] -> OpenSM Rev:openib-1.0.0
Apr 12 18:41:06 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
Apr 12 18:41:06 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:41:06 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:41:06 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c2.
Apr 12 18:41:06 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c2.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0001 TID:0x0000000000000011
Apr 12 18:41:06 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0001 GID:0xfe80000000000000,0x0002c902004013c2
Apr 12 18:41:06 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0002 TID:0x000000000000000d
Apr 12 18:41:06 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0002 GID:0xfe80000000000000,0x0002c9020040133a
Apr 12 18:42:25 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0002 TID:0x000000000000000e
Apr 12 18:42:25 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0002 GID:0xfe80000000000000,0x0002c9020040133a

priority 15 server

Apr 12 18:42:25 [4000] -> OpenSM Rev:openib-1.0.0
Apr 12 18:42:25 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
Apr 12 18:42:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:42:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:42:25 [4000] -> osm_vendor_bind: Binding to port 0x2c9020040133a.
Apr 12 18:42:25 [4000] -> osm_vendor_bind: Binding to port 0x2c9020040133a.
Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.

When I kill the priority 15 server, however, the priority 0 server runs
amok with continuous log messages like:

Apr 12 18:44:28 [2400A] -> umad_receiver: send completed with error(method=1 attr=20) -- dropping.
Apr 12 18:44:28 [2400A] -> umad_receiver: send completed with error(method=1 attr=20) -- dropping.

I assume that the handover to the priority 0 opensm hasn't worked
then. For additional information: This test was done on a
point-to-point connection between 2 adapters.

Roland
Hal Rosenstock
2005-04-12 17:00:17 UTC
Post by Roland Fehrenbacher
Hal> SM election occurs per high priority low GUID. So if you
Hal> don't care which SM is the master then you don't need to do
Hal> anything. If you want a specific order (and it is not in GUID
Hal> order) then you need to specify priority.
Ok. I tried this, specifying priority 0 on one server, and priority 15
on another one. I assume priority 15, will be the master.
If I first start the priority 0 opensm, and then the priority 15 one,
things look normal: Log excerpts
priority 0 server
Apr 12 18:41:06 [4000] -> OpenSM Rev:openib-1.0.0
Apr 12 18:41:06 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
Apr 12 18:41:06 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:41:06 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:41:06 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c2.
Apr 12 18:41:06 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c2.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0001 TID:0x0000000000000011
Apr 12 18:41:06 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0001 GID:0xfe80000000000000,0x0002c902004013c2
Apr 12 18:41:06 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0002 TID:0x000000000000000d
Apr 12 18:41:06 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0002 GID:0xfe80000000000000,0x0002c9020040133a
Apr 12 18:42:25 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0002 TID:0x000000000000000e
Apr 12 18:42:25 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0002 GID:0xfe80000000000000,0x0002c9020040133a
priority 15 server
Apr 12 18:42:25 [4000] -> OpenSM Rev:openib-1.0.0
Apr 12 18:42:25 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
Apr 12 18:42:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:42:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:42:25 [4000] -> osm_vendor_bind: Binding to port 0x2c9020040133a.
Apr 12 18:42:25 [4000] -> osm_vendor_bind: Binding to port 0x2c9020040133a.
Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
When I kill the priority 15 server however, the priority 0 server runs
amok with continuous log messages like:
Apr 12 18:44:28 [2400A] -> umad_receiver: send completed with error(method=1 attr=20) -- dropping.
Apr 12 18:44:28 [2400A] -> umad_receiver: send completed with error(method=1 attr=20) -- dropping.
Attribute 0x20 is SMInfo. This is just the SubnGet(SMInfo) from the
priority 0 server failing (no matching SubnGetResp received) which is
"normal" if you killed the priority 15 server.
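
For reading such log lines, a small Python sketch like the following can translate the numeric fields into names. It assumes the log prints both method and attr as hexadecimal without a 0x prefix (the attr=20 / 0x20 reading above suggests this for attr; for method it is an assumption). The tables hold only the method and attributes named in this thread, and decode_send_error is a hypothetical helper, not part of OpenSM.

```python
import re

# Subset of subnet management method and attribute IDs, limited to the
# ones identified in this thread; anything else is shown as raw hex.
SM_METHODS = {0x01: "SubnGet"}
SM_ATTRIBUTES = {0x11: "NodeInfo", 0x16: "P_KeyTable", 0x20: "SMInfo"}

_SEND_ERROR = re.compile(
    r"send completed with error\(method=([0-9A-Fa-f]+) attr=([0-9A-Fa-f]+)\)"
)


def decode_send_error(line):
    """Return e.g. 'SubnGet(SMInfo)' for a umad_receiver send-error line, else None."""
    match = _SEND_ERROR.search(line)
    if match is None:
        return None
    # Assumption: both fields are hexadecimal digits without a prefix.
    method = int(match.group(1), 16)
    attr = int(match.group(2), 16)
    method_name = SM_METHODS.get(method, hex(method))
    attr_name = SM_ATTRIBUTES.get(attr, hex(attr))
    return f"{method_name}({attr_name})"


if __name__ == "__main__":
    sample = ("Apr 12 18:44:28 [2400A] -> umad_receiver: send completed "
              "with error(method=1 attr=20) -- dropping.")
    print(decode_send_error(sample))
```
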

Do the messages ever subside ?
Post by Roland Fehrenbacher
I assume that the handover to the priority 0 opensm hasn't worked
then.
This isn't really handover but that is another matter.
You should be able to use the sminfo diag to see whether this SM has
assumed the MASTER role.

-- Hal
Eitan Zahavi
2005-04-12 05:07:11 UTC
Hi Roland,

If the case is reproducible, please run "opensm -V" and send us the osm.log

Thanks

Eitan Zahavi
-----Original Message-----
Sent: Monday, April 11, 2005 7:28 PM
Subject: [openib-general] OpenSM (again)
Hi,
I got gen2 opensm running fine now (there was a problem with a wrong
include file), and managed to get IP running on a network of
currently 40 machines (final size will be 144). Performance is pretty
impressive (initial tests with a simple netpipe): I got a latency of
18 microseconds and a maximum throughput of approx. 400 MB/s at a packet
size of approx. 1 MB, which then levels off at about 340 MB/s for larger
packets.
Problem: When I reboot all the 40 nodes (apart from the one the opensm
is running), the network is non-functional (no pings go through, even
though ports show status "Active") for quite a while (more than 10
minutes) after all the nodes have come up. It then recovers without
intervention. Is this normal? Single node reboots don't affect the
network operation. osm Log file is appended.
Question 1: Can I run opensm in a master slave configuration? I noticed
that there is a priority commandline option, but am not sure how to
apply this.
Question 2: I plan to run the gen1/Mellanox IBGD drivers on the
compute nodes (need fast MPI), and gen2 on the control/storage nodes
(need only IP) with gen2 opensm running on the control nodes. Is there
any reason why this should not work reliably?
Roland
Tziporet Koren
2005-04-12 05:36:52 UTC
Post by Roland Fehrenbacher
Question 2: I plan to run the gen1/Mellanox IBGD drivers on the
compute nodes (need fast MPI), and gen2 on the control/storage nodes
(need only IP) with gen2 opensm running on the control nodes. Is there
any reason why this should not work reliably?
We tried it at Mellanox once and it did work properly (we used OpenSM from
gen1 and IPoIB from gen1 & gen2 on 2 different machines). So although it's
not QAed, I see no reason why it would not work for you.

Tziporet
Hal Rosenstock
2005-04-12 09:56:29 UTC
Post by Roland Fehrenbacher
Problem: When I reboot all the 40 nodes (apart from the one the opensm
is running), the network is non-functional (no pings go through, even
though ports show status "Active") for quite a while (more than 10
minutes) after all the nodes have come up. It then recovers without
intervention. Is this normal? Single node reboots don't affect the
network operation. osm Log file is appended.
______________________________________________________________________
Apr 10 15:05:55 [4000] -> OpenSM Rev:openib-1.0.0
Apr 10 15:05:55 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
Apr 10 15:05:55 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 10 15:05:55 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 10 15:05:55 [4000] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x2c902004013c1) as the default port.
Apr 10 15:05:55 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c1.
Apr 10 15:05:55 [4000] -> osm_vendor_bind: Unable to register class 129 version 1.
Apr 10 15:05:55 [4000] -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind() failed.
Apr 10 15:05:55 [4000] -> osm_sm_bind: ERR 2E10: SM MAD Controller bind() failed (IB_ERROR).
Apr 10 15:06:58 [4000] -> OpenSM Rev:openib-1.0.0
Apr 10 15:06:58 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
Apr 10 15:06:58 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 10 15:06:58 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 10 15:06:58 [4000] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x2c902004013c1) as the default port.
Apr 10 15:06:58 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c1.
Apr 10 15:06:58 [4000] -> osm_vendor_bind: Unable to register class 129 version 1.
Apr 10 15:06:58 [4000] -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind() failed.
Apr 10 15:06:58 [4000] -> osm_sm_bind: ERR 2E10: SM MAD Controller bind() failed (IB_ERROR).
Apr 10 15:07:44 [4000] -> OpenSM Rev:openib-1.0.0
Apr 10 15:07:44 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
Apr 10 15:07:44 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 10 15:07:44 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 10 15:07:44 [4000] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x2c902004013c1) as the default port.
Apr 10 15:07:44 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c1.
Apr 10 15:07:44 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c1.
Apr 10 15:07:44 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0011 TID:0x000000000000000a
Apr 10 15:07:44 [2400A] -> umad_receiver: send completed with error(method=1 attr=11) -- dropping.
This is a SubnGet of NodeInfo which is timing out.
Post by Roland Fehrenbacher
Apr 10 15:07:45 [2400A] -> umad_receiver: send completed with error(method=1 attr=11) -- dropping.
Apr 10 15:07:45 [2400A] -> umad_receiver: send completed with error(method=1 attr=11) -- dropping.
Apr 10 15:07:45 [2400A] -> umad_receiver: send completed with error(method=1 attr=11) -- dropping.
Apr 10 15:07:45 [2400A] -> umad_receiver: send completed with error(method=1 attr=11) -- dropping.
Apr 10 15:07:45 [2400A] -> umad_receiver: send completed with error(method=1 attr=16) -- dropping.
This is a SubnGet of PkeyTable which is timing out.
Post by Roland Fehrenbacher
Apr 10 15:07:45 [2400A] -> umad_receiver: send completed with error(method=1 attr=16) -- dropping.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
These are SA MADs being received when the SM is not yet ready to handle
them. They could be SA sets of MCMemberRecord (from IPoIB). SA clients
in end nodes should retry them (assuming they do not exhaust their
timeout/retry strategy).

For debug purposes, it might be nice to display the method and attribute
of the SA MAD.
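
The retry behavior described above can be sketched as a bounded retry loop. This is a generic Python illustration, not the IPoIB or SA client implementation; request_with_retry, make_fake_sa, the retry count and the timeout value are all hypothetical names and numbers chosen for the example.

```python
def request_with_retry(send_once, retries, timeout_s):
    """Send a request up to 1 + retries times; return the first response or None.

    send_once is a hypothetical callable that sends one request, waits up to
    timeout_s seconds and returns the response, or None if nothing arrived.
    """
    for _ in range(1 + retries):
        response = send_once(timeout_s)
        if response is not None:
            return response
    # Retry budget exhausted: the client gives up.
    return None


def make_fake_sa(ignored_requests):
    """Simulate an SA that ignores the first ignored_requests requests
    (for example because the SM is still in its first sweep)."""
    state = {"seen": 0}

    def send_once(timeout_s):
        state["seen"] += 1
        if state["seen"] <= ignored_requests:
            return None  # request ignored, so the client times out
        return "response"

    return send_once


if __name__ == "__main__":
    # Two requests are ignored, the third is answered within the retry budget.
    print(request_with_retry(make_fake_sa(2), retries=3, timeout_s=1.0))
```

If the SM stays unready for longer than the whole retry budget, the client gets no answer at all, which is the case the parenthetical above warns about.
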
Post by Roland Fehrenbacher
Apr 10 15:07:46 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_SWEEP_HEAVY_SELF.
Apr 10 15:07:46 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET.
Apr 10 15:07:46 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET.
Apr 10 15:07:46 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET.
Apr 10 15:07:46 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET.
Apr 10 15:07:46 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET.
Apr 10 15:07:46 [2400A] -> umad_receiver: send completed with error(method=1 attr=11) -- dropping.
Apr 10 15:07:46 [2400A] -> umad_receiver: send completed with error(method=1 attr=11) -- dropping.
Apr 10 15:07:46 [2400A] -> umad_receiver: send completed with error(method=1 attr=11) -- dropping.
Apr 10 15:07:46 [2400A] -> umad_receiver: send completed with error(method=1 attr=16) -- dropping.
Apr 10 15:07:46 [2400A] -> umad_receiver: send completed with error(method=1 attr=16) -- dropping.
Apr 10 15:07:47 [2400A] -> umad_receiver: send completed with error(method=1 attr=16) -- dropping.
Apr 10 15:07:47 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET.
Apr 10 15:07:47 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET.
Apr 10 15:08:16 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:08:16 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:08:16 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
In the most recent OpenSM (gen1), this has been changed from an error to a warning. (That doesn't explain the delay in connectivity.)
Post by Roland Fehrenbacher
Apr 11 08:32:17 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0028 TID:0x000000000000004c
Apr 11 08:32:17 [18007] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0028 GID:0xfe80000000000000,0x0002c9010befe900
Apr 11 08:32:17 [18007] -> osm_report_notice: Reporting Generic Notice type:3 num:64 from LID:0x0011 GID:0xfe80000000000000,0x0002c902004013c1
Apr 11 08:32:17 [18007] -> Discovered new port with GUID:0x0002c902004012e9 LID range [0x3D,0x3D] of node:MT23108 InfiniHost Mellanox Technologies
Apr 11 08:32:17 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
Apr 11 08:35:27 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0028 TID:0x000000000000004d
Apr 11 08:35:27 [18007] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0028 GID:0xfe80000000000000,0x0002c9010befe900
Apr 11 08:35:27 [18007] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0011 GID:0xfe80000000000000,0x0002c902004013c1
Apr 11 08:35:27 [18007] -> Removed port with GUID:0x0002c902004012e9 LID range [0x3D,0x3D] of node:MT23108 InfiniHost Mellanox Technologies
At what point did it start working again? Was it at 15:24? (That
appears to be a 16-17 minute delay in connectivity.)
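
For reading the notice lines in the log excerpt, a small decoder may help. This is only a sketch: the two lookup tables hold the generic notice types and trap numbers as they are commonly documented for InfiniBand subnet management, abbreviated to the values that show up in this thread, and `describe_notice` is a hypothetical helper, not something OpenSM provides.

```python
import re

# Illustrative, non-exhaustive tables (assumption: standard generic notice values).
NOTICE_TYPES = {
    0: "Fatal",
    1: "Urgent",
    2: "Security",
    3: "Subnet Management",
    4: "Informational",
}
TRAP_NUMBERS = {
    64: "GID in service",
    65: "GID out of service",
    66: "multicast group created",
    67: "multicast group deleted",
    128: "switch port link state changed",
    144: "local port attribute changed (e.g. capability mask)",
}

# The log prints the type either as hex ("type:0x01") or as decimal ("type:3").
NOTICE_RE = re.compile(r"Generic Notice type:(0x[0-9a-fA-F]+|\d+) num:(\d+)")


def describe_notice(log_line):
    """Return (type name, trap description) for a notice log line, or None."""
    match = NOTICE_RE.search(log_line)
    if match is None:
        return None
    raw_type = match.group(1)
    base = 16 if raw_type.lower().startswith("0x") else 10
    notice_type = int(raw_type, base)
    trap = int(match.group(2))
    return (
        NOTICE_TYPES.get(notice_type, "unknown type"),
        TRAP_NUMBERS.get(trap, "unknown trap"),
    )
```

Read this way, the 08:32:17 and 08:35:27 entries are a switch link state change followed by a port GID going into service and then out of service again.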

-- Hal
Eitan Zahavi
2005-04-13 05:50:25 UTC
Permalink
FYI: OpenSM implements master handover in a "lazy" or "less intrusive"
manner:

OpenSM will only hand off a subnet to the new master during a heavy sweep
sequence. So if you start an SM and then start one with higher priority,
the handoff will not happen unless there is some change in the subnet
(a trap or a switch "change bit").

The main reason for this behavior is the concept of the "light sweep",
which limits discovery to checking "change bits" and, now, also
"unresponsive ports". So the new SM is not even discovered by the
master SM.

The benefit is that as long as there is no change in the subnet, the
active SM does not transfer ownership to the new one. Such a transfer
imposes overhead on the entire subnet (client re-registration or even
LID changes).
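
The lazy handover described above can be pictured with a toy model. This is a sketch of the decision only, under the assumptions stated in the comments; the class, field, and function names are hypothetical and do not correspond to OpenSM source code.

```python
from dataclasses import dataclass


@dataclass
class SubnetView:
    """What the master SM can observe at the start of a sweep (toy model)."""
    change_bit_set: bool = False          # some switch flagged a state change
    trap_received: bool = False           # a trap reached the master SM
    higher_priority_sm_started: bool = False


def sweep(view):
    """Return the action taken by the current master on one sweep.

    Assumption: a light sweep looks only at change bits and traps, so a newly
    started SM stays invisible until something triggers a heavy sweep.
    """
    if not (view.change_bit_set or view.trap_received):
        return "light sweep: no change seen, keep mastership"
    # Heavy sweep: full discovery, so other SMs on the subnet are found.
    if view.higher_priority_sm_started:
        return "heavy sweep: hand over to higher-priority SM"
    return "heavy sweep: reconfigure subnet, keep mastership"
```

In this model, starting a higher-priority SM on a quiet subnet changes nothing until a trap or change bit forces a heavy sweep.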

This behavior is compliant, as the spec says:
C14-60.2.1: If a Master SM finds another Master SM with lower priority
(or same priority and higher GUID) it shall ensure that it is the highest
priority (or same priority and lower GUID) on the subnet, and if so it
shall wait for the other Master (or Masters) to relinquish control of its
portion of the subnet.
C14-61.2.2: If a Master SM determines that a lower priority Master SM
has not performed a handover within a vendor-specific time period, then
it shall not change the state of the subnet.
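
The ordering the quoted text relies on (higher priority wins, and a lower GUID breaks ties) can be written as a single comparison key. This is a minimal sketch; `SMCandidate` and `elect_master` are hypothetical names, and the GUIDs are the two port GUIDs that appear in the osm_vendor_bind lines of this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMCandidate:
    guid: int
    priority: int  # 0..15, larger means more preferred


def elect_master(candidates):
    """Pick the SM with the highest priority; break ties with the lowest GUID."""
    return max(candidates, key=lambda sm: (sm.priority, -sm.guid))


low = SMCandidate(guid=0x2C902004013C2, priority=0)
high = SMCandidate(guid=0x2C9020040133A, priority=15)
master = elect_master([low, high])
```

With priorities 0 and 15 the second candidate is chosen. If both ran at the same priority, the GUID comparison alone would decide.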

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL
-----Original Message-----
Sent: Tuesday, April 12, 2005 8:00 PM
Subject: Re: [openib-general] OpenSM (again)
Post by Roland Fehrenbacher
Hal> SM election occurs per high priority low GUID. So if you
Hal> don't care which SM is the master then you don't need to do
Hal> anything. If you want a specific order (and it is not in GUID
Hal> order) then you need to specify priority.
Ok. I tried this, specifying priority 0 on one server and priority 15
on another one. I assume priority 15 will be the master.
If I first start the priority 0 opensm, and then the priority 15 one,
things look normal. Log excerpts:
priority 0 server
Apr 12 18:41:06 [4000] -> OpenSM Rev:openib-1.0.0
Apr 12 18:41:06 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
Apr 12 18:41:06 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:41:06 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:41:06 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c2.
Apr 12 18:41:06 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c2.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0001 TID:0x0000000000000011
Apr 12 18:41:06 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0001 GID:0xfe80000000000000,0x0002c902004013c2
Apr 12 18:41:06 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0002 TID:0x000000000000000d
Apr 12 18:41:06 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0002 GID:0xfe80000000000000,0x0002c9020040133a
Apr 12 18:42:25 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0002 TID:0x000000000000000e
Apr 12 18:42:25 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0002 GID:0xfe80000000000000,0x0002c9020040133a
priority 15 server
Apr 12 18:42:25 [4000] -> OpenSM Rev:openib-1.0.0
Apr 12 18:42:25 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
Apr 12 18:42:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:42:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:42:25 [4000] -> osm_vendor_bind: Binding to port 0x2c9020040133a.
Apr 12 18:42:25 [4000] -> osm_vendor_bind: Binding to port 0x2c9020040133a.
Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
When I kill the priority 15 server however, the priority 0 server runs
Apr 12 18:44:28 [2400A] -> umad_receiver: send completed with error(method=1 attr=20) -- dropping.
Apr 12 18:44:28 [2400A] -> umad_receiver: send completed with error(method=1 attr=20) -- dropping.
Attribute 0x20 is SMInfo. This is just the SubnGet(SMInfo) from the
priority 0 server failing (no matching SubnGetResp received) which is
"normal" if you killed the priority 15 server.
Do the messages ever subside?
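
To make the reading of "method=1 attr=20" concrete, here is a short sketch that maps those fields to names. It assumes both fields are printed in hex without a prefix (the attribute evidently is, since 0x20 is SMInfo; for the method this is a guess that makes no difference for the value 1). The tables list only a few well-known subnet management values, and `describe_failed_send` is a hypothetical helper.

```python
import re

# Small, non-exhaustive tables of subnet management method and attribute IDs.
MAD_METHODS = {0x01: "Get", 0x02: "Set", 0x81: "GetResp"}
SM_ATTRIBUTES = {0x0011: "NodeInfo", 0x0015: "PortInfo", 0x0020: "SMInfo"}

# \s+ tolerates the line wrap that mail clients insert between the two fields.
SEND_ERR_RE = re.compile(r"method=([0-9a-fA-F]+)\s+attr=([0-9a-fA-F]+)")


def describe_failed_send(log_text):
    """Turn 'error(method=1 attr=20)' into a name such as 'SubnGet(SMInfo)'."""
    match = SEND_ERR_RE.search(log_text)
    if match is None:
        return None
    method = int(match.group(1), 16)
    attr = int(match.group(2), 16)
    method_name = MAD_METHODS.get(method, hex(method))
    attr_name = SM_ATTRIBUTES.get(attr, hex(attr))
    return f"Subn{method_name}({attr_name})"
```

Applied to the two umad_receiver lines, this gives SubnGet(SMInfo), matching the explanation that the remaining SM is polling a peer that no longer answers.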
Post by Roland Fehrenbacher
I assume that the handover to the priority 0 opensm hasn't worked
then.
This isn't really handover but that is another matter.
You should be able to use the sminfo diag to see whether this SM has
assumed the MASTER role.
-- Hal
_______________________________________________
openib-general mailing list
http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit
http://openib.org/mailman/listinfo/openib-general
