Intermittent problems sending order confirmation mails

We have intermittent problems sending order confirmation mails. Sometimes they are sent out, sometimes they aren’t, and there seem to be two causes:

  1. The mail service can’t get the HTML for the mail because the web request fails
  2. The scheduler doesn’t retry the send even though it says it will

Here is the actual error:

Execution of background processor Litium.Application.Scheduler.BackgroundJobWorker, Litium.Application failed, executing again in 5 seconds.

System.Net.WebException: Unable to connect to the remote server ---> System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond [load balancer external IP]:443
at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
at System.Net.ServicePoint.ConnectSocketInternal(Boolean connectFailure, Socket s4, Socket s6, Socket& socket, IPAddress& address, ConnectSocketState state, IAsyncResult asyncResult, Exception& exception)
--- End of inner exception stack trace ---
at System.Net.HttpWebRequest.GetResponse()
at Litium.Accelerator.Services.MailServiceImpl.MailServiceProcessor.GetWebPageContent(…

Each failed order confirmation mail produces exactly one log entry, and it is always this error.

While beginning to troubleshoot this, I have some questions:

  1. Is it correct/normal that it should use the external load balancer IP to fetch the mail body? If not, where would I configure this?
  2. Does the scheduler run on all servers in a cluster? I can’t find any references to the scheduler in configuration.
  3. Why does the scheduler not retry the send even though it says it will in “5 seconds”? Is this something that needs to be configured?
  4. Any other pointers on how to troubleshoot this?

Litium version: 7.6.1 + accelerator @ Litium Cloud production environment

Connecting to the external load balancer IP from the cluster servers does not work, at least not over HTTP/HTTPS. It also turned out that one of the servers in the cluster did not have the external site address mapped to 127.0.0.1 in its hosts file, while the others did. So the problem was most likely that this one server tried to fetch the HTML via the external address and failed.

So it seems that, unless the “server address” can be easily configured (my question 1 above), all cluster nodes must resolve the external site address to an internal address (either a cluster node or localhost).
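To compare the nodes, a small diagnostic along these lines could be run on each server. It is only a sketch and not part of Litium: the function names `classify` and `check_site` are made up for illustration, and it assumes Python is available on the nodes. It resolves the site address through the operating system resolver (which normally honors the hosts file) and then attempts a plain TCP connection to port 443.

```python
import ipaddress
import socket


def classify(ip: str) -> str:
    """Label an IP address as loopback, private, or public."""
    addr = ipaddress.ip_address(ip)
    if addr.is_loopback:
        return "loopback"
    if addr.is_private:
        return "private"
    return "public"


def check_site(hostname: str, port: int = 443, timeout: float = 5.0,
               resolve=socket.gethostbyname) -> dict:
    """Resolve hostname as this node sees it, then try a TCP connect.

    `resolve` is injectable so the check can be exercised without DNS.
    """
    ip = resolve(hostname)
    result = {"hostname": hostname, "ip": ip,
              "kind": classify(ip), "connect_ok": False}
    try:
        # Only the TCP handshake is tested here, not TLS or HTTP.
        with socket.create_connection((ip, port), timeout=timeout):
            result["connect_ok"] = True
    except OSError as exc:
        # Covers timeouts and refused connections alike.
        result["error"] = str(exc)
    return result
```

Calling `check_site` with the external site address on every node and comparing the `kind` field should reveal a node like the one described above: it would report `public` and a failed connect, while correctly configured nodes report `loopback` or `private`.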

Still interested in questions 2 and 3 above even though the issue may have been resolved. :slight_smile:

Here are the answers to questions 2 and 3:

  1. Yes, the scheduler runs on all servers in the server farm and distributes the work between them; each scheduled job is processed by only one server.
  2. The message is misleading. It is the BackgroundJobWorker that will execute again after the 5-second delay, and it then looks for the next scheduled message that has not already been processed. The message that was already processed, in this case the mail send, has been marked as failed and will not be processed again.
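The behavior described in the second answer can be illustrated with a toy model. This is not Litium's implementation; `Job`, `Worker`, and `failing_send` are hypothetical names, and the 5-second wait is left out. It only shows how a worker can log "executing again" while the failed job itself is never picked up a second time.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Job:
    name: str
    action: Callable[[], None]
    status: str = "pending"  # pending -> done | failed
    attempts: int = 0


@dataclass
class Worker:
    jobs: List[Job] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def run_once(self) -> bool:
        """Process the next pending job; return False when none is left."""
        job = next((j for j in self.jobs if j.status == "pending"), None)
        if job is None:
            return False
        job.attempts += 1
        try:
            job.action()
            job.status = "done"
        except Exception as exc:
            # The job is marked as failed, so it is never selected again.
            job.status = "failed"
            # The log line refers to the worker loop, not to this job.
            self.log.append(
                f"worker failed on {job.name}: {exc}; executing again in 5 seconds"
            )
        return True

    def run(self) -> None:
        # A real worker would wait between iterations; omitted here.
        while self.run_once():
            pass


def failing_send() -> None:
    """Stand-in for a mail send whose web request fails."""
    raise ConnectionError("Unable to connect to the remote server")
```

Running a `Worker` with one job that uses `failing_send` followed by a job that succeeds leaves the first job in status `failed` after a single attempt, with one log line, while the worker continues with the second job.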