I recently ran into a puzzling issue in my application and would appreciate any guidance on troubleshooting it further.
I have tried to include as much information as I can below for your review.
I have an application based on JSP and Servlets; the web container is Tomcat 6.0.
The front end is Apache, which forwards incoming requests to Tomcat 6.0. The Apache-to-Tomcat forwarding is set up via the mod_jk connector.
Apache is configured to serve HTTPS with an SSL certificate.
The application has a good error logging mechanism for exceptions, which are stored in the DB. There is also a provision to log DB connection errors in the Tomcat logs, so that if the application cannot write to the DB because of a DB error, that failure will appear in the Tomcat logs.
The application ran without any issues until recently, when some of its online processes suddenly started behaving unexpectedly. One process has 5 steps, and it failed after the second step, before moving to the third. The users always saw an error message, which is a custom message the application raises when an exception occurs. The most unexpected part, however, was that there was nothing in the DB error log or in the Tomcat logs. Neither mechanism indicated an error.
It looked as though the application did encounter an error every time after the second step of the online process, but somehow could not log it anywhere. We are not sure how that could happen. We made the following checks to confirm that the application has everything in place to handle this situation:
1. DB error logging - We verified again that, when an exception occurs, the application logs the error message and stack trace in the DB. As far as we can tell, there is no code path where that would be missed.
2. Alternative logging - The application has a mechanism to log DB connection errors in the Tomcat logs. Therefore, if the application simply could not write the error to the DB, that failure should have gone to the Tomcat logs.
3. Testing - We ran tests in two replica environments for a couple of weeks, but the issue did not occur in either of them.
4. Apache to Tomcat redirection - At one point we also wondered whether this could be a redirection issue. However, that was ruled out because the users see the application's custom error message every time after the second step, which indicates that the request reaches Tomcat properly in all instances.
The problem was resolved by restarting Tomcat, which again does not explain anything, as the root cause is still unknown.
Based on my experience so far, there are two situations that a Tomcat restart normally resolves:
1. Out of Memory errors - When a memory leak in the application causes "Out of Memory" after continuous processing over a period of time, a Tomcat restart resolves the error because it clears the memory.
2. Connection over limit issues - When there is a connection leak, the application runs out of connections over a period of time. Either the DB cursors exceed their limit, or the connections managed by the application exceed the total number of connections the DB allows. A Tomcat restart releases the connections and resolves the issue.
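To check the first hypothesis before the next restart, a heap snapshot could be written to the logs at the point where the custom error message is raised. The following is a minimal sketch only: `HeapSnapshot` and `describe` are hypothetical names, not part of Tomcat or of the application, and the code relies on nothing beyond the standard `java.lang.Runtime` methods.

```java
/**
 * Sketch only: HeapSnapshot is a hypothetical helper,
 * not an existing Tomcat or application class.
 */
class HeapSnapshot {

    /** Builds a one-line summary of the JVM's current heap usage in megabytes. */
    static String describe() {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        // Used heap = heap currently allocated to the JVM minus the free part of it.
        long used = (rt.totalMemory() - rt.freeMemory()) / mb;
        // Upper bound the JVM will attempt to use (for example, as set by -Xmx).
        long max = rt.maxMemory() / mb;
        return "heap used=" + used + "MB max=" + max + "MB";
    }
}
```

Calling `System.err.println(HeapSnapshot.describe());` next to the error handling would show whether the used heap is close to the maximum when the failure occurs.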
However, neither of these appears to be the root cause of the recent issue, because the application reached the third step of the online process every time, and the first two steps do use DB connections.
I have searched the internet for more than a week for similar issues but have not found anything so far. I have done everything I could think of, so I thought it was time to ask for expert advice. I am continuing my research, but any help would be very welcome in the current situation.
I look forward to your comments and suggestions to help investigate the problem.
Are you also using a catalina.out log file (i.e. the redirected System.out/System.err)? If so, add a System.out.println or System.err.println call in all the places where you are doing the other "logging" and see if the exception shows up in the stdout log. Also, set a default error page for the application that does nothing but show the exception, its stack trace, and the root cause and its stack trace. Then, after attempting your logging, simply rethrow the exception.
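As a rough illustration of that suggestion, the sketch below attempts the existing logging, echoes the exception to a stream regardless of whether that logging worked, and then rethrows. It is an assumption-laden sketch, not the application's actual code: `ErrorSink`, `StdoutFallbackLogging` and `logThenRethrow` are hypothetical names, and `ErrorSink` merely stands in for whatever DB error logger the application already has. It is written without newer language features so that it should also compile on the older JVMs that Tomcat 6 runs on.

```java
import java.io.PrintStream;

/**
 * Sketch only: all names here are hypothetical and are not part of
 * Tomcat or of the original application.
 */
class StdoutFallbackLogging {

    /** Hypothetical stand-in for the application's existing DB error logger. */
    interface ErrorSink {
        void write(Throwable error) throws Exception;
    }

    /**
     * Attempts the existing logging, always echoes the error to the given
     * stream, then rethrows the original exception so that a default error
     * page can display it.
     */
    static void logThenRethrow(Exception error, ErrorSink existingLogger, PrintStream out)
            throws Exception {
        try {
            existingLogger.write(error);
        } catch (Throwable loggingFailure) {
            // Catch Throwable rather than Exception so that an Error raised
            // inside the logger itself is reported instead of disappearing.
            out.println("Existing error logging failed: " + loggingFailure);
            loggingFailure.printStackTrace(out);
        }
        out.println("Application error: " + error);
        error.printStackTrace(out); // also prints the "Caused by:" chain
        out.flush();
        throw error;
    }
}
```

In the application this would be called from the existing catch blocks with `System.err` as the stream, for example `StdoutFallbackLogging.logThenRethrow(e, dbLogger, System.err);`, where `dbLogger` is again a hypothetical adapter around the current DB logging code.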