RTI Performance issues, Part 1

chulett · Post by **chulett** » Tue Jun 10, 2008 9:30 am

Looking for some Words of Wisdom as I am getting beaten up black and blue over 'RTI performance' or lack thereof. I said "Part 1" as I thought I'd start with the simplest case and, if time allows, follow up with the more complex ones. Please see this post which explains our topology.

I have four services that have been in use by the application they were written for since early 2006 without issue. Earlier this year they were offered up to another group who needed to do 'the same thing' so they didn't have to write their own services as these were already in place. They've then proceeded to beat the hell out of these services (and me), calling them at a much higher volume than anything we had in mind when they were written. I want to start with the simplest, their newest sibling which was put together just for this upstart group.

We dish out surrogate keys in our ETL via a modified version of the sdk routine that supports concurrency. About the only difference is the hashed file name it uses and some minor if-then-else work to handle the requirements we had on compound surrogates. I have zero performance issues with this in Server jobs and can pump thousands of rows per second 'through' it. However, we don't ever really stress the 'concurrency' aspect of it, not having multiple processes calling it for the same key at the same time. They do.
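For readers who have not seen this kind of routine, here is a minimal sketch of the general pattern: one lock per key name, taken for the duration of the read-increment-write. It is not the sdk routine or our modified version of it. The class `SurrogateKeyStore`, the helper `hammer`, and the in-memory dict standing in for the hashed file are all hypothetical. It only illustrates that callers asking for the same key name at the same time are forced to take turns.

```python
import threading


class SurrogateKeyStore:
    """Hypothetical in-memory stand-in for a hashed file of key counters."""

    def __init__(self):
        self._next = {}                  # key name -> next value to hand out
        self._locks = {}                 # key name -> lock for that counter
        self._guard = threading.Lock()   # protects creation of per-name locks

    def next_value(self, name):
        # Find (or create) the lock for this key name.
        with self._guard:
            lock = self._locks.setdefault(name, threading.Lock())
        # Callers for the SAME name serialize here; other names are unaffected.
        with lock:
            value = self._next.get(name, 1)
            self._next[name] = value + 1
            return value


def hammer(store, name, threads, calls_per_thread):
    """Call next_value from several threads at once; return sorted results."""
    results = []

    def worker():
        for _ in range(calls_per_thread):
            results.append(store.next_value(name))

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return sorted(results)
```

Calling `hammer(SurrogateKeyStore(), "CUSTOMER", 8, 500)` should hand back every value exactly once, with no gaps or duplicates. The price is that the eight callers queue up behind one lock, which is the part a single-stream Server job never exercises.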

The job is simplicity itself and took all of two seconds to cobble together. Three stages:

RTI Input --> Transformer --> RTI Output

The input accepts a string, the transformer calls the routine and passes back out a number. Wham bam, thank you ma'am. It's running 'Always On' with three instances... which now I'm wondering if that was a mistake. The only change to the default deployment values was to raise the default queue from 3 to 100, I believe.

They are reporting wait times of multiple minutes at times to get a response back from this service. Which means their processing times out. I can get some details on exactly how they are calling this from the Java, if that would help. I don't really understand if the 'RTI Enabled' part of all this is the bottleneck, or if running multiple instances rather than just one with a huge queue perhaps is biting me in the butt and causing locks inside the routine.

How can I determine the problem? Any suggestions for things I can do to improve the situation? Thanks.

lstsaur · Post by **lstsaur** » Tue Jun 10, 2008 1:32 pm

In your shop, do you have either Apache's tcpmon or Oracle's Jdeveloper that can be used to analyze your HTTP packet that is sent between your RTI service client and an RTI server? With that information, you can determine exactly where the bottleneck is and what causes the problem.

I would also increase the "Wait Delay" time in addition to the Queue Size that you already changed from 3 to 100.

chulett · Post by **chulett** » Tue Jun 10, 2008 1:36 pm

Thanks for the info, I'll check. Probably have the Apache tool here somewhere, not sure about the Oracle one. And I'll bump up the Wait Delay as well. This is good timing as I just found out it's time for another 'load test' in the QA environment tonight.

Any idea what a reasonable value to raise that to would be?

lstsaur · Post by **lstsaur** » Tue Jun 10, 2008 4:11 pm

Change the default Wait Delay from 100 ms to 10000 ms.

chulett · Post by **chulett** » Tue Jun 10, 2008 4:22 pm

Thanks, I'll give that a shot.

eostic · Post by **eostic** » Tue Jun 10, 2008 8:24 pm

Multiple "minutes"? Something else must be going on. The suggestions on watching the packets are key here, as is knowing the "conditions" under which the multiple minutes occur. What kind of increase in volume are you talking about? Talk with the java folks and capture two variables in particular: the rate at which a given client issues requests (such as an automated application issuing SOAP calls in a tight loop), and the number of concurrent clients.

Also...when the multiple "minutes" occur, what else is going on? Do ALL the clients experience multiple minutes?....or just one here and there?
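If the client team cannot easily produce those numbers, a small harness along these lines could capture them. This is only a sketch: `run_load` and `fake_service` are hypothetical names, and `fake_service` is a placeholder to be swapped for whatever actually issues the SOAP request. It runs a fixed number of concurrent clients, each calling in a tight loop, and reports the overall request rate plus the median and worst response times.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(call, clients, requests_per_client):
    """Run `call` in a tight loop from `clients` threads and time every request."""

    def one_client(_client_id):
        latencies = []
        for _ in range(requests_per_client):
            start = time.perf_counter()
            call()
            latencies.append(time.perf_counter() - start)
        return latencies

    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=clients) as pool:
        per_client = list(pool.map(one_client, range(clients)))
    elapsed = max(time.perf_counter() - started, 1e-9)  # avoid divide-by-zero

    latencies = sorted(t for client in per_client for t in client)
    return {
        "requests": len(latencies),
        "elapsed_s": elapsed,
        "rate_per_s": len(latencies) / elapsed,
        "median_s": latencies[len(latencies) // 2],
        "worst_s": latencies[-1],
    }


def fake_service():
    """Placeholder for the real service call (assumption: replace with the SOAP request)."""
    time.sleep(0.001)
```

Running it at, say, 10, 50 and 150 clients and comparing `worst_s` against `median_s` would show whether everyone slows down together or only an unlucky request here and there.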

...and do none of them fail with NO HANDLER AVAILABLE errors? The queue settings will impact that, without affecting or improving response time. In other words, if you found that everyone was getting NO HANDLER because of response time issues, you could increase the queue to simply "accept" the poor response time but still have clients successfully serviced.
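A toy model makes the trade-off concrete. It is not how RTI actually schedules requests: it assumes a burst of requests arriving at the same instant, a fixed number of handlers (job instances), a fixed service time, and a queue that simply rejects whatever does not fit. The function `simulate_burst` and all of the numbers are illustrative assumptions.

```python
def simulate_burst(burst, handlers, queue_size, service_ms):
    """Toy model: `burst` requests arrive together at time zero.

    Returns (rejected, worst_response_ms). Requests beyond the handlers
    plus the queue are rejected; accepted ones are served in waves of
    `handlers` requests, each wave taking `service_ms`.
    """
    accepted = min(burst, handlers + queue_size)
    rejected = burst - accepted
    waves = -(-accepted // handlers)  # ceiling division
    return rejected, waves * service_ms


# Same burst, same three handlers, only the queue size differs.
small_queue = simulate_burst(burst=50, handlers=3, queue_size=3, service_ms=200)
large_queue = simulate_burst(burst=50, handlers=3, queue_size=100, service_ms=200)
```

With the small queue most of the burst is turned away but the few accepted requests return quickly. With the large queue nothing is turned away, and the last request in line waits through every wave ahead of it. The queue changes who gets served, not how fast.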

I wonder if the requests are even reaching the HTTP servlet.

Ernie

chulett · Post by **chulett** » Tue Jun 10, 2008 9:14 pm

Multiple minutes? Yah, or so I'm told. Getting useful information from them can be like pulling teeth, so much of the time I... punt. Usually I get 'requests are timing out' and they send me examples of the 'problem XML', which I proceed to run standalone without issue.

As to the actual volumes, original user counts were pretty small, typically less than 10 concurrent users. The number now is probably 10 to 20 times that, spread across the day from the mid-west to the east coast to off-shore locations as well. And yes, their application can make multiple automated calls in a 'tight loop' to our services, and each singleton call could do anywhere from a handful to multiple thousands of database operations during the call. And, as far as I know, the issues don't occur across all client sessions but more like the 'one here and there' you mentioned.

The only time I recall getting the NO HANDLER messages was when the services were actually down. Let me dig up one of the messages they've sent me:

javax.ejb.EJBException: Exception trying to invoke operation X.
Timed out waiting for an operation handler to become available for Operation 'X'.

Guess not really all that different, but looking like something we're working around by extending the queue sizes. Wish I knew what I could do to trace this activity and know what is actually going on with it. I'll see if I can get the stats you've asked about from them.

This may all be academic as they seem to have finally put a workaround into QA that I suggested - they no longer call this service and 'share' our surrogate sequences but instead have switched to Oracle sequence objects, initialized to start several bajillion ahead of where ours currently are so as to not run into each other. I just saw the load test results and they allegedly ran 150 concurrent users without issue whereas in the past they couldn't get past 50. So I may be finally off the meat hook without having to lose too many more brain cells in the process.

Let's see what comes of this in the next couple of days and if there is any need to pursue this... or part two.

Thanks guys.