A key feature of HFM lets you schedule and automate consolidations, data loads, extracts, emails, and more through a highly customizable operation called a task flow. Task flows let jobs complete without manual input between steps, which can be a serious time saver, but they also open the door to unforeseen HFM failures that can set you back further than if you had simply run the steps manually.
For example, User A starts a task flow they expect to take several hours and then begins their other work. Later that day they come back expecting to pick up where the task flow finished, only to find that it failed during step two of fifteen and did not progress any further. Now the user is behind, and it is your job to find out why and prevent it from happening again.
This recently created several problems for a client running consolidations during evening and nighttime hours. They would return in the morning to find that the task flow had failed with one of two errors: “EPMHFM-66030: An unexpected communication error has occurred” or “EPMHFM-66076: Server XYZ123 is unavailable, connection could not be established.”
After reading through the logs, checking every timeout setting we could find, and running numerous test cases, we found that the client’s network was dropping packets for extremely short periods of time (around 5-10 seconds). Monitoring was already in place, but its polling interval was five minutes, far too coarse to catch an outage that short. For troubleshooting purposes, we put a custom script in place to ping the server every five seconds and record dropped packets. The timestamps of the errors above lined up exactly with the dropped packets for the HFM server running the consolidations. While the HFM server was still up and working on the consolidation, the Foundation server was unable to contact it over the network and stopped the task flow.
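To illustrate the kind of short-interval check described above, here is a minimal sketch in Python. It is not the script used for the client. It assumes a TCP connection attempt is an acceptable stand-in for an ICMP ping (which typically needs elevated privileges), and the host name, port, and log file name are hypothetical placeholders. The `near_failure` helper is also an assumption, showing one way to line up an error timestamp from the HFM logs with recorded probe failures.

```python
import socket
import time
from datetime import datetime


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port opens within timeout.

    Stand-in for an ICMP ping; it tests reachability of one port only.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name resolution errors.
        return False


def monitor(probe, interval=5.0, max_checks=None, log_path=None,
            sleep=time.sleep, now=datetime.now):
    """Call probe() every `interval` seconds and record failure timestamps.

    Runs forever when max_checks is None; otherwise stops after that many
    checks and returns the list of failure timestamps.
    """
    failures = []
    checks = 0
    while max_checks is None or checks < max_checks:
        if not probe():
            stamp = now()
            failures.append(stamp)
            if log_path:
                with open(log_path, "a", encoding="utf-8") as log:
                    log.write(f"{stamp.isoformat()} probe failed\n")
        checks += 1
        if max_checks is None or checks < max_checks:
            sleep(interval)
    return failures


def near_failure(error_time, failures, tolerance_seconds=10.0):
    """Return True if error_time is within tolerance of any recorded failure."""
    return any(
        abs((error_time - stamp).total_seconds()) <= tolerance_seconds
        for stamp in failures
    )


if __name__ == "__main__":
    # Hypothetical host and port; replace with the HFM server being watched.
    monitor(
        lambda: tcp_probe("hfm-server.example.com", 443),
        interval=5.0,
        log_path="hfm_probe.log",
    )
```

Run from the Foundation server, a check like this approximates the network path the task flow depends on. The probe interval needs to be shorter than the outages you are trying to catch, which is why a five-minute monitor missed drops lasting 5-10 seconds.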
Currently there is no permanent fix for this (other than a more stable network). There is retry logic built into HFM that handles network outages shorter than five seconds, but after a longer outage HFM may not reconnect properly. The only workaround Oracle gives to properly recover the application is to restart HFM services. At this time, Oracle has deemed this unfeasible to fix.