Important SAP HANA HotNews about database corruption

SAP released a HotNews item related to database corruption in HANA databases. SAP HotNews are SAP Notes with Very High/Critical priority that should be reviewed as soon as possible. The HotNews item I’m talking about is SAP Note 2370160 – Possible Rowstore Table Corruption When Continuous Page Flush is Enabled, which describes a possible data corruption in rowstore tables.

You should check it urgently if your HANA database is on SPS12 with a patch level below Patch 4. If your data is already corrupted, the only solution is to restore a database backup!

The problem

As described in SAP Note 2370160, the problem is caused by a programming error in the disk write optimizations introduced in HANA SPS12. This SPS added a new feature named continuous page flush, which reduces the runtime of the savepoint by flushing pages to disk between savepoints. This improves performance and disk usage, especially during the savepoint operation.

This programming error causes entries in row store tables to become inconsistent, producing duplicates and even data loss. If the database parameter continuous_flush_interval_s is enabled, your database is affected. Please keep in mind that this parameter is enabled by default in SPS12!

The issue could appear in the following cases:

  • When deleting data from an affected table. In this case the indexserver crashes and leaves the following error in the indexserver trace:

[CRASH_STACK]  Stacktrace of crash: (0000-00-00 00:00:00 000 Local)
----> Pending exceptions (possible root cause) <----
exception  1: no.1000000  (ptime/storage/mm/mm_allocate.cc:1944)
    Assertion failed: page->isOccupied(idx) != is_occupied
exception throw location:
 1: 0x00007f4eb36fa3f7 in ptime::MemoryMgr::markSlotsInPage(ptime::Transaction*, RowEngine::RidSet::PageInfo const&, bool, bool, bool, bool)+0x393 at mm_allocate.cc:1944 (libhdbrskernel.so)
 2: 0x00007f4eb38fbeaf in RowEngine::RowTableManagerImpl::markSlotsDeleted(ptime::Transaction*, RowEngine::RidSet::PageInfo&, ptime::IndexUpdate*, bool, bool, ptime::fastvector<unsigned char*, 16ul>&)+0x18b at RowTableManagerImpl.cc:2357 (libhdbrskernel.so)
 3: 0x00007f4eb38fc538 in RowEngine::RowTableManagerImpl::markSlotsDeleted(ptime::Transaction*, RowEngine::RidSet&, ptime::IndexUpdate*, bool, bool)+0xd4 at RowTableManagerImpl.cc:2264 (libhdbrskernel.so)
 4: 0x00007f4eb38fc732 in RowEngine::RowTableManagerImpl::flushBulkDelete(ptime::Transaction*, bool, bool)+0x110 at RowTableManagerImpl.cc:2162 (libhdbrskernel.so)
 5: 0x00007f4eb38fc841 in RowEngine::RowTableManagerImpl::endBulkDelete(ptime::Transaction*, bool)+0x10 at RowTableManagerImpl.cc:2187 (libhdbrskernel.so)
 ….

  • When running a consistency check on the database. In this case you will see error message 5986 and the following text in the process log:

pointing to free variable part found;rowid=xxx;offset=8;typeid=37;ext_rowid=xxx

In the first case you can check the indexserver trace, which is located in the directory /usr/sap/SID/HDBXX/hostname/trace/ for an SDC database or in /usr/sap/SID/HDBXX/hostname/trace/DB_SID for an MDC database. In the second case you can follow SAP Note 1977584 – Technical Consistency Checks for SAP HANA Databases.
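As a rough sketch of how both symptoms could be searched for automatically, the following Python helper scans text lines for the two messages quoted above. The function names, the labels "delete-crash" and "consistency-check", and the `*.trc` file pattern are assumptions made for this example, and the consistency check message may land in a different log than the indexserver trace, so point the scan at whichever files apply on your system.

```python
from pathlib import Path

# Message fragments copied from the two error cases quoted in this post.
CRASH_SIGNATURE = "Assertion failed: page->isOccupied(idx) != is_occupied"
CHECK_SIGNATURE = "pointing to free variable part found"


def find_corruption_signatures(lines):
    """Return (line_number, kind) pairs for lines containing a known message."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if CRASH_SIGNATURE in line:
            hits.append((number, "delete-crash"))
        elif CHECK_SIGNATURE in line:
            hits.append((number, "consistency-check"))
    return hits


def scan_trace_directory(trace_dir, pattern="*.trc"):
    """Scan files matching `pattern` (an assumed naming) directly under trace_dir.

    Returns a dict mapping file path -> list of hits; files without hits are omitted.
    """
    results = {}
    for path in sorted(Path(trace_dir).glob(pattern)):
        # errors="replace" keeps the scan going if a file has odd bytes.
        with path.open(errors="replace") as handle:
            hits = find_corruption_signatures(handle)
        if hits:
            results[str(path)] = hits
    return results
```

Calling `scan_trace_directory()` with the trace directory of your instance returns an empty dict when neither message is present in the scanned files. An empty result only says the messages were not logged there; it is not a substitute for the consistency check itself.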

If your disks have an I/O performance problem, the likelihood of the issue appearing increases. You can check the I/O performance with the iostat command; the %util column in its extended device statistics is a good indicator of performance issues:

IOStat command execution
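As a rough sketch of how the %util check could be automated, the following Python helper parses text in the layout printed by `iostat -x` (a header row starting with "Device", then one row per device) and lists the devices above a threshold. The function names and the 90% threshold are assumptions of this example, not values from the SAP Note.

```python
def parse_iostat_util(output):
    """Map device name -> %util from `iostat -x` style text.

    If the text contains several reports, the last one wins for each device.
    """
    util = {}
    column = None  # index of the %util column in the current device table
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            column = None  # a blank line ends the current device table
            continue
        if fields[0].startswith("Device"):
            column = fields.index("%util") if "%util" in fields else None
            continue
        if column is not None and len(fields) > column:
            # Some locales print a decimal comma; normalise it before parsing.
            util[fields[0]] = float(fields[column].replace(",", "."))
    return util


def busy_devices(output, threshold=90.0):
    """Return the sorted names of devices whose %util is at or above threshold."""
    readings = parse_iostat_util(output)
    return sorted(dev for dev, pct in readings.items() if pct >= threshold)
```

Feeding it the captured output of an `iostat -x` run returns the devices worth a closer look. A single sample says little, so it makes sense to compare several intervals taken while the database is under load.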

The solution

To solve the issue we need to apply SPS12 Patch 4 or above. Currently the latest patch available is Patch 5, which was released on 15.12.2016. There is also a workaround that temporarily fixes the issue and prevents data corruption. The steps to perform in our database are the following:

  • Back up your database. This is not included in the SAP Note, but I strongly recommend taking a backup before doing anything else.
  • Set the database parameter continuous_flush_interval_s to 0. On an MDC system we have to change the parameter in each database (SYSTEMDB and tenant databases). We can do it with SAP HANA Studio or with a query executed through hdbsql.

  • Get the PID of the indexserver (for tenant databases and SDC) and of the nameserver (for the SYSTEMDB on MDC systems). We can get them from the operating system process list.

  • Using the PIDs from the previous step, we run two hdbcons commands, one after the other. hdbcons is a HANA kernel utility used for troubleshooting the SAP HANA database.

On an MDC system we have to execute the commands for both the indexserver and the nameserver PIDs. The first command shows the number of pages marked to be written to disk in the next savepoint. From that number of pages we can estimate how long the second command will take. Please keep in mind that the SAP system will be frozen and will not respond while the second command runs. The best approach is to execute the second command with the SAP application server completely stopped and the database running.

In our case it took around two and a half hours to execute the second command. The SAP system didn’t respond for at least one hour.
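The PID lookup from the steps above can be sketched in Python as a small parser for process list text. This is only an illustration: it assumes the layout printed by `ps -eo pid,comm` and assumes the processes appear under the names `hdbindexserver` and `hdbnameserver`, so verify both on your own host.

```python
def find_hana_pids(ps_output, names=("hdbindexserver", "hdbnameserver")):
    """Map process name -> list of PIDs from `ps -eo pid,comm` style text.

    The default process names are assumptions; pass your own if they differ.
    """
    pids = {name: [] for name in names}
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        if len(fields) != 2 or not fields[0].isdigit():
            continue  # skips the header row and malformed lines
        pid, command = int(fields[0]), fields[1].strip()
        if command in pids:
            pids[command].append(pid)
    return pids
```

The function returns a list per process name because an MDC system with several tenants runs more than one indexserver, and each of them needs the same treatment.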

  • When the hdbcons command finishes, you can execute two queries to check whether there is any corruption in the database. You can use SAP HANA Studio or the hdbsql utility to do so.

On an MDC system you will have to execute them in all your databases. If neither query returns any result, the database has no data corruption. Congratulations!

Query result showing no database corruption

If any result shows a corrupted table, the only way to solve it is to restore a database backup taken before the corruption occurred. I recommend opening a support message in the SAP Support Portal so SAP can check whether there is any other way to solve it.

  • Remember to set the database parameter continuous_flush_interval_s back to its default value once you have updated the database to Patch 4 or above. This can also be done with a query.
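For the two parameter changes in the steps above (setting continuous_flush_interval_s to 0, then restoring the default after patching), the statements could look like the ones built by this Python sketch. It uses the standard `ALTER SYSTEM ALTER CONFIGURATION ... SET/UNSET ... WITH RECONFIGURE` syntax, but the ini file (`global.ini`), the section (`persistence`) and the layer (`SYSTEM`) are assumptions of this example; confirm them against SAP Note 2370160 before running anything.

```python
PARAMETER = "continuous_flush_interval_s"


def flush_interval_statement(value=None, ini_file="global.ini",
                             section="persistence", layer="SYSTEM"):
    """Build the SQL text that sets or resets continuous_flush_interval_s.

    value=None builds an UNSET statement, which removes the custom value so
    the default applies again. ini_file, section and layer are ASSUMED
    defaults for illustration, not values taken from the SAP Note.
    """
    target = f"('{ini_file}', '{layer}')"
    key = f"('{section}', '{PARAMETER}')"
    if value is None:
        return f"ALTER SYSTEM ALTER CONFIGURATION {target} UNSET {key} WITH RECONFIGURE"
    seconds = int(value)
    if seconds < 0:
        raise ValueError("the interval cannot be negative")
    return (f"ALTER SYSTEM ALTER CONFIGURATION {target} "
            f"SET {key} = '{seconds}' WITH RECONFIGURE")
```

`flush_interval_statement(0)` gives the workaround statement and `flush_interval_statement()` gives the one that restores the default. The sketch only builds the text; it can then be executed with SAP HANA Studio or hdbsql in every database of the system.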

Conclusions

I have talked before about the importance of updating and patching our databases, systems and applications regularly, following an established plan. In this case the issue was reported as a HotNews, so it will be quite difficult to update the database in such a short period of time.

For this kind of situation we can schedule periodic maintenance windows in our landscape, so it is easier to make a plan and fix the issue before any problem happens. I strongly recommend checking the SAP HotNews and Security Notes as soon as we can. Just set a recurring appointment in your calendar with a reminder every 15 days or every month and spend a few minutes reading them. You can check the SAP HotNews here and the SAP Security Notes here.

For the SAP HotNews, we can define which components and products we want to check by modifying the SAP Component, System or Category options in the upper part of the page:

SAP HotNews in the Support Portal