Ticket #2517 (closed enhancement: fixed)

Opened 7 years ago

Last modified 7 years ago

Optimize verification of live incremental upgrades

Reported by: cscott Owned by: mstone
Priority: high Milestone: Update.1
Component: upgrade utility Version:
Keywords: Cc: cscott, mstone
Action Needed: Verified: no
Deployments affected: Blocked By:
Blocking:

Description

Live upgrades, based on "latest version" data received from the antitheft server.

Attachments

times.txt (4.2 kB) - added by mstone 7 years ago.
Profiling data on an update run

Change History

Changed 7 years ago by cscott

  • cc cscott, mstone added
  • owner changed from cscott to mstone

Reassigning to mstone, as he's taking the lead for this bug. We're going to press really hard to get some version of this into the MP build.

Changed 7 years ago by kimquirk

  • priority changed from normal to blocker

This should have been a blocking bug for Trial-3 since upgrades have to work for the code that gets loaded onto 40,000 laptops.

Changed 7 years ago by mstone

  • priority changed from blocker to high
  • summary changed from Live incremental upgrades to Optimize verification of live incremental upgrades
  • type changed from defect to enhancement
  • milestone changed from Trial-3 to Untriaged

As of build 595, the feature "live incremental upgrades" works.

Unfortunately, it works slowly and uses lots of RAM: it takes about 30 minutes to perform an update (with full manifest verification). Preliminary measurements indicate that there is substantial room for improvement in both RAM and time usage.

Therefore, I'm changing the summary, type, priority, and milestone to indicate that this is now an optimization bug.

Changed 7 years ago by mstone

Profiling data on an update run

Changed 7 years ago by mstone

Our 30-minute run time seems to be overwhelmingly dominated by time spent in our JSON parser. Unfortunately, most other JSON parsers, including simplejson and python-cjson, are optimized with C extensions. We would like to avoid C extensions in order to reduce the risk of buffer overflows, which adds slightly to the cost of using one of these other parsers.

Hence: how much developer time are we willing to spend to get a 20-minute run time, a 15-minute run time, or a 10-minute run time?

Changed 7 years ago by jg

  • component changed from distro to upgrade utility
  • milestone changed from Untriaged to First Deployment, V1.0

Quite a bit. You may not be able to avoid using C extensions, unfortunately. Even 10 minutes is *insane*. If upgrades take too long, people will start avoiding upgrading, which will be an even bigger security threat; they should be fast.

Changed 7 years ago by cscott

I reimplemented this in 545 lines of C. Speed on XO B4, checking build 602:

  • SHA + RMD: 6m 6s (4m30s user)
  • SHA only: 4m46s (3m11s user)
  • RMD only: 2m54s (1m20s user)
  • no hash : 1m37s ( 3s user)

One and a half minutes of the time is spent in the kernel getting the data off the NAND. There's some busy-waiting going on here: dwmw2 indicated that we need to pause for 1ms to let the data arrive from NAND, and if we yielded to the kernel, we wouldn't get back for 10ms.

The RIPEMD160 hash is about twice as fast as SHA256 in this implementation (libgcrypt). It's also much weaker than SHA256, but maybe we don't need the extra strength. Clearly we should disable one or the other of the algorithms; we can't afford to double our hashing time.

We might also look for better implementations of the hash functions. Libgcrypt has almost certainly not been tuned for our architecture; I don't have a good sense for how its speed compares to (say) libtomcrypt or openssl. Getting a little bit more performance out of RIPEMD, for example, would make manifest verification really look good.
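A rough way to compare hash speeds before committing to one algorithm is a small timing loop. The sketch below is only an illustration: it uses Python's `hashlib` (backed by OpenSSL on most systems) rather than libgcrypt or libtomcrypt, the helper names `time_hash` and `compare` are made up for this example, and numbers from a development machine will not match an XO. Some OpenSSL builds do not provide `ripemd160`, so unavailable algorithms are skipped.

```python
import hashlib
import time


def time_hash(name, data, rounds=20):
    """Return the seconds taken to hash `data` `rounds` times with `name`."""
    start = time.perf_counter()
    for _ in range(rounds):
        h = hashlib.new(name)
        h.update(data)
        h.digest()
    return time.perf_counter() - start


def compare(names, size=1 << 20, rounds=20):
    """Time each available algorithm over a zero-filled buffer of `size` bytes."""
    data = b"\0" * size
    results = {}
    for name in names:
        try:
            results[name] = time_hash(name, data, rounds)
        except ValueError:
            # hashlib.new() raises ValueError for algorithms this build lacks
            # (ripemd160 is missing from some OpenSSL configurations).
            continue
    return results


if __name__ == "__main__":
    for name, secs in compare(["sha256", "ripemd160"]).items():
        print(f"{name}: {secs:.3f}s")
```

Running the same loop against each candidate library on the target hardware would give the comparison that actually matters here.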

Changed 7 years ago by mstone

cscott and I have agreed that we will further control for filesystem corruption by using RIPEMD and stat() checks on every file in the new tree but that we will run full checking only on inodes not contained in the old pristine tree.

We did not decide what kind of checking should be performed on directories.
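One possible reading of that agreement, sketched in Python for illustration only: every file gets a stat() size check and a light digest, and the second, stronger digest is computed only for files whose inode does not appear in the old pristine tree (for example, because the file was not hardlinked from it). The manifest layout (a dict mapping relative paths to a size and hex digests), the function names, and the interpretation of "full checking" are all assumptions, not the actual updater code. Directories are deliberately left out, since that question was not settled.

```python
import hashlib
import os


def file_digest(path, algo):
    """Hex digest of a file's contents, read in 64 kB chunks."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_inodes(root):
    """Collect (st_dev, st_ino) for every file found under `root`."""
    inodes = set()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            st = os.lstat(os.path.join(dirpath, name))
            inodes.add((st.st_dev, st.st_ino))
    return inodes


def verify_tree(new_root, manifest, old_inodes,
                light_algo="ripemd160", full_algo="sha256"):
    """Return a list of (relative path, reason) for every failed check.

    `manifest` is a hypothetical structure:
        {"rel/path": {"size": int, light_algo: hex, full_algo: hex}}
    """
    failures = []
    for rel, entry in manifest.items():
        path = os.path.join(new_root, rel)
        try:
            st = os.lstat(path)
        except OSError:
            failures.append((rel, "missing"))
            continue
        # Cheap checks, applied to every file in the new tree.
        if st.st_size != entry["size"]:
            failures.append((rel, "size"))
            continue
        if file_digest(path, light_algo) != entry[light_algo]:
            failures.append((rel, light_algo))
            continue
        # Full check only for inodes that are not shared with the old tree.
        if (st.st_dev, st.st_ino) not in old_inodes:
            if file_digest(path, full_algo) != entry[full_algo]:
                failures.append((rel, full_algo))
    return failures
```

A caller would build `old_inodes` once with `tree_inodes(old_root)` and treat an empty list from `verify_tree` as success. If `ripemd160` is unavailable in the local `hashlib`, another algorithm name can be passed as `light_algo`.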

Changed 7 years ago by cscott

Times with libtomcrypt, compiled with gcc-4.3 -march=geode -O9:

  • SHA+RMD: 5m17s (3m 5s user)
  • SHA only: 4m 1s (1m56s user)
  • RMD only: 3m 7s (1m11s user)
  • no hash: 2m 9s ( 2s user)

Note that my system times are significantly larger on this run, probably because I yum installed some stuff (gcc, etc) on my XO, so the filesystem size is larger and JFFS2 is having to work harder.

Compiling with -march=geode -O2 (since it's possible that -O9 creates suboptimal code) yields:

* SHA+RMD: 5m12s (3m9s user)

So we might as well compile with -O2.

In summary, libtomcrypt yields only a small benefit for RIPEMD-160 but a decent improvement for SHA256. It's worth using.
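To put numbers on that comparison, the user-time figures from the libgcrypt and libtomcrypt runs (-O9) can be converted to seconds and compared. This is just arithmetic on the timings quoted in this ticket; the helper names are made up for the example. Only user time is compared, because the system times of the two runs differ for unrelated reasons (the larger filesystem on the second run).

```python
import re


def seconds(text):
    """Convert a duration such as '4m30s', '3m 5s' or '3s' to whole seconds."""
    m = re.fullmatch(r"(?:(\d+)m)?\s*(\d+)s", text.strip())
    if m is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    return int(m.group(1) or 0) * 60 + int(m.group(2))


# User times copied from the two benchmark comments: (libgcrypt, libtomcrypt).
USER_TIMES = {
    "SHA + RMD": ("4m30s", "3m 5s"),
    "SHA only": ("3m11s", "1m56s"),
    "RMD only": ("1m20s", "1m11s"),
}


def reduction_percent(before, after):
    """Percentage drop in time from `before` to `after`, rounded to an integer."""
    b, a = seconds(before), seconds(after)
    return round(100 * (b - a) / b)


for label, (gcrypt, tomcrypt) in USER_TIMES.items():
    pct = reduction_percent(gcrypt, tomcrypt)
    print(f"{label}: {gcrypt} -> {tomcrypt} ({pct}% less user time)")
```

On these figures the SHA256-only run loses roughly 39% of its user time, the combined run roughly 31%, and the RIPEMD-160-only run roughly 11%.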

Changed 7 years ago by cscott

  • status changed from new to closed
  • resolution set to fixed

In build 607; closing the bug.
