2009-06-16

Spirit Sol 161

I walk into an empty room. Usually, the place is buzzing with activity -- or, if not buzzing, at least murmuring quietly. Today it's empty.

Dan Moyers clears up the mystery when he arrives a few minutes later. Once again, there's been a late schedule change. This time, at least, they moved the meetings later, not earlier. So they're all upstairs, and I'm extra-early, as it turns out.

Which is probably a good thing, because there's another mystery. Thisol's IDD sequence errored out, and nobody knows why. That is, they know the command that caused the error -- they just don't know how it got to the spacecraft, because it doesn't seem to be in the IDD sequences the rover planners wrote yestersol.

Which is very, very disturbing.

While I'm waiting for everyone else to show up, I start investigating, and I soon have a working theory. It looks like they delivered the wrong version of the sequence, one that included so-called overdrive correction commands that aren't meant to be sent to the spacecraft.

The overdrive correction commands are needed because RSVP doesn't know about a fundamental law of the universe -- that two objects can't occupy the same space at the same time. Because there's always a little bit of uncertainty in the system as to the exact location of the terrain with respect to the rover, we deliberately tell the IDD to move farther than we think it needs to when placing an instrument on rocks or soil. The rover's contact sensors detect when the instrument actually touches the terrain, and the arm stops. It's a little like feeling your way around in a dark room -- you gently extend your hand toward where you think the wall is, and stop moving when you touch it.

Only RSVP's simulation doesn't stop when the arm touches the terrain. If you tell it to put the arm half a meter inside a rock, it cheerfully goes that far in. This mismatch between the behavior of the real and simulated arms leads to other problems when we're sequencing follow-up motions -- just recently, the real arm shut down because it detected an impending self-collision, a self-collision that didn't show up in the simulation. As a result, we add commands to fix up the simulation -- for every command where the simulation puts the arm inside a rock or under the ground, we add a command that pulls it back out to a more plausible position. With these added commands, the simulation behaves much more like the real thing. Once the simulated sequence is doing what we want, we remove the overdrive correction commands and send the rest to the real rover.

Only ... that doesn't seem to have happened this time. Deepening the mystery, when we look at the files that were delivered yesterday, the overdrive correction commands aren't there. But when we look at what was sent up to the spacecraft, two of them were included.

Eventually, a bunch of us -- John, Sharon, Marc Pack, I, and others -- manage to solve the mystery together. They had to change one of the IDD sequences yesterday, after an initial version had already been delivered.[1] That initial delivery had the wrong sequences, and when they were delivered a second time, one of the sequences was updated and one wasn't. For complicated reasons, the versions of the sequences we reviewed weren't the ones we sent up -- we reviewed the right ones, without the overdrive correction commands. So one of the sequences sent to the rover still had overdrive correction commands in it, but those commands weren't in the version everyone was looking at.

That last part is the really scary part. We have a system in place that ordinarily guarantees that the versions of the sequences that get sent up to the spacecraft are the same as the versions we review. That system can break down, though, when a sequence is updated late in the game, as happened yesterday. Fortunately, Sharon and Marc already have most of a script that can compare the reviewed versions of the sequences to the uplinked versions. They immediately set to work arranging for the script to be run as part of the uplink process every sol, which should help catch this problem in the future.
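For the curious, here is a minimal sketch of the kind of check such a script might perform. It is not the script Sharon and Marc wrote, which I'm not reproducing here; the function names, the name-to-text dictionaries, and the command names are all invented for illustration. The idea is simply to fingerprint each reviewed sequence and each uplinked sequence and flag any name where the two disagree.

```python
import hashlib


def digest(text: str) -> str:
    """Return a SHA-256 fingerprint of one sequence's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_mismatches(reviewed: dict[str, str], uplinked: dict[str, str]) -> list[str]:
    """List sequence names whose uplinked text differs from the reviewed text.

    Both arguments map a (hypothetical) sequence name to its full text.
    A sequence present on only one side also counts as a mismatch.
    """
    mismatches = []
    for name in sorted(set(reviewed) | set(uplinked)):
        if name not in reviewed or name not in uplinked:
            mismatches.append(name)
        elif digest(reviewed[name]) != digest(uplinked[name]):
            mismatches.append(name)
    return mismatches


if __name__ == "__main__":
    # Invented sequence names and commands, for illustration only.
    reviewed = {"idd_a": "MOVE_ARM 1\nPLACE_MB\n", "idd_b": "MOVE_ARM 2\n"}
    uplinked = {
        "idd_a": "MOVE_ARM 1\nOVERDRIVE_FIX\nPLACE_MB\n",
        "idd_b": "MOVE_ARM 2\n",
    }
    for name in find_mismatches(reviewed, uplinked):
        print(f"MISMATCH: {name}")
```

A check like this only earns its keep if it runs every time, including after a late redelivery, which is exactly the case that bit us.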

It makes me think, though. All that happened as a result of this mistake is that we lost most of a sol -- the spacecraft is fine. But one of these days, we might be telling a similar story with a much worse ending. If we're lucky, we're not telling that story to a Congressional committee.

At least this anomaly made today's sequencing easy. John copies yesterday's sequences, cuts the parts that completed already, prepends a command to clear the error flag, and goes to lunch. There's a little more to it than that, though (isn't there always?). In one way, we're lucky that the error killed the sequence, because one of the moves in yesterday's sequence would have left the MB hovering a couple of centimeters off the rock surface, which would have made its data puzzling at best, and perhaps entirely useless. I fix it.

I also write up the detailed story of the anomaly and send it out to the rover drivers, mission managers, and a few others. Bob Bonitz, who's still on the rover-driver email list, comes downstairs to laugh about the matter. Since he's not on the project any more, he can freely poke fun at our mistakes. "That was like a description of a Keystone Kops movie," he chuckles.[2]

He also tells me a friend of his saw the KCET "Life & Times" segment that featured me and called him to ask, "Who's this Scott Maxwell guy, and what is he doing driving the rover?" Apparently, Bob was also interviewed by an NPR reporter who was here at the same time as the KCET guys, spoiling Bob's plan of never talking to the press throughout the mission. "But I still kind of met my goal, because they mispronounced my name," he says. "They called me 'Bob Buntz.'"

Even though it amuses Bob, I hope we don't have another problem tomorrow. And so does Emily, who looks like she's had a long day. "Will you be here later if there are any problems?" she asks George Chen.

"Yeah," he says. "Why?"

"Have you ever been around Steve Squyres when we lose a sol?"

"Not a pleasant experience?" he asks.

Emily purses her lips. "I just don't want to be around."[3]




[1] This doesn't mean it had been uplinked to the spacecraft, just that it had reached a certain advanced point in our process -- a point at which we'd normally consider the sequence ready to go up.

[2] Fair enough -- it was. But it's interesting to note that mistakes in a complex process always look like that -- they have to, because the mistakes that get through to the end are by definition the ones that just happen to involve screw-up after screw-up. What you don't see are all the errors that get caught because of those multiple gates in the process that normally stop them. (In a sense, this sort of complex system is constantly in a state of partial failure.) As a result, if you look only at the mistakes, the people involved look stupid and/or incompetent, but that's an inaccurate view. Application of this idea to, say, high-profile government security failures is painful but sobering.

A while back, fellow rover driver Paolo Bellutta pointed out that we're missing something by focusing our attention almost exclusively on mistakes we don't catch. We should also pay more attention to mistakes we do catch, he said, and try to formalize the way we catch them, because otherwise we'll someday let one of those slip through. I thought this was a brilliant insight on his part and have tried my best to apply it.

[3] I don't know what was behind this comment -- and maybe I don't want to know -- but I can attest that I've never seen Steve be anything less than scrupulously professional. Maybe she'd just had a long day.
