On Mar 25, 2014, at 7:48 PM, Dan Smith - dsmith@danplanet.com wrote:
Chirp does not modify the checksum in the .img file after editing and saving. The checksum is recomputed on the fly on upload, so it's not like the radio will see a bad checksum as a result,
Unless of course, the cable is bad... :)
Good point, and I'd overlooked that, even having ordered a new cable. But read below; the symptoms have gotten way stranger than that.
If it _was_ right out of the radio, then that tells me that the data got corrupted between the radio and the computer (i.e. in the cable, USB adapter, etc). Sounds like we should make CHIRP check the checksum after a clone for at least that radio.
...or the cable is silently corrupting data on the way in and this is the first time it has corrupted something that mattered.
If I were you, I'd modify the driver to check the checksum after download, and then do a bunch of downloads of a good radio with the original cable and see if they occasionally don't match.
Excellent idea, and I will do that. It will be a while until there's any information on that, since I don't have an FT-60 at the moment. No word yet from Yaesu wrt the radio I shipped them Monday.
I might also implement an optional check for a couple of bits in the image being set. See below.
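A post-download check along those lines might look roughly like the sketch below. It assumes an 8-bit additive checksum stored in the final byte of the image; that layout is an assumption made for illustration, not something taken from the CHIRP driver, and the function names are made up.

```python
def additive_checksum(data):
    """Sum of all bytes, truncated to 8 bits (assumed checksum scheme)."""
    return sum(data) & 0xFF


def verify_image(image):
    """Return (ok, computed, stored) for a downloaded image.

    Assumes the last byte of the image holds the checksum of every
    byte before it. This layout is hypothetical.
    """
    computed = additive_checksum(image[:-1])
    stored = image[-1]
    return computed == stored, computed, stored
```

Run after every clone, a check like this would flag a mismatch immediately instead of months later.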
============

I've continued to look at the "bad checksum" files in my collection, and there's a very interesting pattern. No conclusion yet, but quite a bit of information:
- Of the 21 image files with bad checksums (out of 248), 16 were created in one consecutive period spanning Feb 7-8. Every image downloaded in that period has a bad checksum. The image that bricked my two radios on March 22 is the 10th in that sequence of 16.
- The other 5 images with bad checksums do look like files I had edited, and that's more in line with the number of such I remember.
- Of those 16 files, all 16 have the same difference between computed and actual checksum, 0x30.
- Examining the diffs between the first 'bad' file and the 'good' file that immediately preceded it, there are a few differences I recognize as related to the radio state I was examining. There is one difference that I do not recognize: the two bits 0x30 in byte 56 are set. I've pretty well mapped all the feature settings and a lot of nonvolatile radio operating state; these two bits are still "unknown" in my map, and I hadn't seen them set before this.
- These two bits correspond exactly to the checksum discrepancy, as if the radio computed the checksum thinking they were 0.
- None of the other 232 image files in my collection have either of these bits set.
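The relationship between the two bits and the 0x30 discrepancy can be checked mechanically. The sketch below again assumes an 8-bit additive checksum held in the last byte of the image, which is an assumption for illustration; the byte offset and mask come from the observations above, and the function names are hypothetical.

```python
FLAG_OFFSET = 56    # byte index of the two "unknown" bits
FLAG_MASK = 0x30    # the two bits themselves


def checksum_delta(image):
    """(computed - stored) mod 256, assuming an additive 8-bit checksum
    stored in the final byte of the image (hypothetical layout)."""
    return (sum(image[:-1]) - image[-1]) & 0xFF


def has_suspect_bits(image):
    """True if either of the two unmapped bits is set."""
    return bool(image[FLAG_OFFSET] & FLAG_MASK)


def explained_by_suspect_bits(image):
    """True if the checksum error equals the value of the set bits, i.e.
    the stored checksum looks as if it were computed with them clear."""
    return checksum_delta(image) == (image[FLAG_OFFSET] & FLAG_MASK) != 0
```

Running `explained_by_suspect_bits` over a whole collection of images would separate files with this specific signature from files whose checksum is wrong for some other reason, such as hand editing.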
I don't know what happened between 15:32 and 15:53 on Feb 7 that caused this. Looking at the file names involved, I don't believe I did an upload, but I can't be positive of that almost two months later.
Similarly, I don't know what happened between 17:40 and 20:22 on Feb 8 to clear this up. My calendar and email history offer no clues. Break for dinner is probably all. Diffing the last 'bad' file and the first 'good' file at 20:22, the only differences are the couple of expected bits for the settings I was mapping, and the "bits of death" are cleared. So it doesn't look like I uploaded an operational image in between for some reason, such as to use the radio.
==========

I'm struggling to fit the observations into a sensible model of how this happens. I'm willing to believe a bad cable caused this somehow, maybe injecting the two spurious bits on an upload and also miraculously making the checksum match so the radio accepted it one time. I want to say I don't recall any uploads failing, and statistically you'd expect a few failures in that scenario, but there might have been a couple I took as operator error and just retried.
I have much more trouble seeing how a faulty download causes this. A flaky serial interface does not corrupt only two consecutive bits, in the same position, out of ~229,000 bits in a serial stream, 16 times in a row.
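To put a rough number on that, here is a back-of-the-envelope sketch. It assumes a deliberately crude error model in which each faulty download damages one uniformly random bit position, independently of the others; real cable faults need not behave this way, so the result only indicates scale.

```python
import math

BITS_PER_IMAGE = 229_000   # approximate stream length from above
REPEATS = 15               # downloads after the first that must hit the same spot


def log10_prob_same_position(n_positions, repeats):
    """log10 of the chance that `repeats` independent, uniformly placed
    errors all land on one given position (crude assumed model)."""
    return -repeats * math.log10(n_positions)
```

With these inputs the result is on the order of 10^-80, which is why random corruption on the download path is so hard to believe.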
I'm starting to think an internal glitch on the first radio isn't out of the question as the root cause, now that there's a (still very fuzzy) plausible model for how it resulted in killing the second radio.
For the radio to return these two bits in the same position on every one of 16 consecutive downloads, I expect they're actually stored in the flash. If I were designing this, I can see a few possible models for how the checksum is computed and used:
1) Don't use it internally (!), just compute it on the fly when doing a clone write (chirp's download). But then it wouldn't be repeatably wrong on chirp downloads the way it is.
2) Check it internally at boot, and update it on every power down by summing all of flash. But then a bad checksum won't survive power-off.
3) Check it internally at boot, but never actually sum memory on write, except maybe on a clone Rx. Update the checksum on every byte write as a difference. I.e.: Read old byte X, subtract from new byte X, write new byte X, read old checksum, add (newX - oldX), write new checksum. Not as robust, but saves reading all 32KB every power-off. But again, the error we're seeing would cause a bad checksum error on the next boot, wouldn't it?
4) Compute it as in (3) but don't check it, use it only to send along with the data on a clone write. But we have to read all the flash anyway to send it out the serial port, so why not compute it then, instead of complicating every eeprom byte write? I reject this as too brain-damaged to be real.
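The incremental update described in model (3) can be sketched as a toy simulation. Everything here is hypothetical: the class, its size, and the `raw_write` path that bypasses the checksum update are illustrations of the model, not a description of the radio's firmware.

```python
class ToyEeprom:
    """Toy model of (3): checksum maintained as a running difference."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.checksum = 0          # correct for all-zero memory

    def write_byte(self, addr, value):
        """Normal write: adjust the checksum by (new - old), mod 256."""
        old = self.data[addr]
        self.data[addr] = value
        self.checksum = (self.checksum + value - old) & 0xFF

    def raw_write(self, addr, value):
        """Glitch path: memory changes but the checksum is not updated."""
        self.data[addr] = value

    def full_sum(self):
        """Checksum recomputed from scratch over all of memory."""
        return sum(self.data) & 0xFF

    def discrepancy(self):
        return (self.full_sum() - self.checksum) & 0xFF
```

In this toy model, a single `raw_write(56, 0x30)` leaves a discrepancy of 0x30 that survives any number of later normal writes, because each normal write only applies a difference. Whether the radio behaves anything like this is exactly the open question.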
So I haven't yet hit on a model for the radio's use of the checksum that explains the observations. Any ideas?
I also note that the radio booted up at least 15 times Feb 7-8 with these two bits apparently set. (Or 31 if you count the clone-mode power-ons to do the downloads.)
Why didn't it brick at that time? Why wait until they were uploaded as set, followed by a power off/on, on March 22? What's the difference?
-dan