Anatomy of a hang
It's another one of those thread titles that sounds dirty but isn't.
Like every PC ever known, my current desktop PC is going through a painful period. It's experiencing intermittent hangs. Freezes. I'm an experienced problem-solver; I did it for a living for a while. But I have no idea why mine is having this problem, so I figured I'd list what I did, and what I do, to try to figure this one out. The symptom: well, usually it happens in Firefox. I'm opening up tabs, or scrolling, or doing something, when the system stops being responsive. The cursor may go to an hourglass in the Firefox window. For about 10 seconds the system may be able to do other things; I may be able to hit ctrl-alt-del to get to the windows task manager, or I may be able to switch windows, or I may be able to highlight a desktop icon. Or I may be able to do none of this. I think I may usually be doing memory-intensive operations when it happens. I don't think it will happen right now, for example; all I'm doing is typing into the edit box, so I'm thinking it'll stay stable during this period. Just in case, I'll save this post :) It's the "symptoms" part, and next I'll say what I think so far. |
The hang, as far as I know, has only happened when Firefox has been active. But it has happened when Firefox was open and Thunderbird was executed - T-bird being a hugely memory-intensive thing, more so than Firefox.
As a problem solver, my first thought is: what has *changed*? Is there anything I can point to in the last few days that might make a difference? Hardware: a big yes. In fact, I moved my entire room around last week. At that point, a day later, my sound card failed. Coincidence? Who can say, but a failing speaker system might have led to a shorted output, blowing a channel, so I had to replace the whole thing. These days it is hard to get a mid-priced sound card in stores, so I wound up getting an external: a Creative USB solution. But the system had been running correctly with this in place for a week, and was rebooted several times during that period. Software: yes. Two days previously, I had run a "startup cleaner" and turned several things off during startup. This was partly because I was annoyed that Real wanted to remind me to upgrade, those fuckers. Partly because I had been cleaning Jacquelita's system of malware, and I was getting paranoid. I was able to reboot cleanly and run well after that. But shortly thereafter I started to get errors with a hardware monitoring program that watches my motherboard's temperature, fan, and voltages. It would not run at startup, although when I ran it after startup it seemed to run fine. |
OK, so does the motherboard monitoring software say anything interesting? It says my CPU is so cool I could touch it, my fans are running fine, and the power supply is producing good voltages -- while the monitor is watching it -- except for maybe a few little hitches on the 12V rail. I write that off to disk usage.
Although I should say, when the system hangs, the disk activity light seems to be at 100%. Is this problem hardware or software? Of course, it only seems to happen when software is running; the system has remained stable overnight, when it's relatively inactive. So my first thought was that it must be Firefox. The problem started to really get aggravating yesterday, and only appeared in Firefox; Unreal Tournament 2004 worked for an hour, and that'll tax your memory and your whole system harder than Firefox. But was that just chance? I disabled smooth scrolling, since it seemed to happen during scrolling. No effect. I disabled a few extensions that I didn't need, and uninstalled a few that I never used. Often Firefox extensions will make it seem like Firefox itself is unstable. But this had no effect on the problem. |
OK, seems like a memory problem then. So I've taken the 1GB out of my other system, and replaced the 768MB currently in this one. A complete memory upgrade: more memory, faster memory, and identical sticks.
The problem still happens. And that's where it stands now. Next, I'll do the IotD, which will take a few memory-intensive, system-intensive things (maybe even Photoshop), and if this morning has been any example, the whole thing will just freeze up at some point in the middle of all that. It just seems more hardware-ish. The motherboard is an Asus, whatever the most common Socket A AMD-supporting board was a year and a half ago. The capacitors are not bulging. Your thoughts, anyone? |
yeah right
|
Have you done a complete disk check?
|
Is the problem that you're using a PC?
|
Disk check = OK
PC = yes, but if this problem is figured out I can fix it myself for as little as $50, or reuse components to build another for $200, so the point is moot. |
I don't know much about hardware or software but I have some guesses:
1. Firefox has a memory leak; and/or
2. Firefox inadvertently creates multiple references to the same memory address; and/or
3. Firefox creates partially overlapping references to the same memory block; and/or
4. Firefox has some unmapped cases in its algorithms.
I've installed and uninstalled Firefox quite a number of times on several computers and have decided that while I like the program a lot, it needs a few more minutes in the oven. Clarification on point #1: I don't necessarily think it's a true memory leak, because the task list does not reflect excessive memory allocation to Firefox. But the way it starts behaving makes me wonder if somehow Firefox thinks all the memory is tied up and therefore unavailable, leading it to freeze, which somehow cascades into a general PC lockup. I dunno, but I'm past tired of trying to get that program to act right. |
If you're not having problems in any other program, then I'd have to suspect Firefox also.
|
I've used Firefox since 0.7. And, somehow someway, I've been using it for the last two hours without a problem. I've opened and closed 50 tabs; my IotD search method involves opening 25 tabs simultaneously.
I think it shows up in Firefox because Firefox is 90% of what I do. When I can get to the task manager during the hang, it reports Firefox as taking about 100,000 k when the problem seems to happen. There was one odd thing, though... during that disk check, the D: drive seemed to have a moment where it was louder and grindy-er. It got through the check OK, but I'm gonna back that up now and check it harder. There is space allocated on that drive for swap... or whatever Windows calls it... |
Quote:
|
Yes, I've moved all virtual memory off that drive, and I'm now moving all the data off that drive. Because of various incorrect business decisions, I happen to have about 6 80GB drives, so upgrading the drive anyway is not a bad idea.
(If anyone wants one of these Seagate Barracuda 80GB drives, let me know. Unused. $50 plus shipping. PM me.) |
I had the same problem, but could never tell if it was memory, disk, or motherboard. Does your HDD access light go solid when you freeze up as well?
Because it was so infrequent and never resulted in an OS or app dump, I could never diagnose it. Troubleshooting it by swapping hardware would have proven too expensive. |
Yes, it goes solid... where I have the system, under the desk, I don't notice the light so much. But every time I've checked, during the hang, that light is 100% on.
|
Quote:
I'm tempted to blame the on-board IDE controller, just because of the HDD access light, but who knows? Let me know if you want any of the hardware manufacturer information or model numbers. Maybe there is something in common. |
Moving files from D: to C:, the system got really slow during one set of files, and hung during one particular file.
When I returned to explore that folder, the system hung again while I was just browsing the suspect directory. I didn't even open any of the files. Luckily I don't need any of those particular files. I've been able to move just about everything else off of that partition. Sadly there is another 20GB partition to move before the entire disk can be swapped out. But we have a key suspect in the hangs. Sadly we still don't even know whether it's hardware or software at its root, right? It's either bad sectors on the drive, the handling of which is causing Windows internals to completely barf, which really shouldn't happen, or an NTFS filesystem problem which Windows' own disk check failed to find. Not looking good for Mr. Gates. Updates to follow. |
Quote:
This reminds me of an issue that took me months to diagnose on some servers at work. Check out this IBM system hang from hell: Quote:
|
Quote:
Meanwhile, what numbers are you using for 'good voltage'? What are you using (what program) for testing the NTFS filesystem? Did you download the hardware diagnostic for that disk drive and execute only 'read only' tests? The hardware test is independent of the NTFS filesystem test - it sometimes provides useful information. One final point. Confirm that the BIOS settings still agree with what the drive actually is. I have seen where the BIOS refused to see a drive properly - it slowly destroyed disk data structures as NTFS kept fixing them. What are you using for file transfers? Most copy programs have an option to ignore errors - to complete the file transfer. Few hardware items can hang a pre-emptive MT system. They include memory, CPU, only some functions in the peripheral interface, and the video controller. A disk drive with internal problems should not hang an MT system. It should only hang the task. That is the list of usual suspects. Start with the volt meter. Don't use the motherboard monitor until after it is calibrated with that meter. |
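For anyone who wants to try the "ignore errors and finish the transfer" approach without a dedicated copy program, here is a minimal sketch in Python. The function name `salvage_tree` and any paths you pass it are hypothetical, made up for illustration. It walks a source folder, copies whatever it can read, and records the files that raise an I/O error instead of stopping. It cannot help with a read that hangs outright rather than returning an error.

```python
import os
import shutil


def salvage_tree(src_root, dst_root):
    """Copy every readable file under src_root into dst_root.

    Returns a list of (path, error message) pairs for files that could
    not be copied, so one bad file does not abort the whole transfer.
    """
    failed = []
    for folder, _subdirs, files in os.walk(src_root):
        # Recreate the same relative folder layout under the destination.
        rel = os.path.relpath(folder, src_root)
        target = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target, exist_ok=True)
        for name in files:
            source = os.path.join(folder, name)
            try:
                shutil.copy2(source, os.path.join(target, name))
            except OSError as err:
                # Note the failure and carry on with the next file.
                failed.append((source, str(err)))
    return failed
```

The returned list doubles as a record of which files sit on the suspect part of the disk, which may be worth comparing against whatever the drive diagnostic reports.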
Well I hadn't done a sector-level check yet; I wanted to just get the data off ASAP, which is now done (after yet another hang). It's doing the sector-level check right now.
Still don't even know 100% that it's the disk -- that's what's frustrating about diagnosis like this. But that's part of why I wrote it here - I figured, hey other people might enjoy the drama of watching someone else's guessing session. |
Quote:
|
I've now measured it with the multimeter.
Motherboard monitoring program says 12v+ is 12.41
Multimeter says 12.26
Motherboard monitoring program says 5v+ is 5.06
Multimeter says 5.11
The monitoring program says the 12v+ dips just slightly, to 12.35.
(Microsoft's) sector-level check now complete: checks out OK. |
Quote:
Based upon those numbers, set alarm points for the voltage monitor to 11.86 or 11.9 (for 12 volt) and to 4.9 volts (for 5 volt). Doubt it will provide any further information. But disk drive manufacturer's test program for that disk also could be used for seek tests, various multisector access tests, and other things that Microsoft program does not accomplish. Do this if short on ideas. |
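The calibration idea can be written down as a few lines of arithmetic. This is only a sketch: the readings are the ones posted above, the alarm floors are the ones suggested in this reply (taking 11.9 for the 12 volt rail), and the function names are made up. It treats the multimeter as the reference, works out how far off the monitor reads, and applies that offset before comparing a monitor reading against the alarm point.

```python
# (monitor reading, multimeter reading) in volts, as posted in the thread.
READINGS = {"12V": (12.41, 12.26), "5V": (5.06, 5.11)}

# Alarm points suggested in the reply (11.9 chosen for the 12 V rail).
ALARM_FLOOR = {"12V": 11.9, "5V": 4.9}


def monitor_offset(monitor, meter):
    """Correction to add to a monitor reading, treating the meter as truth."""
    return round(meter - monitor, 2)


def corrected(monitor_value, offset):
    """Monitor reading adjusted by the calibration offset."""
    return round(monitor_value + offset, 2)


def below_alarm(rail, monitor_value):
    """True if the corrected reading falls under the alarm floor for that rail."""
    offset = monitor_offset(*READINGS[rail])
    return corrected(monitor_value, offset) < ALARM_FLOOR[rail]


if __name__ == "__main__":
    # The 12 V dip to 12.35 reported by the monitor, after correction.
    dip = corrected(12.35, monitor_offset(*READINGS["12V"]))
    print("corrected 12 V dip:", dip, "alarm:", below_alarm("12V", 12.35))
```

With these numbers the monitor reads the 12 volt rail about 0.15 high and the 5 volt rail about 0.05 low, so the reported dip to 12.35 corresponds to roughly 12.2 on the meter, still above the suggested alarm point.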
My 2 cents. A multimeter might, but my son and I had a bad disagreement about a power supply. The meter showed it good. But a PS tester said no go.
About 15 bucks. POWMAX atx power tester. |
I agree with Buster, here. I've had some weird PS shit happen, and a new cheap one installed fixed my problems.
|
Quote:
A best test of a power supply is to take numbers while multitasking is accessing every peripheral - disks, floppy, CD-Rom, network, sound card - simultaneously. Anything done by a power supply tester can be performed by the meter. Also are power supply defects that a tester cannot detect; but meter can. The power supply tester cannot test a power supply under full load - when many defects become apparent. Then there is the rest of a power supply 'system'. It’s not just the power supply that must be tested. This also accomplished without disconnecting anything. The down side of a meter is that these tricks must be understood. For example, what voltage would you have called 'good'? Best way to test a power supply is when connected to system. Never start by disconnecting things until long after relevant facts have been collected. Power supply tester cannot do that. Just another reason why a meter finds problems or confirms power supply integrity so much faster. Unfortunately, too many declared a 'subject' good rather than provide those numbers. Those numbers - such as UT's numbers - tell us more about the system that has not been discussed. This is why those other voltage numbers (not yet provided) might be informative. |
Also, check out your northbridge fan. I had similar intermittent failure on my fileserver -- turns out the nb fan was choking, and data was getting corrupted on the way to the drive... I lost two years' worth of email. :(
Oh, yeah -- now I back it all up three ways. :) |
Did it? For example, in a GM car, they kept replacing the computer. There was nothing wrong with the computer. But computer was replaced rather than first learn WHY failure was happening. Swapping was only temporarily cleaning a defective connector. Car would fail again later.
Same lessons are from Challenger. Management insisted that it was safe to launch because a shuttle safely launched one year previously. They ignored the near burn through of O rings in that one year previous flight. They did not want to know why. A perfect example of fixing things without first learning the whys. In that case, we should have called Challenger murder. Instead we destroyed the career of the engineer who told the truth to the Rogers Commission. Instead too many insist they need not know why - if it appears to work. In a third case, a GM shop foreman finally got tired of same GM model (Buick) with similar problems. So he broke open the computer. In each failure, the PC board was cracked in a corner. Regional rep then told him this is a known problem even though it was not in any service bulletin. Since the test facility was not informed of this problem, then vehicle computers tested OK and were shipped as repair parts. At GM, because reasons why were not important, then failure was acceptable. Numerous examples that also explain why I see this so often with clone computer users. They get used to having failure as a norm. It is the difference between just swapping parts to fix something - curing symptoms - versus fixing something right the first time - learning why. |
Quote:
That is the Home Improvement joke. Fix things with "more power". If a fan is required on that Northbridge, then the Northbridge IC is defective. One chassis fan is more than sufficient cooling for most every computer. That one chassis fan will provide sufficient airflow over any Northbridge. But again, that Northbridge must work just fine when so warm as to be uncomfortable to touch. Learned this from the old-timers who used to say in the 1960s - if it does not leave skin, then it is not too hot. Today, our ICs must run normally at even higher temperatures. A Northbridge fan suggests the Northbridge IC is defective. Someone cured the symptom rather than fix the problem. Heat is a diagnostic tool. |
So far so good - the new drive is in, partitioned, formatted and is now getting all the data from the old drive, and -- no hangs in hours.
The outgoing drive is an IBM Deskstar 60GB - not considered the most reliable of drives. It has a manufacture date of Jun 2001 so it has seen enough duty and can be retired. That N-bridge fan, Pie, is a pet peeve of mine in motherboard designs. I can't believe manufacturers decided the best way to handle that problem was to put a dinky, weak fan right where all the dust in the system will flow and clog it up. This MB has a heat sink there, much better idea. |
Keep us posted, UT. I'd be interested in how the new disk fares, as it might be the solution to my shuttle issue.
|
Quote:
|
Quote:
|
I didn't need those particular files, so I just deleted them by deleting their parent directory from an explorer window. It could have just been coincidence that the hang happened when "revisiting" those particular files.
Even now, if this fixes the problem, it's hard to figure out what really happened (and not worth the time to diagnose more completely). It was probably the drive failing, but still, Windows should fail more gracefully when faced with a resource that's having trouble. It surprised me when the system hung even when I took virtual memory duties away from that drive. I could see a failing drive causing an OS a headache when it's swapping to it, but when doing more "routine" I/O, just reading files or folders, it shouldn't just lose its place that badly. Maybe the drive was failing harder and drawing too much power in spikes, and thus causing other hardware problems? |
Quote:
|
Quote:
Yes, meters do not necessarily report RMS voltage: they lie. But that is what makes many meters so good at identifying bad power supplies. Again, note numbers provided because of how meters typically work. I am more than just a tech. We designed power supplies even in the 1970s. Have even demonstrated on a system that was intermittent - the supply was not providing power as claimed. System would boot and mostly work. And then we put a meter on it. Quite obvious that a clone power supply could not service the load - even though the owner insisted supply was replaced and now working. Meter demonstrated otherwise. Been doing this stuff for too many decades. I prefer an oscilloscope because it says faster what I want to learn. But the meter is how field problems are identified or eliminated quickly as a suspect. A $15 tester, among other things, does not provide a sufficient load for testing. It can declare a power supply bad but it cannot declare a power supply as good. BTW, one final point. Notice that tester did not get hot and did not contain fans. Fans would be required if tester sufficiently loaded a power supply. Just another reason why power supply is best tested (and tested faster) still inside the computer. Just another example of why 'learning why' makes those meters so superior a solution. BTW, do you still have a VTVM? I have a wee bit of knowledge and experience. |
Spoke too soon, it just hung again.
|
Quote:
Disk drive computer talks to motherboard computer using a fixed set of commands - similar to how networking works. There is nothing electrical in a disk drive that would hang a computer. Except when a computer is not so resilient - booting. Have never seen a disk drive hang any NT system except during boot. During simplistic boot programs, the software may sit waiting for a response forever - a hang. Have seen tasks hang due to a disk drive problem. Have seen NT slow to a crawl due to a bad disk. But never had an NT system lock up so that Task Manager would not operate - except when Task Manager could not load from that drive. Marginal conditions can occur on disk hardware causing a drive's computer to not respond or reply to commands. It is why software designed to test hardware (ie from IBM) is so much better at testing disk hardware; rather than software designed to test Windows interface to hardware (Microsoft). This being only background information - when that next drive fails. Meanwhile a drive failure should have been recorded in Microsoft's event (system) log. Find it using HELP. Also the drive hardware (an IBM creation) would have data to indicate ongoing failures. Forgot what they call that function - smart something. Just another reason why IBM hardware test software could have been more useful - I believe it is now a Hitachi product. |
Whaddya know, the event log.
I should have known about that! I take it back about Mr. Gates. The event log has numerous bad block errors listed for drive D, even after the drive has been replaced. Therefore these errors are probably not actual errors, but a failing controller thinking they ARE errors. |
Quote:
As noted earlier, heat is a diagnostic tool. Drive D is the original offending drive? Well, it may have bad drivers/receivers on its computer board. It might cause motherboard computer to not communicate with a C: drive computer. IDE bus is a network cable where each computer - drive computer from each disk and the motherboard computer all share time talking on that cable. Therefore problem could be slave drive computer, master drive computer, south bridge IC on motherboard, etc. This is what the hairdryer does. To make intermittents more frequent by applying heat. Find failures by running parts hotter - then do not fix those parts with more fans. Hairdryer that causes any computer part to fail - that part is 100% defective. And that part will get worse with age. |
cold spray works the other way. If ya think it's hot give it a shot of freeze ass.
Quote:
|
Quote:
So, UT, any verdict? |
Yes, Newegg overnight shipping rocks!
It doesn't make sense to isolate which particular chip is having trouble because A) they're all on the same board, and the fix is the same: replace the whole board; and B) the parts are so close together now, that heating one particular part without heating any other is nearly impossible. At the least it requires that the board be completely de-cased and set up completely differently. So I've ordered a new motherboard from Newegg. It's only $102, so what the hell. The problem is that my old board is too old and they don't sell it any longer. So I had to get a new board. But my processor is pretty old too and I sure would like to get something that supports SATA since I have a big old SATA drive just sitting here. So I decided to get a much newer board with more capabilities. Of course that meant changing out the video card too, because AGP is now out in favor of PCI Express. There's another $150. And of course the processor. Doesn't make sense not to buy a 64-bit processor today; and if you get one with 1MB of cache you get another speed increase, so that makes sense. $215. And well, it turns out that modern motherboards have a new 24-pin power connector. And there's a new feature of video cards called SLI where you can tie two video cards together, if you have the power capability for it; SO I got a new $80 power supply as well. And with the beauty of Newegg overnight and rush processing, it's on a FedEx truck right now, headed my way. By tonight the old problem should be completely gone, unless it's something *really* funky in software. And then, I'll have an entirely new set of problems: making sure all the drivers are in line and updated so the thing runs right with the new hardware. I think, at one point in this whole mess, I complained about people buying a whole new computer to fix their computer issues. This will be almost what I will have done. Of course it's mostly out of the urge/need to upgrade anyway, I rationalize. |
Show us the shopping list, in case some of us should get an upgrade attack. Who ever forbid.
|
I'd bet my bippy it is power related. Since the sound card failed around the same time I suspect you got a surge when moving things. Maybe it just juiced the mobo a bit too much or the power supply took a knock. Power supply failures are a total bitch to solve. I had one on an old machine... the only time it froze was when I loaded a page with flash. I have no idea why flash did it, but it did. New computer, same general setup = no problems. Power issues manifest in really really strange ways.
|
All bets are off. I'm running now with all that new gear, but what do I find in the Event Viewer? You guessed it,
Quote:
No HANGS yet, and in the past these errors have logged more often. But with a new controller, new drive, new everything, it is infuriating to see those errors in the log. |
Wow, UT. That sux pond water.
Is it possible the damage was done before the switchover? You might try fixing the disk and have it do a full surface scan. |
Quote:
Maybe this message doesn't refer to the D: drive at all? I think it's referring to the C: drive, which, in theory, is "Hard Disk 0". :smack: This morning I have done a complete chkdsk of the C: drive. It found some problems, though its reporting leaves a lot to be desired. In the end it listed 4KB in bad sectors and did make changes to the filesystem. Next I'll do a complete chkdsk of the E: drive (which is the other partition on that same disk). |
Actually, I think harddisk 0 refers to the physical drive, not the virtual (dos/windows) drive.
|
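One way to settle which physical disk the errors point at is to tally them by device name rather than by drive letter. A rough sketch in Python, assuming the System log has been exported to a text file and that the disk errors name the device in the form `\Device\HarddiskN\...`. The sample lines below are invented for illustration and are not copied from UT's log; the function name is hypothetical too.

```python
import re
from collections import Counter

# Picks the physical disk index out of a path such as \Device\Harddisk0\...
DEVICE_RE = re.compile(r"\\Device\\Harddisk(\d+)", re.IGNORECASE)


def count_disk_errors(lines):
    """Count bad-block messages per physical disk index."""
    counts = Counter()
    for line in lines:
        match = DEVICE_RE.search(line)
        if match and "bad block" in line.lower():
            counts[int(match.group(1))] += 1
    return counts


# Hypothetical sample lines in an assumed export format.
sample = [
    r"Disk: The device, \Device\Harddisk0\D, has a bad block.",
    r"Disk: The device, \Device\Harddisk0\D, has a bad block.",
    r"Disk: The device, \Device\Harddisk1\D, has a bad block.",
    r"Cdrom: The device, \Device\CdRom0, has a bad block.",
]

if __name__ == "__main__":
    for disk, hits in sorted(count_disk_errors(sample).items()):
        print(f"Harddisk{disk}: {hits} bad block message(s)")
```

The number after `Harddisk` is the physical disk as Windows enumerates it, so a single disk with two partitions would show up under one index for both of its drive letters. If the logged path ends in something like `\D`, that is, as far as I know, part of the device name and not a drive letter, which would fit the confusion between "drive D" and "Hard Disk 0" earlier in the thread.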
In order to diagnose the problems in my "A" system, I replaced its old memory with newer memory stolen out of my "B" system.
B memory into A -> A works well
A memory into B -> B doesn't boot.
So. Add $80 of new matched 1GB Kingston memory to A, and take the 1GB out of A and put it back in B.
B memory into B -> B doesn't boot.
Fuuuuuuuck
Now system B is dual-boot; the IDE drive boots into Windows, the SATA drive boots into Linux (Fedora). Set it to boot into Linux and it does that fine. The BIOS recognizes both the IDE and the SATA. So, maybe the Windows boot record got messed up somehow during all the various rebooting and such. |
Oh yeah, so to finish this whole scenario, I put one of the sticks of A memory into the B system, to boot Linux with 1.5GB instead of 1GB, and it won't boot into Linux! So the whole process has also killed a stick of memory. Will this madness never end?
|
Quote:
Memory test - or what burn-in really is. Don't swap memory. Run a comprehensive memory diagnostic - either one provided by a responsible computer manufacturer or a third party diagnostic such as Memtest86 or Docmem. Execute diagnostic one or two passes. Even bad memory sometimes passes that test. And then we use burn-in - a concept completely misunderstood by those who used English interpretation to assume burn-in means running overnight. Heat memory with a hairdryer on highest setting. A tropical paradise to good memory and hell to bad memory. Bad memory heated above 100 degree F often will expose itself as the pervert it really is. Otherwise move computer outside to 30 degree weather and let it run the same memory diagnostic for maybe an hour. Accomplishes same thing that busterb discussed with coolant spray. And yes, once I heated the oven to just over 100 degrees, put the clone computer in that oven, and found a defective cache RAM. If memory passes both heat and cold test, then memory is fine. Move on to other suspects. If a memory stick has been damaged, well that is but another reason to not shotgun. I have watched others swap memory because the new memory was defective. They did not use anti-static protection which is especially critical if room humidity is below 40%. Therefore memory that worked just fine on a memory diagnostic at 70 degrees was really defective - maybe static damaged. Just another problem with shotgunning. A problem created by another flawed assumption that parts (once thought to be good) will always be good. Don't swap things. First collect facts. As a result of shotgun diagnostic techniques, you are now spending vast sums of money. And yes, I am also concerned with that Harddisk 0 being drive D. Something is wrong - just another fact that should be collected before changing anything. Does your drive have multiple partitions or were you running a master / slave combination as I originally asked?
Answer is found in Disk Management program - among other places. Meanwhile, get the comprehensive diagnostic from the hard drive manufacturer. Why? Among other things, because a diagnostic eliminates many unknown variables - ie Windows which is a massive variable. Every test is about stripping a problem down into parts - and then testing those parts - all without physically changing hardware. Don't even look at Windows until hardware diagnostics declare hardware good. And yes, that also means temperature cycling - ie the hair dryer - also called burn-in. |
Quote:
For example, what could be causing all problems? An improperly crimped wire in the disk drive cable. But again, only wild speculation from a long list of possible reasons. The reason why I am answering this is that those who don't know why failures happen then jump to the myths promoted by power strip protector vendors. The lights dimmed - therefore it must be a surge. How does a voltage drop become a massive voltage increase? But again, this is how myths are promoted - technical details never learned before declaring a conclusion - or why George Jr could preach that Iraq was a threat to the US - the mythical WMDs. It is why military academies graduate engineers - people who learn why underlying facts and details must first be learned. The claims of 'power related' damage are just too often a myth for too many reasons. Often found where people shotgun rather than first learn facts. Those claims of WMDs - classic example of shotgun reasoning. |
Modern memory can't be heated with a hair dryer, because the DIMMs have aluminum "heat spreaders".
|
Quote:
|
OK, now I'm stuck.
I want to copy bad drive C to good drive D. I want D to be bootable so that after copying, I can just remove C, make D a master, the new C. In this case C and D are pretty much identical drives. In the past, I've done this with Partition Magic. But now, PM refuses to do it because it finds bad sectors on C. I guess I could copy C to an external, put D in place as the new C, install XP to C, and then copy everything from the external to C. But isn't there a better way? |
Might not be a bad idea to do that, anyways. At least if you move everything to an external disk you'll have the ability to do a good format and even use some serious disk sector checking before you move it all back.
|
Urgh. There are some files in /windows that you can't copy while windows is booted, so I can't just drag and drop /windows to D: and I can't copy them to the external.
And you can't recursively copy directories in windows recovery console mode. If I reinstall XP, those files will then be open when the new copy of XP boots, and won't stand for being overwritten by the old copies. Can I copy those files in safe mode? |
Quote:
|