Archive for the ‘Work’ Category

Find the cause of a Vista blue screen

Friday, May 16th, 2008 by Steve

I finally managed to get to the bottom of my Vista blue screen problem, so I thought I’d share how I determined which driver was causing the problems.

Vista keeps a log of application and kernel crashes in Control Panel -> Problem Reports and Solutions -> View problem history:

Vista problem reports

Double-clicking the latest Windows “shut down unexpectedly” entry shows the blue screen details. These don’t give much useful information; in particular, they don’t say which driver was responsible:

Problem report detail

Clicking “View a temporary copy of these files” opens an Explorer window containing the crash dump file, which you can copy to your own directory.

To analyse the crash dump you’ll need to install the Microsoft Windows Debugging Tools (17MB msi).  This adds a whole set of command line tools under “C:\Program Files\Debugging Tools for Windows (x86)”.  Use the dumpchk.exe tool to analyse the crash file:

Start examining the crash dump

Crash dump analysis result

And there’s the culprit: “Probably caused by: eacfilt.sys”. This is the driver used by Nortel’s Contivity VPN client. I’m using the “Vista-friendly” version, which worked fine before I applied Vista SP1, but I guess SP1 broke its driver. The solution to all my problems? Uninstall it!
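If you have several dumps to triage, you can save dumpchk’s output to text files (for example dumpchk.exe crash.dmp > report.txt; both file names are hypothetical) and pull out the “Probably caused by” lines with a few lines of Python. This is a minimal sketch. The exact shape of that line is an assumption based on the output shown above, and the debugger sometimes appends the faulting symbol after the module name.

```python
import re
import sys

# Assumed line shape, based on the dumpchk output above. The symbol in
# brackets is a made-up example of what the debugger may append:
#   Probably caused by : eacfilt.sys ( eacfilt+1a2b )
CULPRIT = re.compile(r"Probably caused by\s*:\s*(\S+)")


def probable_culprits(report_text):
    """Return every module named on a 'Probably caused by' line."""
    return CULPRIT.findall(report_text)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Each argument is a text file holding saved dumpchk output.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as report:
            for module in probable_culprits(report.read()):
                print(f"{path}: {module}")
```

If the same driver name keeps turning up across several reports, it is a good candidate to update or uninstall.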

Uninstalling Nortel’s Contivity VPN client

Hurrah! My T61’s suspend and hibernate work again!

Vista SP1 blue screen resuming from hibernate or suspend

Saturday, April 12th, 2008 by Steve

Since installing Service Pack 1 on Vista, my shiny new laptop (ThinkPad T61) has had a problem coming out of a hibernated or suspended state. When resuming from hibernation or suspend, it gives me the BAD_POOL_CALLER error (and automatically reboots) roughly 50% of the time. It’s so bad I’ve stopped using hibernate and suspend entirely.

I found a possible solution on the Lenovo forum: apparently the T61’s UPEK fingerprint reader driver 1.9.2.99 can be responsible. I’ve installed version 1.9.2.111 (downloaded directly from UPEK), but I still get blue screens if I hibernate.

Other drivers known to be incompatible with SP1 are listed on Microsoft KB 948343, but I’m pretty sure I’m not running any of them. Any ideas?

Update (16th May 2008): The problem turned out to be Nortel’s Contivity VPN client.  They don’t appear to have released an updated version since SP1 was released.  I no longer have a need for this VPN client, so I simply uninstalled it.  Problem solved!

While I was trying to get to the bottom of this I read many suggestions. Dodgy memory seems to be a common cause. You can check for it by booting a memory tester (such as memtest86+) and leaving it running for a few hours.

Protecting against SSH brute-force password attacks

Sunday, January 27th, 2008 by Steve

I run an internet-facing SSH server, so my logs are regularly full of brute-force password attacks like this:

Jan 20 02:59:21 drevil sshd[12803]: error: PAM: Authentication failure for illegal user root from 213.136.100.86
Jan 20 02:59:24 drevil sshd[12806]: error: PAM: Authentication failure for illegal user root from 213.136.100.86
Jan 20 02:59:27 drevil sshd[12816]: error: PAM: Authentication failure for illegal user root from 213.136.100.86
Jan 20 02:59:30 drevil sshd[12820]: error: PAM: Authentication failure for illegal user root from 213.136.100.86
Jan 20 02:59:34 drevil sshd[12827]: error: PAM: Authentication failure for illegal user root from 213.136.100.86
Jan 20 02:59:37 drevil sshd[12830]: error: PAM: Authentication failure for illegal user root from 213.136.100.86
Jan 20 02:59:40 drevil sshd[12833]: error: PAM: Authentication failure for illegal user root from 213.136.100.86
Jan 20 02:59:44 drevil sshd[12836]: error: PAM: Authentication failure for illegal user root from 213.136.100.86
Jan 20 02:59:47 drevil sshd[12840]: error: PAM: Authentication failure for illegal user root from 213.136.100.86
Jan 20 02:59:51 drevil sshd[12843]: error: PAM: Authentication failure for illegal user root from 213.136.100.86
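To see who is hammering the server, you can tally lines like these per source address and user. Here is a small Python sketch that assumes the exact PAM message format shown above. Other sshd messages (for example “Failed password for …”) would need extra patterns. On Debian these lines normally end up in /var/log/auth.log.

```python
import re
from collections import Counter

# Matches the PAM failure lines shown above; other sshd message formats
# are deliberately ignored by this sketch.
FAILURE = re.compile(
    r"sshd\[\d+\]: error: PAM: Authentication failure for "
    r"(?:illegal user )?(\S+) from (\S+)"
)


def failures_by_source(lines):
    """Count PAM authentication failures per (source IP, user) pair."""
    counts = Counter()
    for line in lines:
        match = FAILURE.search(line)
        if match:
            user, source = match.groups()
            counts[(source, user)] += 1
    return counts
```

For example, with open("/var/log/auth.log") as log: print(failures_by_source(log).most_common(5)) lists the five noisiest source/user pairs.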

There are several simple ways of reducing the chance of a break-in through this method:

1. Use strong passwords

This is an obvious place to start. The vast majority of these attacks come from automated scanning tools, which try to log in using passwords from a “dictionary” of commonly used ones, so avoid simple words like “password”. A combination of numbers, lower and upper case letters, and even symbols (!”£$%^&*) gives a password that is unlikely to appear in a “common passwords” dictionary.
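If you’d rather not invent passwords by hand, Python’s secrets module can generate one that mixes all four character classes. This is a sketch only. The 14-character default length is an arbitrary choice, and the symbol set leaves out £ and the double quote, since those can be awkward to type on some keyboards or in shells.

```python
import secrets
import string

# Digits, lower case, upper case and a few symbols, as suggested above.
CLASSES = [string.digits, string.ascii_lowercase, string.ascii_uppercase, "!$%^&*"]


def generate_password(length=14):
    """Random password containing at least one character from each class."""
    if length < len(CLASSES):
        raise ValueError("length must allow one character per class")
    alphabet = "".join(CLASSES)
    while True:
        candidate = "".join(secrets.choice(alphabet) for _ in range(length))
        # Rejection sampling: retry until every class is represented, which
        # keeps the choice uniform over all acceptable passwords.
        if all(any(ch in cls for ch in candidate) for cls in CLASSES):
            return candidate
```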

2. Restrict the users who can connect via ssh

OpenSSH can take a “white list” of allowed users and deny all others. Simply add this line to /etc/ssh/sshd_config (its usual location on Debian) and restart the sshd service:

AllowUsers dave mike sarah

This will block attempts to connect as any of the common system users (root, postfix, mysql etc), EVEN if the attacker guesses the correct password. If this list is kept as small as possible, it is much easier to verify these users have strong passwords.
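The check sshd makes here is essentially “does the login name match any pattern in the list?”. AllowUsers patterns may use the * and ? wildcards. The Python sketch below is a simplified model, not sshd’s own code. It only looks at the user part of USER@HOST patterns, and it ignores DenyUsers, AllowGroups and DenyGroups.

```python
from fnmatch import fnmatchcase


def allowed_by(user, directive):
    """Would `user` pass an 'AllowUsers ...' line? (simplified model)"""
    keyword, *patterns = directive.split()
    if keyword.lower() != "allowusers":
        raise ValueError("expected an AllowUsers line")
    # Only the USER part of a USER@HOST pattern is checked here, so this
    # can be more permissive than sshd for host-restricted entries.
    return any(fnmatchcase(user, pattern.split("@", 1)[0]) for pattern in patterns)
```

With the line above, allowed_by("dave", "AllowUsers dave mike sarah") is True, while allowed_by("root", ...) is False whatever password the attacker tries.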

3. Rate limit new ssh connections

A simple iptables script can rate limit new incoming connection attempts. There are two ways of doing this: with the limit module or with the recent module. Here’s the limit solution:

iptables -N NEW_SSH
iptables -A INPUT -p tcp --dport 22 -m state --state NEW -j NEW_SSH
iptables -A NEW_SSH -s 10.0.0.0/24 -j ACCEPT
iptables -A NEW_SSH -m limit --limit 3/min --limit-burst 3 -j ACCEPT
iptables -A NEW_SSH -j DROP

The third line ensures that connections from the internal network (in this example 10.0.0.0/24) are not subject to rate limiting. The weakness of this approach is that the limit is shared by every outside address, so while an attack is underway, ALL new SSH connections from outside are blocked. The recent module allows a slightly different approach (taken from Debian Administration):

iptables -N NEW_SSH
iptables -A INPUT -p tcp --dport 22 -m state --state NEW -j NEW_SSH
iptables -A NEW_SSH -s 10.0.0.0/24 -j ACCEPT
iptables -A NEW_SSH -m recent --set
iptables -A NEW_SSH -m recent --update --seconds 60 --hitcount 4 -j DROP
iptables -A NEW_SSH -j ACCEPT

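This rule set “blacklists” any IP address that has already made three or more connection attempts in the last 60 seconds, dropping its new connections while still allowing other IP addresses to connect. Because --set records every attempt, including the dropped ones, an attacker who keeps retrying stays blocked. If a connection makes it past this rate limiting, we accept it (last line).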
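The difference between the two rule sets is easier to see in a simulation. Below is a rough Python model, not the kernel’s code. The limit match is treated as a single token bucket shared by all sources. The recent match keeps a per-source list of timestamps. Both are approximations: for instance, the real recent module also caps how many timestamps it stores per address.

```python
from collections import defaultdict, deque


class LimitMatch:
    """Approximation of '-m limit --limit 3/min --limit-burst 3'.

    One token bucket for ALL sources: it starts with `burst` tokens, each
    accepted connection spends one, and tokens refill at rate_per_min.
    """

    def __init__(self, rate_per_min=3, burst=3):
        self.rate = rate_per_min / 60.0   # tokens regained per second
        self.burst = burst
        self.tokens = float(burst)        # the bucket starts full
        self.last = None

    def accept(self, t):
        """True if a new connection at time t (seconds) is accepted."""
        if self.last is not None:
            self.tokens = min(self.burst, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                      # falls through to the DROP rule


class RecentMatch:
    """Approximation of '-m recent --set' followed by
    '-m recent --update --seconds 60 --hitcount 4 -j DROP'.

    State is kept per source address, so one attacker can't lock out others.
    """

    def __init__(self, seconds=60, hitcount=4):
        self.seconds = seconds
        self.hitcount = hitcount
        self.hits = defaultdict(deque)    # source IP -> attempt timestamps

    def accept(self, source, t):
        """True if a new connection from `source` at time t is accepted."""
        hits = self.hits[source]
        hits.append(t)                    # --set records every attempt
        while hits[0] < t - self.seconds:
            hits.popleft()                # forget attempts outside the window
        return len(hits) < self.hitcount
```

For example, LimitMatch accepts connections at 0, 1 and 2 seconds, drops one at 3 seconds, and accepts again at 25 seconds once a token has refilled, whoever is connecting. RecentMatch drops the fourth attempt from 198.51.100.7 at 0, 3, 6 and 9 seconds, but still accepts a simultaneous connection from a different address (both addresses are documentation examples).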

4. Run your ssh server on a different port

The automated scanners look for SSH services on the default port (22), so if you move your sshd to a non-standard port, fewer scanners will find you. It’s worth noting that this approach doesn’t improve security at all against a determined attacker. Personally I don’t use this technique; my SSH servers run on port 22.

Dell D630 display options

Friday, January 18th, 2008 by Steve

I’m trying to buy a new laptop from Dell, but they aren’t making it easy for me! I’ve got everything prepared: a great broadband service, budget for the laptop etc. All I need is for Dell to let me place an order for the actual laptop specification I want!

My old laptop is a Dell D600, so I’m looking at the equivalent D630. When I bought the D600 there were two display options: XGA (1024×768) or SXGA+ (1400×1050). I went with the higher resolution option, and it’s been fantastic.

According to the product pages, the D630 also has two options: WXGA (1280×800) or WXGA+ (1440×900). I can live with the slightly lower widescreen resolution of 1440×900, but 1280×800 is just too much of a step down.
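Comparing raw pixel counts shows the size of the gap. WXGA+ keeps about 88% of the SXGA+ pixels, and WXGA only about 70%. This is quick arithmetic on the resolutions quoted above and says nothing about physical screen size.

```python
# Panel options discussed above: the D600's SXGA+ and the D630's two choices.
MODES = {
    "SXGA+ (D600)": (1400, 1050),
    "WXGA+ (D630)": (1440, 900),
    "WXGA (D630)": (1280, 800),
}


def pixel_count(mode):
    width, height = MODES[mode]
    return width * height


def share_of_sxga_plus(mode):
    """Fraction of the D600's SXGA+ pixel count that `mode` offers."""
    return pixel_count(mode) / pixel_count("SXGA+ (D600)")


for mode in MODES:
    print(f"{mode}: {pixel_count(mode):,} pixels, {share_of_sxga_plus(mode):.0%} of SXGA+")
```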

Unfortunately, the WXGA+ option is missing from the UK “customise and buy your laptop” section. Only one option is listed, and it’s the low-res one:

Dell D630 display options UK

A visit to the Dell USA website shows the option exists over there:

Dell D630 display options USA

I don’t really want the hassle of ordering a laptop over there, getting it shipped over here, replacing the USA keyboard with a UK one…

Ah well, there must be plenty of other laptop manufacturers who WILL give me a high-res screen…

Windows Server 2003 DNS serial number problems

Sunday, January 6th, 2008 by Steve

I’ve been having a recurring problem with my Windows Small Business Server 2003. Sometimes when I reboot it, it decrements the serial number of one of its DNS zones. This causes repeated warnings to be logged on a Linux slave DNS server:

Dec 3 06:53:49 drevil named[2765]: zone 20.0.10.in-addr.arpa/IN: serial number (61) received from master 10.0.20.10#53 < ours (62)
Dec 3 07:03:48 drevil named[2765]: zone 20.0.10.in-addr.arpa/IN: serial number (61) received from master 10.0.20.10#53 < ours (62)
Dec 3 07:11:26 drevil named[2765]: zone 20.0.10.in-addr.arpa/IN: serial number (61) received from master 10.0.20.10#53 < ours (62)
Dec 3 07:21:24 drevil named[2765]: zone 20.0.10.in-addr.arpa/IN: serial number (61) received from master 10.0.20.10#53 < ours (62)
Dec 3 07:29:18 drevil named[2765]: zone 20.0.10.in-addr.arpa/IN: serial number (61) received from master 10.0.20.10#53 < ours (62)
Dec 3 07:37:54 drevil named[2765]: zone 20.0.10.in-addr.arpa/IN: serial number (61) received from master 10.0.20.10#53 < ours (62)
Dec 3 07:47:10 drevil named[2765]: zone 20.0.10.in-addr.arpa/IN: serial number (61) received from master 10.0.20.10#53 < ours (62)
Dec 3 07:56:11 drevil named[2765]: zone 20.0.10.in-addr.arpa/IN: serial number (61) received from master 10.0.20.10#53 < ours (62)

The workaround is simple: log onto the Windows server, open the DNS management console, find the zone and click “Increment” on the serial number (SOA) a couple of times, until it is higher than the slave’s copy. But it’s very annoying, especially when the damn thing reboots itself every month for Patch Tuesday!
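The slave only transfers a zone when the master’s SOA serial is “greater” under RFC 1982 serial number arithmetic, which compares 32-bit serials with wrap-around. Here is a small Python sketch of that comparison. The increments_needed helper is a hypothetical illustration of why one click isn’t enough when the master has dropped from 62 to 61.

```python
SERIAL_BITS = 32
MOD = 2 ** SERIAL_BITS
HALF = 2 ** (SERIAL_BITS - 1)


def serial_lt(s1, s2):
    """RFC 1982 'less than' for SOA serials (undefined when they differ by exactly HALF)."""
    s1, s2 = s1 % MOD, s2 % MOD
    return (s1 < s2 and s2 - s1 < HALF) or (s1 > s2 and s1 - s2 > HALF)


def increments_needed(master, slave):
    """Hypothetical helper: 'Increment' clicks until the slave sees a newer serial."""
    if serial_lt(slave, master):
        return 0                          # master is already ahead
    return (slave - master) % MOD + 1
```

With the serials from the log above, increments_needed(61, 62) is 2. Going from 61 to 62 only makes the serials equal, so the slave still has nothing newer to fetch until the serial reaches 63.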

It seems this was a documented problem in Windows 2000 Server (fixed in SP4): http://support.microsoft.com/kb/304653, but I can’t find any reference to the same problem in Server 2003.

How much memory is in my Linux system?

Sunday, November 4th, 2007 by Steve

I came across a really handy tool for listing the number of RAM sockets you have, and what’s currently in them all. The tool is dmidecode, and it’s installed by default on Debian Etch:

drevil:~# dmidecode -t memory
# dmidecode 2.8
SMBIOS 2.3 present.

Handle 0x1000, DMI type 16, 15 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: None
Maximum Capacity: 4 GB
Error Information Handle: Not Provided
Number Of Devices: 2

Handle 0x1100, DMI type 17, 23 bytes
Memory Device
Array Handle: 0x1000
Error Information Handle: Not Provided
Total Width: 64 bits
Data Width: 64 bits
Size: 256 MB
Form Factor: DIMM
Set: None
Locator: DIMM_1
Bank Locator: Not Specified
Type: SDRAM
Type Detail: Synchronous
Speed: 333 MHz (3.0 ns)

Handle 0x1101, DMI type 17, 23 bytes
Memory Device
Array Handle: 0x1000
Error Information Handle: Not Provided
Total Width: 64 bits
Data Width: 64 bits
Size: 256 MB
Form Factor: DIMM
Set: None
Locator: DIMM_2
Bank Locator: Not Specified
Type: SDRAM
Type Detail: Synchronous
Speed: 333 MHz (3.0 ns)
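For a quick summary rather than the full listing, the output can be boiled down to sockets, populated sockets and total size. The Python sketch below assumes the text format shown above: one “Memory Device” record per socket, with sizes reported in MB or GB and empty sockets reported as “No Module Installed”. Feed it the text from dmidecode -t memory (which needs root), for example captured with subprocess.run.

```python
import re

SIZE = re.compile(r"^\s*Size:\s*(\d+)\s*(MB|GB)\s*$")


def summarise_memory(dmidecode_output):
    """Return (sockets, populated, total_mb) from 'dmidecode -t memory' text."""
    sockets = populated = total_mb = 0
    for line in dmidecode_output.splitlines():
        if line.strip() == "Memory Device":   # one DMI type 17 record per socket
            sockets += 1
            continue
        match = SIZE.match(line)
        if match:                             # "No Module Installed" won't match
            populated += 1
            amount, unit = int(match.group(1)), match.group(2)
            total_mb += amount * 1024 if unit == "GB" else amount
    return sockets, populated, total_mb
```

For the listing above this gives (2, 2, 512): both sockets are filled, so any upgrade means replacing a 256 MB module rather than adding one.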

Thanks to MJ Ray and Stuart Langridge! Hopefully this will save me getting the screwdriver out in future.

Transparent webcaching on a Cisco 877

Sunday, October 28th, 2007 by Steve

After hours fighting with WCCP, I’ve given up and implemented the simpler solution: policy-based routing.

WCCP is a Cisco protocol for managing web caches. It’s really quite slick, as it only forwards requests to the cache(s) while they are alive (and sending “Here I Am” messages to the router). If the cache service fails, the router passes web requests directly through. WCCP also automatically handles some of the fiddlier configuration, such as not mangling requests from the cache itself. Unfortunately I couldn’t get it to work.

My Cisco 877 is currently running the latest 12.4T IOS (12.4(15)T1). Some of the web guides I found suggested “known working” versions of 12.3 or 12.4 mainline IOS, but only 12.4T versions of IOS are available for the 877. This leaves a lot of variables; I might open a TAC case and get Cisco on the job.

Policy-based routing works, but it doesn’t gracefully handle cache failure like WCCP. On the Cisco 877:

no ip cef

access-list 101 deny tcp host 10.0.20.1 any eq www
access-list 101 permit tcp any any eq www

route-map proxy-redir permit 10
match ip address 101
set ip next-hop 10.0.20.1

interface Vlan1
ip policy route-map proxy-redir
ip route-cache policy

Where 10.0.20.1 is the IP address of the squid webcache.
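The deny line in access-list 101 is what stops the cache’s own outgoing web requests being bounced straight back to itself. In an ACL used by a route-map, “deny” doesn’t drop the packet; it just means “route normally”. The Python sketch below models the first-match-wins evaluation for the two ACL entries above. It only looks at source address, protocol and destination port.

```python
from ipaddress import ip_address

WEBCACHE = ip_address("10.0.20.1")   # the squid box from the config above


def policy_routed(src, dst_port, proto="tcp"):
    """Would route-map proxy-redir send this packet to the webcache?

    Entries are checked top to bottom and the first match wins. A 'deny'
    here means "not policy routed", not "dropped".
    """
    src = ip_address(src)
    acl_101 = [
        # access-list 101 deny tcp host 10.0.20.1 any eq www
        ("deny", lambda: proto == "tcp" and src == WEBCACHE and dst_port == 80),
        # access-list 101 permit tcp any any eq www
        ("permit", lambda: proto == "tcp" and dst_port == 80),
    ]
    for action, matches in acl_101:
        if matches():
            return action == "permit"
    return False                      # implicit deny at the end of every ACL
```

So a client’s port-80 request is redirected, while the same request from 10.0.20.1 itself, or any non-HTTP traffic, follows the normal routing table.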

This only works when I turn off CEF (no ip cef). When CEF is enabled, the first packet of the TCP connection (SYN) is forwarded from router to webcache, the webcache replies directly to the client (SYN|ACK), but the third packet from client (ACK) does not get forwarded by the router to the webcache. All connections time out.

When policy-based routing is process-switched, forwarding works as expected: all packets arrive at the webcache and the caching is transparent. Fast-switched policy-based routing (ip route-cache policy) also works, which is an improvement on process switching, but the optimal solution would be CEF-based. I have a Cisco TAC case open to investigate this.

On the Linux 2.6 (Debian Etch) squid server:

# Disable rp_filters
echo 0 > /proc/sys/net/ipv4/conf/default/rp_filter
echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter
echo 0 > /proc/sys/net/ipv4/conf/lo/rp_filter
echo 0 > /proc/sys/net/ipv4/conf/eth0/rp_filter

# transparent webcaching
iptables -t nat -A PREROUTING -s 10.0.20.0/24 -d ! 10.0.20.0/24 -p tcp --dport 80 -j DNAT --to-destination 10.0.20.1:3128

10.0.20.0/24 is the subnet to cache, and 10.0.20.1 is the IP address of the webcache.
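As a sanity check on what that rule catches, here is a Python model of its match conditions. Only web traffic from the cached subnet to addresses outside it is rewritten to squid on port 3128. Requests to local servers pass through untouched. This is a sketch of the rule’s logic, not of netfilter itself.

```python
from ipaddress import ip_address, ip_network

CACHED_SUBNET = ip_network("10.0.20.0/24")
SQUID = ("10.0.20.1", 3128)


def prerouting_destination(src, dst, proto, dport):
    """Where the nat PREROUTING rule above would deliver a packet.

    Matches: source in 10.0.20.0/24, destination outside it, tcp, port 80.
    Matching packets are DNATed to squid; everything else is unchanged.
    """
    if (proto == "tcp" and dport == 80
            and ip_address(src) in CACHED_SUBNET
            and ip_address(dst) not in CACHED_SUBNET):
        return SQUID
    return (dst, dport)
```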

That’s it. HTTP requests are transparently forwarded to the squid server and cached.

I found these resources helpful when trying to get WCCP working:

And these were useful for policy based routing:

Upgrading Linux software RAID-1 array

Wednesday, October 24th, 2007 by Steve

I just finished upgrading my Debian Etch fileserver from 2×200GB IDE disks to 2×500GB SATA disks. I managed to keep the server running for nearly the entire time, by failing and hot-adding disks to the RAID-1 arrays. If I had room in the case for more than two disks it would have been even easier.

Here is the configuration BEFORE:

  • /dev/hda partitioned into hda1 (10GB), hda2 (1GB), hda3 (175GB)
  • /dev/hdc partitioned into hdc1 (10GB), hdc2 (1GB), hdc3 (175GB)
  • RAID-1 array md0 composed of hda1 and hdc1, mounted as /
  • RAID-1 array md1 composed of hda2 and hdc2, mounted as swap
  • RAID-1 array md2 composed of hda3 and hdc3, mounted as /home

I started the ball rolling by failing one partition from each RAID array:

mdadm --fail /dev/md0 /dev/hdc1
mdadm --fail /dev/md1 /dev/hdc2
mdadm --fail /dev/md2 /dev/hdc3

Then I powered down the server, disconnected and removed hdc, and added a new 500GB SATA disk to the SATA PCI card. It booted up fine with all three RAID arrays degraded. I used fdisk to partition the new SATA disk (/dev/sda) with identically sized partitions 1 and 2, and with the third partition taking up the remainder of the disk. I set all partition types to fd (Linux raid autodetect):

  • sda1 (10GB), sda2 (1GB), sda3 (454GB)

Then one at a time I hot-added these partitions to the running RAID arrays. This causes a background reconstruction, so it’s worth waiting for each to finish before starting the next:

mdadm --add /dev/md0 /dev/sda1
mdadm --add /dev/md1 /dev/sda2
mdadm --add /dev/md2 /dev/sda3

When all three were completely synced (cat /proc/mdstat to see the progress), I edited /etc/mdadm/mdadm.conf to change all references from /dev/hdcx to /dev/sdax. I then re-built the initramfs so it knew how to start the arrays at boot time:

update-initramfs -k all -c -t

I then powered down the server again, removed the last IDE disk (hda) and added the second SATA disk (sdb). At this point the system is unbootable, so I started from a rescue CD (actually the Debian Etch netinst cd, starting with the “rescue” boot option). Once I got a command prompt (Alt-F2 and Alt-F3 virtual consoles), I installed grub:

mount /dev/sda1 /mnt
chroot /mnt /bin/bash
nano /boot/grub/device.map

I edited the device.map so it looked like this:

(hd0) /dev/sda
(hd1) /dev/sdb

Then installed grub on the first SATA disk:

grub-install /dev/sda

I rebooted and grub successfully booted the server. As expected, all RAID arrays were in degraded mode. I used fdisk to re-partition the second SATA disk to match the first, then hot-added the mirrors to the RAID arrays (waiting for each re-sync to complete before starting the next):

mdadm --add /dev/md0 /dev/sdb1
mdadm --add /dev/md1 /dev/sdb2
mdadm --add /dev/md2 /dev/sdb3

Then I edited /etc/mdadm/mdadm.conf to update the partitions (for example, to sda1,sdb1). I rebuilt the initramfs again as above, rebooted to test that everything came up cleanly, and checked /proc/mdstat after the reboot to confirm all arrays were fully functional.

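So now the new disks are installed, but there’s no extra storage available because the ext3 filesystem is still the old size! (The md2 array itself may need growing first: an md RAID-1 array doesn’t automatically expand when its members get bigger, and mdadm --grow /dev/md2 --size=max takes it up to the new partition size.) I rebooted into single user mode, unmounted /home, then used resize2fs to expand the filesystem to use the whole array: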

e2fsck -f /dev/md2
resize2fs /dev/md2

One reboot later and voilà: 455GB usable in /home.

Then I followed this guide to install grub on the second RAID disk in a bootable way.

Software development resources

Tuesday, October 16th, 2007 by Steve

Here are some fantastic blog posts and resources I’ve come across in the last week:

Nokia Mail For Exchange on N73

Sunday, October 7th, 2007 by Steve

Nokia have released an updated version of their Mail For Exchange application which officially supports the N73. The 1.5 version explicitly blocked installation on N73, while the older 1.3 release worked but was unsupported.

If you run a Microsoft Exchange server, this is a must-have application as it enables seamless sync of contacts and calendar as well as full push email. And it’s completely free.

The e-series blog and Darla Mack have more information on the release and its new features, and you can download it here: http://businesssoftware.nokia.com/mail_for_exchange_downloads.php