• Interview on NetBSD 4

    I'm happy to have been part of the "Waving the flag: NetBSD developers speak about version 4.0" interview.  Enjoy! [Continue reading]

  • BenQ RMA adventures, part 1

    A couple of weeks ago, I called BenQ's RMA service to ask for a fix for my new FP241W Z. I have problems with the HDMI digital input: the monitor crops part of the image on each side and scales it slightly up to fill the whole screen. It turns out that there is a firmware upgrade for this specific monitor that adds a configuration option to turn off overscan, which should effectively resolve this problem. So I called them to get the firmware in my monitor updated.

    The operator was very polite and helpful. After asking for the details of the monitor and the problem I was having, she confirmed that the problem I was describing was indeed due to an outdated firmware and that they'd fix it for free. I gave the necessary data and then —after 20-something minutes— they told me that a carrier would come to my location and pick the monitor up for delivery to the technical center. But first, the carrier had to contact me to set up a date for the pickup.

    Two weeks later, no one had called me or sent me any email. In fact, they told me that I should be receiving an email, so I simply waited patiently because, when it comes to email, things can be very slow. I even thought that I might have deleted that message as spam. But two weeks was already too much.

    So yesterday afternoon, I called the RMA service again and explained the situation. They promised to fix it that evening, and they did: they called me this morning asking for another detail about the monitor, so I assumed things were being dealt with.

    And this afternoon, at around 15.30, I received a funny SMS from the carrier that had to pick the monitor up. It basically said: "We will come to pick up the package on the 30th (aka today). It will be between 9.00 to 12.00 or between 12.00 to 19.00." Great! Could it be any more inaccurate? First of all, the mentioned times cover practically the whole day. Second, most of that time had already passed. And third, I was getting the notice the same day they were coming.

    I immediately rang home and asked my mother if somebody had gone there. As they had not shown up yet, I returned home quickly, packed the monitor up in a hurry and, soon after, the carrier arrived. It was just a matter of luck that I was able to deal with this on time!

    This isn't really a critique; I just want to explain how the process is going as things progress. The BenQ service has been very responsive and polite so far, and I think I can only blame the carrier service. Let's hope things move fast and well from now on. [Continue reading]

  • A request to virtualization software developers

    Here is a request for a feature I have not yet seen in any virtualization application — I have used Parallels Desktop 2, VMware Fusion 1.1 and another product I can't speak of yet — that I'd love to have. It'd make things so much easier for me... So here is an open request, just in case one of the developers of free alternatives (e.g. VirtualBox) reads it and decides to get ahead of the competition by implementing it.

    Before explaining my feature request, consider a server on your network on which you run multiple virtual machines (VMs) for whatever purpose. These machines are exported to the network using bridged networking so that other computers on the network can access them transparently, as if they were physical computers. To make this setup trivial, you have a DHCP server on your network that hands out static IP addresses to these virtual servers, and you also have a DNS server that maps these addresses to static names. This way, users on your network can access the virtual machines by simply spelling out their host names.

    Now let's move to the laptop world, where you are connected to different networks all the time (e.g. at home or at work) or to no network at all. Here I will assume that you want to access the VMs exclusively from your laptop. In this case, you should not use bridged networking because you'd be exporting all your virtual machines to a possibly untrusted network. And you cannot rely on the external DHCP or DNS servers to deal with static IP addresses or host names for you, because in many situations you have no control over them.

    Your best bet is to use shared networking to configure your VMs (or host-only networking if they needn't access the outside world). But if you do so, your VMs will get random IP addresses because you have no control over the DHCP server bundled into the virtualization application. And as a result, you cannot assign host names to them.

    As a workaround, you can manually configure each operating system running in a VM to have a static IP (bypassing DHCP), then add an entry to the host's /etc/hosts file to assign a host name to the guest OS and, finally, add an entry to the guest's /etc/hosts file to assign a host name to the host OS. Which is painful.

    In my ideal world, virtualization applications would have the ability to fine-tune the bundled DHCP server to hand out specific addresses to the virtual machines, and a way to specify DNS host names for them, all from the configuration interface and without having to touch any configuration file on the host system (nor on the guest, for that matter). E.g. add a little configuration box for the IP address and host name of the guest OS alongside the box that already exists to configure the MAC address. Then have the bundled DHCP server hand out the appropriate entries to the guests, add an entry to the host's /etc/hosts and provide a virtual DNS server to the guests so that they can resolve each other's names.

    A use case for this? I have two VMs that I carry around on my MacBook Pro, that I use very frequently and that I do not want to expose to the outside network at all. One is a Fedora 8 installation and the other a NetBSD one. I start them up from the graphical interface and then access them exclusively through SSH. But in order to use SSH reliably, I need to perform the manual steps above to set up a host name for each of them; otherwise using SSH is a pain.

    I am also trying to set up an automatic build farm for ATF (composed probably of 10-15 VMs), and the need to set all these details manually is extremely tedious. [Continue reading]
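    To make the manual workaround above concrete, here is a minimal sketch of the /etc/hosts entries involved; all addresses and host names are hypothetical examples, since the shared-network subnet differs between virtualization products.

```
# On each guest, configure a static IP on the shared network
# (bypassing the bundled DHCP server), then map names on both sides.

# Host's /etc/hosts: reach the guests by name from the laptop.
192.168.100.10   fedora-vm
192.168.100.11   netbsd-vm

# Each guest's /etc/hosts: reach the host and the other guest.
192.168.100.1    host
192.168.100.10   fedora-vm
192.168.100.11   netbsd-vm
```

    With these in place, ssh fedora-vm works from the laptop regardless of which external network it is on; the point of the feature request is that the virtualization application could generate all of this itself.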

  • Testing the process-tree killing algorithm

    Now that you know the procedure to kill a process tree, I can explain how the automated tests for this feature work. In fact, writing the tests was the hardest part, due to all the race conditions that popped up and due to my rusty knowledge of tree algorithms.

    Basically, the testing procedure works like this:

    1. Spawn a complete tree of processes based on a configurable degree D and height H.
    2. Make each child tell the root process its PID so that the root process can have a list of all its children, be they direct or indirect, for control purposes.
    3. Wait until all children have reported their PID and are ready to be killed.
    4. Execute the kill-tree algorithm on the root process.
    5. Wait until the children have died.
    6. Check that none of the PIDs gathered in point 2 are still alive (they could be, reparented to init(8), if they were not properly killed). If some are, the recursive kill failed.

    The tricky parts were 3 and 5.

    In point 3, we have to wait until all children have been spawned. Doing so for direct children is easy because we spawned them ourselves, but indirect ones are a bit more difficult. What I do is create a pipe for each of the children that will be spawned (because, given D and H, I know how many nodes there will be) and then each child uses the appropriate pipe to report its PID to the parent when it has finished initialization and thus is ready to be safely killed. The parent then just reads from all the pipes and gets all the PIDs.

    But what do I mean by safely killed? Preliminary versions of the code just ran through the children's code and then exited, leaving them in zombie status. This worked in some situations but broke in others. I had to change this to block all children in a wait loop and then, when killed, take care to do a correct wait for all of their respective children, if any. This made sure that all children remained valid until the attempt to kill them.

    In point 5, we have to wait until the direct children have returned so that we can be sure that the signals were delivered and processed before attempting to see if there is any process left. (Yes, if the algorithm fails to kill them, we will stall at that point.) Given that each child can be safely killed as explained above, this wait recurses along the whole process tree, making sure that everything is cleaned up before we do the final checks for non-killed PIDs.

    This all sounds very simple and, in fact, looking at the final code, it is. But it certainly was not easy to write, basically because the code grew in ugly ways and the algorithms were much more complex than they ought to be. [Continue reading]
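    The procedure above can be sketched in a few lines of Python; this is an illustrative reconstruction, not ATF's actual C++/shell code, and it deviates from the real tests in two labeled ways: it uses a single shared pipe (small writes to a pipe are atomic, so one pipe is enough for a sketch) instead of one pipe per child, and it kills the collected PIDs directly instead of running the recursive kill-tree algorithm under test.

```python
import os
import signal
import struct

D, H = 2, 2                          # degree and height: 2 + 4 = 6 descendants
N = sum(D ** h for h in range(1, H + 1))

r, w = os.pipe()                     # shared pipe; 4-byte writes are atomic

def child_main(level):
    if level < H:
        spawn(level + 1)             # build our own subtree first

    def on_term(signum, frame):
        while True:                  # "safely killed": reap our own
            try:                     # children before exiting, so no
                os.wait()            # zombies linger behind us
            except ChildProcessError:
                break
        os._exit(0)

    signal.signal(signal.SIGTERM, on_term)
    os.write(w, struct.pack("i", os.getpid()))   # report: ready to be killed
    while True:
        signal.pause()               # block until a signal arrives

def spawn(level):
    for _ in range(D):
        if os.fork() == 0:
            child_main(level)        # never returns

spawn(1)                             # point 1: build the tree
pids = [struct.unpack("i", os.read(r, 4))[0]
        for _ in range(N)]           # points 2 and 3: collect all PIDs

for p in pids:                       # point 4 (simplified: the real test
    os.kill(p, signal.SIGTERM)       # runs the kill-tree algorithm here)
for _ in range(D):
    os.wait()                        # point 5: reap the direct children

def alive(pid):
    try:
        os.kill(pid, 0)              # signal 0: existence check only
    except ProcessLookupError:
        return False
    return True

survivors = [p for p in pids if alive(p)]   # point 6: nothing may survive
```

    Because every process installs its SIGTERM handler before reporting its PID, and the root only starts killing once all N PIDs have arrived, each process is guaranteed to be in its safe-to-kill state when the signal comes in; that is exactly the synchronization the pipe provides.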

  • How to kill a tree of processes

    Yesterday I mentioned the need for a way to kill a tree of processes in order to effectively implement timeouts for test cases. Let's see how the current algorithm in ATF works:

    1. Stop the root process by sending a SIGSTOP to it so that it cannot spawn any new children while being processed.
    2. Get the whole list of active processes and filter it down to those that are direct children of the root process.
    3. Iterate over all the direct children and repeat from 1, recursively.
    4. Send the real desired signal (typically SIGTERM) to the root process.

    There are two major caveats in the above algorithm. First, point 2: there is no standard way to get the list of processes of a Unix system, so I have had to code three different implementations so far for this trivial requirement: one for NetBSD's KVM, one for Mac OS X's sysctl kern.proc node and one for Linux's procfs.

    Then comes the worst one, in point 4. Some systems (Linux and Mac OS X so far) do not seem to allow one to send a signal to a stopped process. Well, strictly speaking they allow it, but the second signal seems to be simply ignored, whereas under NetBSD the process's execution is resumed and the signal is delivered. I do not know which behavior is right.

    If we cannot send the signal to the stopped process, we can run into a race condition: we have to wake it up by sending a SIGCONT and then deliver the signal, but in between these two events the process may have spawned new children that we are not aware of.

    Still, being able to send a signal to a stopped process does not completely resolve the race condition. If we are sending a signal that the user can reprogram (such as SIGTERM), the process may fork another one before exiting, and thus we'd not kill this new child. But... well... this is impossible to resolve with the existing kernel APIs as far as I can tell.

    One solution to this problem is killing a timed-out test by using SIGKILL instead of SIGTERM. SIGKILL works in any case because it means die immediately, without giving the process a chance to intercept it. Therefore SIGCONT would not be needed at all —because you can simply kill a stopped process and it will die immediately as expected— and the process would not have a chance to spawn any more children after it had been stopped.

    Blah, after writing this I wonder why I went with all the complexity of dealing with signals that are not SIGKILL... call it over-engineering if you want... [Continue reading]
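    The four steps above can be sketched in Python; this is an illustrative reconstruction rather than ATF's real implementation, and it only includes the Linux procfs backend for step 2 (the NetBSD KVM and Mac OS X sysctl variants are omitted). Note the SIGCONT sent after the real signal: on systems that hold signals for stopped processes, the pending signal is delivered once the process resumes.

```python
import os
import signal

def list_children(pid):
    """Step 2: direct children of `pid`, via Linux's procfs (the
    fourth field of /proc/N/stat is the parent PID)."""
    children = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/stat" % entry) as f:
                stat = f.read()
        except OSError:
            continue                 # process vanished while we looked
        # The command name sits in parentheses and may itself contain
        # spaces, so split only after the last closing one.
        ppid = int(stat[stat.rindex(")") + 2:].split()[1])
        if ppid == pid:
            children.append(int(entry))
    return children

def kill_tree(pid, sig=signal.SIGTERM):
    os.kill(pid, signal.SIGSTOP)     # step 1: freeze, so no new forks
    for child in list_children(pid): # steps 2 and 3: recurse
        kill_tree(child, sig)
    os.kill(pid, sig)                # step 4: the real signal...
    os.kill(pid, signal.SIGCONT)     # ...delivered once it resumes
```

    As the post explains, this is still racy for catchable signals: a handler for the signal may fork again after the SIGCONT. Passing SIGKILL as the signal closes that hole, because a stopped process dies from SIGKILL immediately, without ever resuming.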

  • Implementing timeouts for test cases

    One of the pending to-do entries for ATF 0.4 is (was, mostly) the ability to define a timeout for a test case after which it is forcibly terminated. The idea behind this feature is to prevent broken tests from stalling the whole test suite run, something that is already needed by the factor(6) tests in NetBSD. Given that I want to release this version next weekend, I decided to work on this instead of delaying it because... you know, this sounds pretty simple, right? Hah!

    What I did first was to implement this feature for C++ test programs, and I added tests for it. So far, so good. It effectively was easy to do: just program an alarm in the test program driver and, when it fires, kill the subprocess that is executing the current test case. Then log an appropriate error message.

    The tests for this feature deserve some explanation. What I do is program a timeout and then make the test case's body sleep for a period of time. I try different values for the two timers and, if the timeout is smaller than the sleeping period, then the test must fail or otherwise there is a problem.

    The next step was to implement this in the shell interface, and this is where things got tricky. I did a quick and dirty implementation, and it seemed to make the same tests I added for the C++ interface pass. However, when running the bootstrap test suite, it stalled at the cleanup part. Upon further investigation, I noticed that there were quite a lot of sleep(1) processes running when the test suite was stalled, and killing them explicitly let the process continue. You probably noticed where the problem was already.

    When writing a shell program, you are forking and executing external utilities constantly, and sleep(1) is one of them. It turns out that in my specific test case, the shell interpreter was just waiting for the sleep subprocess to finish (whereas in the C++ version everything happens in a single process). And killing a process does not kill its children. There you go. My driver was just killing the main process of the test case, but not everything else that was running; hence, the test case did not die as expected, and things stalled until the subprocesses also died.

    Solving this was the fun part. The only effective way to make this work is to kill the test case's main process and, recursively, all of its children. But killing a tree of processes is not an easy thing to do: there is no system interface to do it, there is no portable interface to get a list of children, and I'm as yet unsure whether this can be done without race conditions. I reserve the explanation of the recursive-kill algorithm I'm using for a future post.

    After some days of work, I've got this working under Mac OS X and have also got automated tests to ensure that it effectively works (which were the hardest part by far). But, as I foresaw, it fails miserably under NetBSD: the build was broken, which was easy to fix, but now it also fails at runtime, something that I have not diagnosed yet. Aah, the joys of Unix... [Continue reading]
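    There is a middle-ground technique worth mentioning here: if the driver puts each test case in its own session, a single killpg() also reaches helpers such as sleep(1). This is a hypothetical Python sketch, not what ATF does (a descendant can escape by changing its own process group, which is one reason to prefer the recursive kill); the function name and parameters are my own.

```python
import os
import signal
import subprocess

def run_with_timeout(argv, timeout):
    """Run a command; if it exceeds the timeout, kill it and
    everything it spawned by signalling its whole process group."""
    child = subprocess.Popen(argv, start_new_session=True)
    try:
        return child.wait(timeout=timeout)   # exit status on success
    except subprocess.TimeoutExpired:
        # The child leads its own session and process group, so its
        # PID doubles as the group ID for every helper it forked.
        os.killpg(child.pid, signal.SIGKILL)
        child.wait()                         # reap the dead leader
        return None                          # None means: timed out

# Mirrors the stalled scenario: the shell sits waiting on sleep(1),
# and killing just the shell would leave sleep running.
result = run_with_timeout(["sh", "-c", "sleep 60 & wait"], 1)
```

    Here result is None: the shell and its background sleep both die after one second instead of stalling the run for a minute.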

  • Got a BenQ FP241W Z flat panel

    As I already mentioned, I was interested in buying a 24" widescreen monitor for both my laptop and PlayStation 3. I considered many different options but, based on my requirements (1920x1200, 1:1 pixel mapping, dual HDMI/DVI-D inputs), I ended up choosing the BenQ FP241W Z (yeah, did it again).

    This thing is gorgeous, as the following photos will show you. Lots of screen real estate to work with — the ability to have many different, non-overlapping editors and terminals open at once is very convenient — and great for watching videos. But it has a "small" problem (I want it fixed!) that I'll explain after them...

    So here are two photos of the MacBook Pro working in clamshell mode, connected to the new monitor:

    And here are a couple of images showing the PlayStation 3 in action:

    OK, this last image is the one I wanted to discuss. It shows the "PlayStation Store", accessible directly from an option in the XMB interface. It is easy to see that the image is cropped on all four sides: some letters are cut, and the top and bottom buttons are shown extremely close to the screen borders. This is not what I expected.

    Even more, booting Linux reports that the framebuffer's dimensions are 1688x964 even though the screen says it is working in 1080p mode (1920x1080). If I force Linux to go to full 1080p, then the terminal is also cropped on all four sides, making it unusable. According to this thread, this is caused by the monitor assuming that the HDMI input has overscan, hence it crops the image. (Note that the image is also being slightly scaled up to fill the whole screen, because the visible area is smaller than the displayed one! And I certainly don't want that.)

    It looks like a firmware update released in May 2007 adds an Overscan option to the settings, which allows you to disable this feature and thus get the whole image. But unfortunately my monitor was manufactured in April 2007, so it has the old firmware. Grr. I will call BenQ support tomorrow and see if they can do anything about it (I guess they'll be able to do a firmware upgrade, but they may need to take the monitor for several days^Wweeks). Otherwise I may end up returning this unit. Heck, I searched for 1:1 pixel mapping like crazy and now I find this other, unexpected problem. No way.

    Other than that, great display. Now, if only I had a Mac Pro to accompany it... ;-) [Continue reading]

  • 24" widescreen comparison

    As promised in the previous post, Choosing a 24" widescreen monitor, here comes the brief analysis I did before deciding which monitor to buy. Refer to the comparison table (or the PDF version if the XHTML one does not work for you) for more details. I'm linking this externally because putting it here, in this width-limited page, would be unsuitable.

    The data in that table has been taken from the official vendor pages when possible, even though those pages failed to list some of the details. I tried to look for the missing ones around the network and came up with, I think, fairly trustworthy data. But of course some of it may be wrong.

    By the way, be especially careful when comparing the Contrast ratio and Response time fields. Each vendor likes to advertise these in different ways, so you cannot really compare them without knowing what each value really means (and I don't, because they generally don't specify it).

    Anyway, even though the table is not complete (some fields are marked with N/A because I could not easily come up with an answer), I hope it will be useful to some of you. [Continue reading]

  • Interferences in CVS tagging

    Once again, CVS shows its weaknesses. Last night I committed a fix to pkgsrc and soon after I noticed I had a prior e-mail from Alistair, a member of the PMC and the one responsible for the preparation of pkgsrc releases, asking developers to stop committing to the tree because he was going to tag it for pkgsrc-2007Q4. It turns out that my fix did not get into the branch because the directory it went into (devel/monotone) had already been tagged. Had I committed the fix to, say, x11/zenity, it would have gone into the branch. Or worse, had I committed a fix that spanned multiple files, some of them would have made it into the branch and others not.

    So what, am I supposed to read e-mail before I can do a commit? What if the mail does not arrive on time? What if the commit had affected many more directories and some of them had already been tagged but some not?

    This is just another example of CVS showing its limitations and stupidities. Given that each file's history is stored independently — i.e. there are no global changesets — the only way to tag the repository is to go file by file and set the tag on each one. And then, you need to check which revision of each file is the one to be tagged. I do not know why this is so slow even when you do an rtag on HEAD (so the one doing the work is the server alone), but in the case of pkgsrc this process took more than 2 hours!

    OK, OK, I'm hiding the truth. The thing is, there are some ways around this: for example, using the tag command will tag the exact revisions you have in your working copy, and passing a date to rtag will tag the repository based on the provided timestamp. This way you ensure that the tagging process will be consistent even if people keep committing changes to the tree. However, the first of these commands requires a lot of network communication, and the second puts a lot of stress on the server, making the command even slower (or that's what I've been told).

    In virtually all other version control systems that support changesets, a tag is just a name for a given revision identifier, and defining this tag is a trivial and quick process. Well, Subversion is rather different because tags are just copies of the tree, but I think it deals with those efficiently. [Continue reading]

  • Welcome, 2008

    It is a new year again. Let's see if I can, at least, accomplish one goal: I should try not to delay stuff as much as I have been doing until now. This especially refers to replying to some e-mails and working on some stuff I once started but have not had the time to finish (bad excuses, I know). The clearest example that comes to my mind is Boost.Process, for which I have already got many status requests... but there are also some tiny pet projects such as genfb support for NetBSD/mac68k and whiteouts for tmpfs. Of course, there is also the conversion of more NetBSD tests to ATF.

    But, and this is a big but, the first semester of the year will probably keep me extremely busy with my Ph.D. courses... and, to make things worse, when I get home in the evening I'm so tired that I don't want to do more work. I will have to try to organize tasks a bit better so that there is time for everything.

    Anyway, happy new year to everyone! And thanks to your continuous visits and support, this is the 400th post :-) [Continue reading]