DirectShow and USB: Finding the right capture filters

When capturing from USB devices via DirectShow, video and audio are processed mostly by two separate filters. The reason why two filters are necessary instead of one, is the USB class model (there exist one video class and one audio class) and the accordant driver interfaces that are provided by the operating system. However, DirectShow does not provide a function or method that would return the two filters for a capture device.

But with some know-how it’s possible to write such a function by ourselves. Today, I want to show you how it works. For this article I assume that you’ve already worked with DirectShow for some time, so I won’t start at the very beginning.

1. Enumerating the video capture filters

First, we create a list of all available video capture filters using IEnumMoniker and CLSID_VideoInputDeviceCategory.

While enumerating we save the “display name” of each IMoniker. By removing the prefix “@device:pnp:” we get the device path that could be used in CreateFile for example.

2. Enumerating the audio capture filters

Similarly, we do the same with audio capture filters using CLSID_AudioInputCategory.

We remove the prefix “@device:cm:{33D9A762-90C8-11D0-BD43-00A0C911CE86}\” from the “display name”. But this does not give us the device path, yet. We will get it in the next step.

3. Enumerating the waveInxxx devices

All audio capture devices are now enumerated using the waveInxxx API. The number of devices can be retrieved through the waveInGetNumDevs function.

The device’s name can be found in the structure member WAVEINCAPS.szPname when calling waveInGetDevCaps. The associated audio capture filter can be found by comparing the strings.

Now we can retrieve the device path for each matching filter. For this, we call waveInMessage passing the messages DRV_QUERYDEVICEINTERFACESIZE and DRV_QUERYDEVICEINTERFACE.

4. Retrieving the device instance

All that’s left to do is to get the device instance for each device path found (audio and video). We will use the setup API for this:

First, we get a device information list via SetupDiGetClassDevs. For video we use the category KSCATEGORY_CAPTURE and for audio we use KSCATEGORY_AUDIO.

Going through the category’s available device interfaces via SetupDiEnumDeviceInterfaces, we can retrieve detailed information for an interface by calling SetupDiGetDeviceInterfaceDetail.

If the path in SP_DEVICE_INTERFACE_DETAIL_DATA.DevicePath matches with one of our saved paths, then we can find the device instance in SP_DEVINFO_DATA.devInst.

5. Find the parent device instance

The last step is to find the parent device instance of each device instance found. The parent device instance can be retrieved by calling the CM_Get_Parent function from the Configuration Manager API.

If a video capture filter an an audio capture filter have the same parent, then we’ve found the pair of filters for a device! That’s it!

It sounds very complex but it isn’t. Attached you will find sample code which is shorter than this article!

Download:
CaptureDevice.zip

“Simple” and precise calculation of Euler’s number e

I announced it in the very beginning of this blog and today it has come true: The first article about mathematics!

I’d like to start with an algorithm, that can calculate Euler’s number e in a very simple way on a computer. The only required operations are integer addition and bit shifting.

The formula:
A simple calculation needs a simple formula and since computers like best to work with binary numbers, the formula should be designed for that.

It’s not necessary to search for such a formula. It’s enough to modify the formula we know from school a bit:

By using 2n instead of n, for any n we get a binary fraction as base and thus a number the computer likes.

The dimensions:
The base fraction needs 1 bit before the decimal point and n bits after.

Since we already know that e has a 2 before the decimal point, the result will need 2 bits for that.

The number of bits behind the decimal point will double each time we square. And since we have to square n times the total number of bits can be calculated with the following formula:

With n=10 we need about 10 kbits, with n=20 21 Mbits, with n=30 32 Gbits, with n=35 1.2 TBits and with n=58 (the maximum value when using 64-bit addressing) lousy 16.7 Ebits (Exabits).

The algorithm:
Squaring binary numbers is easy. But let’s take a look at the general case first:

We see, that each addend is squared and additionally all addends are multiplied pairwise. The latter operation is performed twice when expanding, thus the 2 on each term.

Doing the same on a binary number, we’ll get for example

From that, we can deduce two operations for squaring binary numbers:

1. Is bit n set to 1 in the input value, then set bit 2n to 1 in the output value.
2. Are both bits m and n of bit pair (m,n), m<n, set to 1 in the input value, then set bit m+n+1 to 1 in the output value.

“Setting the bit to 1″ implies setting with carry. If the bit is already set to 1 then it is set to 0 and the following bit is set to 1.

With that, we can describe the e calculation algorithm in pseudocode as follows:

void euler(int n)
{
  Binary input;
  Binary output = 0;

  output[0] = 1;
  output[n] = 1;

  for (int q = 0; q < n; ++q) {
    input = output;
    output = 0;

    for (int i = 0; i < input.length; ++i) {
      if (input[i]) {
        output.setBit(2 * i);

        for (j = i + 1; j < input.length; ++j) {
          if (input[j])
            output.setBit(i + j + 1);
        }
      }
    }
  }
}

Well, that's it. You don't need more than simple arithmetic to calculate e.

“vbr-new” even faster

In version 1.0.0.16 I’ve optimized again the “vbr-new” algorithm in “fpmp3enc”:

In the previous version, although the quantization phase was added to the Fiber Pool as a task with high priority, the task and its sub tasks were not executed until other tasks had finished their work or blocked.

With this behaviour the scheduler threads entered the “Stop and Go” phase very soon, where they are started and stopped in very short intervals. This caused a high performance loss. (This is also the reason why the performance is lower with 4 than with 3 threads.)

In the current version the tasks with normal priority now communicate with the Fiber Pool to check if tasks with high priority are ready. If so, they are interrupted and a task with high priority is started.

The threads enter the “Stop and Go” phase here, too, but later, because due to the new load balancing the threads can be used more efficiently. But it’s still not recommended to use more than 3 threads.

The previous version had already a speedup of 2.14. With the performed changes a speedup of 2.74 is achieved.

More is not possible for the moment… ;-)

Multicore MP3 Encoder: Development finished

Good news: The development of the MP3 encoder is finished.

Details, information and benchmarks can be found on the new web site.

Have fun!

Multicore MP3 Encoder: Asynchronous MDCT

Unfortunately, the progress of the encoder’s development is not as fast as I expected, however some small progress is recognizable.

The current version that can be downloaded here now executes the MDCT stage asynchronously.

Since MDCT is computed very fast compared to the other stages, the performance gain is not particularly high. It was about 8% on my system.

The port of the serial code to asynchronous code was done in multiple steps which I want to describe here:

Original situation:
The following figure shows the MDCT stage when running the serial code:

MDCT 1

First, the psycho acoustics (PSY) are computed from the incoming samples and afterwards the MDCT. The quantizer is started right after the MDCT computation.

The data dependencies are as follows: MDCT depends on PSY because the block types (long/short) are computed there. The quantizer depends on the MDCT results.

Step 1: Creating the MDCT task
First, all actions that have to do with MDCT are put into a separate task. This gives us a good starting point for the next parallelization steps:

MDCT 2

The main change is the movement of the MDCT state (the subband samples) from the global data structure to the task.

Although the task is executed in parallel, the used synchronization primitives allow serial execution only.

Step 2: Overlapping calculation of the subband samples
The previous figure shows that the calculation of the subband samples does not depend on PSY but only on the incoming samples. So additionally, the MDCT task synchronizes with them:

MDCT 3

Now, the subband samples of a frame can be calculated in parallel to PSY.

Step 3: Overlapping with the quantizer
The MDCT computes four “xr” arrays, two for each array. Since the quantizer does not need all four arrays at once it can be started right after the first are is computed. The remaining three arrays are computed in parallel:

MDCT 4

Step 4: Parallel execution of the MDCT
For each channel, the MDCT can be computed independently. So, the task is split into two sub-tasks:

MDCT 5

Step 5: Further overlapping with PSY
PSY computes the block types in two phases. Right after the first phase, it’s possible to compute the first two “xr” arrays:

MDCT 6

Step 6: Removing the MDCT container task
The MDCT task can be removed now because all work is done in the sub-tasks. Now, the sub-tasks access the block types and samples directly:

MDCT 7

CU

Multicore MP3 Encoder: Intermediate version released

An intermediate version of “fpMP3Enc” is now available which supports parallel MP3 encoding of multiple files (use ‘*’ as argument).

On a quad-core system the speedup of multiple file encoding was 3.75x (= 94% efficiency) compared to single file encoding. The LAME speeds were 89.2x and 23.8x.

LAME itself has a speedup of about 2 (multi: 48.5x, single: 24.6x). The reason for this poor performance is the uncontrolled, concurrent file access of the four running processes.

I’ve reduced the task communication by replacing the data queues with SequentialVirtualMemoryPipe classes. The following picture shows the current data flow:

Intermediate version 1.5

The next version will take a while, since I have to analyze first how to process frames in parallel.

Multicore MP3 Encoder: First milestone reached

Sorry for the delay of the English version…

OK, the first milestone has been reached and is now available for download. The sample’s name is “fpMP3Enc”.

As I wrote before, I’ve implemented the reading of the WAV file, the encoding and the writing of the MP3 file as separate tasks.
The communicate with each other through data queues, as shown below:

The performance was measured for the first time on an Intel Q9450, Vista x64 using 61 WAV files with a total duration of about five hours (2.99 GiB).
The result was a “speed-up” of 22.8x for “fpMP3Enc” compared to 25.6x for LAME 3.98.2.

For the next milestone I will reimplement the LAME code between lame_encode_buffer_interleaved and lame_encode_mp3_frame in a separate task. Maybe this will increase the performance a bit…

Development of a multicore MP3 encoder started

The next case study for “Fiber Pool” will be a multicore-capable MP3 encoder. The code will be based on LAME 3.98.2 and step by step it will be replaced by code using “Fiber Pool”.

Since the whole project will take some time, I will give updates here after reaching a milestone. So visit this blog regularly!

Here is the first milestone:
First, the application similar to “lame.exe” must be written to get a starting point for performance comparisons. It will have the following three tasks: the WAV file reading task, the MP3 writing task and the LAME encoder task.
In theory, we could keep three threads busy with that. But, since only the MP3 encoder needs CPU time, we won’t get any parallelization in this milestone.

About the performance:
Although performance will take a main part in this study, for now, it will NOT be comparable with a LAME binary built with an Intel compiler. This will be possible in a later milestone which will have SSE optimizations.

Creating ZIP archives

Fiber Pool 1.0.0.7 is now available for download.

After the performance comparison of my “inflate” implementation remained unnoticed, I sat down and also integrated the “deflate” algorithm for compressing.

You can take a look at the product homepage to see how it works with Fiber Pool.

Besides, the compression algorithm “deflate” was not modified!

Now, the obligatory benchmarks:

The test system was configured as follows:
- Intel Q9450, Win Vista Home Premium 64-Bit, 8GiB RAM, no Superfetch
- Drive C: Samsung HD502IJ (SATA)
- Drive E: Seagate ST31500341AS (SATA)
- Drive F: Western Digital MyBook (USB)

Test 1: Compress the “Microsoft SDK” folder on the same drive (Drive C: 1,5 GiB/15.400 files).

Compress with…
…fpzip x64: 244s
…7z 4.65 x64: 322s (7z a -tzip -r …)
…WinRAR 3.90 Beta 1 x64: 379s (winrar a -afzip -r …)

7z is slower by 32%, WinRAR by 55%.

Test 2: Compressing MP3 files on the same drive (Drive E: 1,95 GiB/286 files).

Compress with…
…fpzip: 53s
…7z: 134s
…WinRAR: 157s

7z is slower by 153%, WinRAR by 196%.

Test 3: Compressing image files from F: to E: (1,14 GiB/937 files).

Compress with…
…fpzip: 57s/45s*
…7z: 59s
…WinRAR: 75s

* The file I/O scheduler used two threads in the second test.

While the time of 7z is about the same WinRAR is about 32% slower.

Test 4: Compressing “Microsoft SDK” from C: to E:.

Compress with…
…fpzip: 144s/140s*
…7z: 181s
…WinRAR: 183s

Here both competitors are about 25% slower.

Test 5: Compressing MP3 files from E: to C:.

Compress with…
…fpzip: 52s/49s*
…7z: 54s
…WinRAR: 135s

WinRAR is about 160% slower.

Conclusion:
In my tests WinRAR has been revealed as a lame duck. The main reason seems to be that the whole process is performed on one single thread. Another reason is the synchronous I/O for reading and writing. Certainly, the memory use is minimal (about 30 MiB), but do you want this if you have 8 GiB?
The “deflate” implementation itself is really fast: On a weaker test system with one HT processor WinRAR was the fastest although it run on only one thread.

7z uses multithreading and therefore it is much faster than WinRAR in some tests. I haven’t investigated yet, what the threads do in detail.
Looking at tests 2 and 5 you can see that the good processing speed is decelerated by the synchronous I/O (test 2).

fpzip not only uses multithreading but also uses file-spanning asynchronous I/O. The Fiber Pool’s memory management classes control the memory utilization.
The team play between processor cores, I/O and memory becomes visible in test 2: Because of the long search for duplicates the drive is faster than the compression algorithm in this case and hence, it can read more files asynchronously into system memory. With more files more cores can be utilized for compression.

Note: fpzip is just a sample application which is not free of bugs and which has not implemented the full PKZIP specification. There are many things that can be made better and the optimum has not been reached yet. If there’s somebody there who would like to continue the development of fpzip, he/she can contact me.

Welcome to the English version of my blog!

Hi!

This is the English version of my German blog.

I will try to keep my posts up-to-date for both languages.

Have fun!