How do camera sensors work?

This question is not about sensor size (light wavelength) or the quality of the glass.

The question is about what happens between light entering the sensor and the data coming out of the sensor hardware. Namely:

– if a camera is 8 Mpixel, and assuming 8 bits per colour, 8M*3 ≈ 22.89 MB, meaning almost 23 MB of data per frame.

– if it’s capturing 1080p at 30fps, that’s 1920*1080*3*30 ≈ 178 MB/sec

– if it’s a system like the GoPro, which captures the whole sensor (12MP) and then downscales to 1080p, how can it handle roughly 1 GB/sec of data?

– is there any hardware inside the sensor that does downscaling (e.g. reducing the 1 GB to the 178 MB), or conversion from RGB to YUV, so it transmits 10 or 12 bits instead of 24 bits per pixel?

– who exactly does the sweeping of each pixel? Is there a double buffer on the sensor, or does the host do it? This is about the effect (rolling shutter, I believe, is the English name) that happens when you take a photo out of the window of a moving car and vertical objects come out diagonal. The point is that the faster something sweeps the sensor, the less this artifact shows.
The overall objective is to understand the quality of a sensor within a whole product, and to be able to factually distinguish a good photo from a less good one, not only by the sensor size, the glass quality, or the CPU’s capacity to compress the JPEG or MPEG video, but also by the bandwidth between the pixels and the buffer on the main CPU and memory.
Update: if possible, compare the three (for me) main groups of cameras: smartphones (ARM-powered), dedicated ARM devices like the GoPro, and fully dedicated hardware like DSLRs.

Dave Haynie, Electrical engineer and part-time mad scientist

Written 31 Oct 2014

Ok… I’m starting a bit lower level than some of the others. A camera sensor is composed of electronic light sensitive elements of some kind. There are a bunch of different light sensitive components that have been invented over the years, but modern sensors pretty much all use photodiodes. 
A diode is a component that normally works like a one-way switch — electricity can go through it one way, but not the other. But there are a bunch of specialized diodes that use the basic properties in novel ways… some can act as voltage references, some can put out light (LEDs… I’m lighting my house with these), some can act like variable capacitors. And some will pass current proportional to the number of photons hitting them. In fact, for every photon that hits a photodiode, one electron is transferred. 
So now that you have the photodiode, you make a sensor by building a big array of these photodiodes. If you dig into camera documentation, you can usually find the size of a “pixel”, which isn’t really the size of a pixel at all, but the size of a photosite. The photosite contains the photodiode, and that’s the only photo sensitive element, but it may also contain other stuff, like the electronics that drive the photodiodes. So like a human eye, you may have some stuff in the way, reducing the light actually reaching the photodiode. More recently, there are “back side illuminated” sensors, which put the photodiode on the back and the electronics on the front, more like the superior eye of the octopus. But either way, now you have an array of photo diodes… the sensor itself. 
Of course, that doesn’t do anything yet. There’s support electronics to power and manage all those photodiodes. At some point, there will be a programmable gain amplifier, like a volume control, that boosts the signal from the photodiode as necessary. That feeds an analog-to-digital converter. Most ADCs in cameras these days produce a 12 or 14-bit output. 
Now, let’s look at photodiodes themselves… they don’t know a thing about color, and they usually respond to a pretty wide spectrum. If you wanted a monochrome camera, you might be ok with the sensor as is, maybe with an infrared filter on it to prevent IR light from being recorded. But you want color. So in most sensors, there’s an array of color filters, one per photodiode… you’ve seen the discussion of Bayer sensors in this article already. Some companies use “non-Bayer” sensors, in that they’re using a matrix other than RGBG… Canon used CMYG in some early cameras. Fujifilm and Sony have used some alternate color configurations. Panasonic has a technology that uses micro-color-splitters rather than color filters, to avoid loss. See, a perfect spectral filter will cut out about 66% of the light going to each photodiode. To avoid this, professional video cameras with small sensors actually use three sensors and a dichroic prism, to split light into separate R, G, and B beams going to separate R, G, and B sensors. Along with the filter, most sensors also have a micro-lens array, to focus light on the photodiode and hopefully get around the fact that the photo sensor isn’t the whole photosite. 
Ok, so there’s a sensor. It’s controlled by a microprocessor, which can control when it’s collecting light, when it’s not, etc. Some cameras use that function to set the exposure, but more professional models use a separate mechanical shutter to make the exposure. It’s possible you’ll actually get two exposures for every one you take. Like all electronics, photodiodes have a “dark current”… basically, there’s energy in the system due to heat… that’s also where noise in a low-light photo comes from. The camera may shoot a totally dark photo, then the one you’re after with light and everything, and use that first one to establish “zero” for each pixel. 
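That dark-frame trick is simple enough to sketch in a few lines. This is a toy, noise-free model with made-up numbers (real sensors add shot noise on top, so subtraction only removes the fixed offset):

```python
# 2x2 toy sensor: per-photosite dark-current offsets (a "hot" pixel at [0][1]).
dark_current = [[6, 18], [9, 7]]
scene = [[120, 45], [200, 80]]          # the light we actually want

# An exposure with the shutter closed records only the dark current...
dark_frame = [row[:] for row in dark_current]

# ...while the real exposure records scene + dark current.
light_frame = [[s + d for s, d in zip(srow, drow)]
               for srow, drow in zip(scene, dark_current)]

# Subtracting the dark frame re-establishes "zero" for each pixel.
corrected = [[l - d for l, d in zip(lrow, drow)]
             for lrow, drow in zip(light_frame, dark_frame)]
print(corrected)   # [[120, 45], [200, 80]] -- back to the scene values
```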
Once exposed, the microprocessor will read the sensor’s data. Older CCDs were basically a gigantic bucket brigade, where only one pixel was read at a time, but these days, most sensors have multiple access lines and can read very fast. The data that comes directly off the sensor, with all that weird filtering that may be very specific to only that one camera… that’s what the camera will record in a raw file. And as I mentioned, pixels are usually 12 or 14 bits of resolution… that’s one reason professionals like raw… not only is it not compressed like a JPEG, but it has absolutely all of the information the camera recorded. 
As you pointed out, that may be a great deal of data, especially if you’re doing video. But the processors and interfaces are designed specifically for that kind of data. If I have a 20Mpixel camera, I’m probably reading 30-35 MB per photo from the sensor. And if I can shoot 10fps, that’s 300-350MB/s I’d have to store if I were writing raw files continuously… which might not actually happen. Now of course, 300MB/s is nothing for a fast interface… PCI Express runs at 10Gb/s these days (over 1GB/s) per link (four wires… you could use this in an embedded product… in fact, I do, though nothing quite as integrated as a modern camera). But when you shoot a photo, it’s getting some compression (lossless with raw, way more with JPEG), and then it’s going into a large, fast RAM buffer in the camera, queuing up for a write to flash memory. 
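To put numbers on that, here’s the back-of-envelope arithmetic; the 20 Mpixel and 14-bit figures are just the example values from above, and the packing is idealized:

```python
# Rough data-rate arithmetic for the 20 Mpixel example.
# Assumption: 14-bit samples, tightly packed, one sample per photosite.
pixels = 20_000_000
bits_per_sample = 14

raw_photo_bytes = pixels * bits_per_sample / 8
print(f"one raw photo: {raw_photo_bytes / 1e6:.0f} MB")   # 35 MB

fps = 10
burst_rate = raw_photo_bytes * fps
print(f"10 fps burst: {burst_rate / 1e6:.0f} MB/s")       # 350 MB/s

# A PCIe-class link at ~1 GB/s swallows that with room to spare.
print(burst_rate / 1e9)                                   # 0.35
```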
And the process is basically identical for every camera. Sure, some don’t shoot stills, some don’t shoot video, some don’t record raw, some may only record raw, but the things that happen are the same for a GoPro as for a DSLR as for a P&S as for a cinema camera. 
Now, of course, if I shoot video at, say, 1080p60, that’s quite a bit of data, too. But not as much as you think. Even with that 20Mpixel sensor, I’m never going to store 20Mpixel. Some cameras can “line skip”… they only need to read out part of the image. Then, the on-camera microprocessor (which is also a photo/video image processor) will use a combination of software and dedicated hardware to crunch that image to the 2Mpixel you need for video, then compress it for a write to your flash card: to Motion JPEG, to AVC-Intra (kind of the MJPEG of the early 21st century), or full IPB AVC, or something else. At that point, the video is shrunk substantially, to 25, 50, 100Mb/s or so. SD cards are fast enough. 
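The scale of that shrink is easy to check. Hypothetical numbers here: uncompressed 1080p60 RGB against an assumed 50 Mb/s AVC recording bitrate:

```python
# Uncompressed 1080p60, 3 bytes per pixel (8-bit RGB).
raw_rate = 1920 * 1080 * 3 * 60              # bytes per second
print(f"uncompressed: {raw_rate / 1e6:.0f} MB/s")        # 373 MB/s

# A typical AVC recording bitrate (assumed here: 50 Mb/s).
compressed_rate = 50 * 1_000_000 / 8         # bytes per second
print(f"compressed: {compressed_rate / 1e6:.2f} MB/s")   # 6.25 MB/s

print(f"ratio: ~{raw_rate / compressed_rate:.0f}x")      # ~60x
```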
Let’s take a full frame image, a 20Mpixel shot, and see what happens when you make it a JPEG. Ok, so as mentioned, you don’t have 20,000,000 RGB pixels; you have a raw image with 5,000,000 red, 10,000,000 green, and 5,000,000 blue pixels. You know why we do this… photodiodes don’t see color. But why don’t we usually care? Because of the human eye… we have 120,000,000 or so sensors in the eye that see luminance, and only about 6,000,000 that see color. So all kinds of tricks get played with color on you mere mortals every day. 
So this raw image — in the camera’s really fast RAM — is converted to an RGB image in a process sometimes called “de-Bayering”. In simple terms, you’re interpolating. Let’s take a red pixel… you know its red value. But you’ll also find it’s surrounded by four green pixels (horizontally and vertically) and four blue pixels (on the diagonals). There’s a pretty good chance that the green at your current pixel is similar to all of its neighbors… so you interpolate: take the average of the neighbors’ values. And thus you get your G and B values to make a full color pixel. 
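A minimal sketch of that interpolation, using a made-up 3×3 neighborhood of raw values centered on a red photosite (layout assumed here purely for illustration; real demosaicing uses smarter, edge-aware algorithms):

```python
# Toy 3x3 patch of a Bayer RGGB mosaic, centered on a red photosite:
#   B G B
#   G R G
#   B G B
raw = [
    [50, 120, 54],
    [118, 200, 122],
    [52, 124, 56],
]

r = raw[1][1]                                              # native red sample
g = (raw[0][1] + raw[1][0] + raw[1][2] + raw[2][1]) / 4    # average of 4 greens
b = (raw[0][0] + raw[0][2] + raw[2][0] + raw[2][2]) / 4    # average of 4 blues
print(r, g, b)   # 200 121.0 53.0 -- one full-color pixel
```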
That could now be a 36 or 42-bit pixel, but heading for video or JPEG, we’re now going to knock it down to 24 bits per pixel. So that’s done… but then, while we’re at it, we’ll mess with color even more. Rather than store RGB pixels, the image processor will convert to a color space called YCrCb, which is a lossless transform from RGB color space. Now that we’re in YCrCb color space, we’re going to toss out 3/4 of the chroma samples. Basically, that means we’re going to keep 20,000,000 Y samples, but throw away most of the chroma. This is called chroma subsampling, and both JPEG and the most common video compressions employ 4:2:0 subsampling, if you’d like to learn more. What this means is that per pixel line, we’re only storing 1/2 as many chroma samples, and either Cr or Cb per line, not both. So that translates to only 5,000,000 Cr and 5,000,000 Cb samples per shot. So without actually getting to JPEG, we took a 60MB RGB image and converted it to a 30MB YCrCb 4:2:0 image. 
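The sample counting works out like this (5472×3648 is a hypothetical resolution standing in for the ~20 Mpixel shot):

```python
w, h = 5472, 3648               # hypothetical ~20 Mpixel frame
pixels = w * h                  # 19,961,856 pixels

rgb_bytes = pixels * 3          # 8-bit RGB: 3 bytes per pixel
y_samples = pixels              # one luma sample per pixel
cb = cr = pixels // 4           # 4:2:0: 1/4 as many of each chroma plane
yuv_bytes = y_samples + cb + cr # 1.5 bytes per pixel on average

print(f"RGB:        {rgb_bytes / 1e6:.1f} MB")   # ~59.9 MB
print(f"YCrCb 4:2:0 {yuv_bytes / 1e6:.1f} MB")   # ~29.9 MB
```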
Next comes the actual JPEG compression. JPEG breaks the image up into 8×8 blocks of pixels. Then it runs a reversible transform called the Discrete Cosine Transform, which takes the spatial data of each 64-pixel cell and converts it to frequency information. Various rules are then employed to cut out some of the higher frequency information from each cell. After all cells are so reduced, the result is Huffman encoded, a lossless compression. It’s the filtering of the high frequency information and the 4:2:0 subsampling that makes JPEG lossy. And you’ve probably come across different “strengths” of JPEG… smaller files or “finer” image. That controls just how the high frequency information is discarded. 
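A naive (and slow) version of that 8×8 DCT can be written straight from the definition. Fed a perfectly flat block, all the energy lands in the single DC coefficient and every higher-frequency coefficient is zero, which is exactly why smooth regions compress so well. Illustrative only; a real encoder uses a fast factored transform:

```python
import math

def dct2_block(block):
    """Naive 2-D DCT-II on an 8x8 block (what JPEG applies per cell)."""
    n = 8
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            cu = math.sqrt(0.5) if u == 0 else 1.0
            cv = math.sqrt(0.5) if v == 0 else 1.0
            s = sum(
                block[x][y]
                * math.cos((2 * x + 1) * u * math.pi / 16)
                * math.cos((2 * y + 1) * v * math.pi / 16)
                for x in range(n) for y in range(n)
            )
            out[u][v] = 0.25 * cu * cv * s
    return out

# A flat mid-gray block: all energy ends up in the DC term out[0][0].
flat = [[128] * 8 for _ in range(8)]
coeffs = dct2_block(flat)
print(round(coeffs[0][0]))   # 1024 -- the DC coefficient
print(round(coeffs[3][5]))   # 0 -- no high-frequency energy to keep
```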
For video, an AVC-Intra video is basically the same idea, only employing AVC innovations like variable block size, more modern lossless compression, etc. But it’s literally a bunch of single JPEG-like frames. 
When you go to IPB AVC, that’s a different story. “I” means “Independent” frame, which is again basically a JPEG-like still image. This is also called a key frame. The encoder makes an I-Frame, then it makes another frame. Only, it doesn’t encode that one as an I-Frame… it does a bunch of sophisticated analysis of the difference between the two frames, using motion estimation algorithms… the goal being that a small set of motion vectors, applied to that first I-Frame image, should produce something very similar to the second one. So that’s done, and a difference frame is made between the real second frame and the one calculated from the first frame. That’s the “error”, usually lots of black with little bits of ghostly pixels… that compresses extremely well. So that second frame becomes a “P” frame, for “Predictive”, which encodes a compact set of vectors and a very small error frame, also stored as a very compressed JPEG-like thing. If you’re able to analyze many frames at once, you can also have “B” frames, for “Bidirectional”… a B-Frame can be predicted from both the preceding and following frames. 
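The prediction-plus-residual idea in miniature. This is a toy frame pair where the motion vector is assumed to be already found (real motion estimation has to search for it, per block):

```python
WIDTH = 8

def shift_right(row, dx):
    """Shift a scanline right by dx pixels, filling with black (0)."""
    return [0] * dx + row[:WIDTH - dx]

frame1 = [[0] * WIDTH for _ in range(4)]
frame1[1][1] = frame1[1][2] = 255           # a small bright feature

# The "real" second frame: the feature moved 2 pixels to the right.
frame2 = [shift_right(row, 2) for row in frame1]

# Motion estimation (assumed done) found the vector dx=2; apply it
# to frame1 to build the prediction of frame2.
predicted = [shift_right(row, 2) for row in frame1]

# The residual ("error" frame) is what actually gets encoded.
residual = [[a - b for a, b in zip(r2, rp)]
            for r2, rp in zip(frame2, predicted)]
print(sum(abs(v) for row in residual for v in row))   # 0 -- perfect prediction
```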
In old fashioned MPEG-2, you had one I-Frame for every 15 total frames. In AVC, it’s technically possible to have hundreds of P or B frames between every I frame, but that would be a pretty unusual situation. 
Well, probably enough to chew on right now.


