How 'Pic of the Day' Works

Overview

Pic of the Day is a free service that delivers nude/erotic images to your inbox every day via an automated, personalized system. You can rate each pic using a simple five-star system, directly from your email. Each email also includes an unsubscribe link, of course.

Subscribe Now

When you rate an image, you'll be brought to your own personalized portal displaying the image you just rated, along with your rating history exposed in several different ways (model, calendar, favorites, etc.). Once collected, the ratings are used to select images that you'll like from the available pool.

Privacy

Just like your email address, your ratings are kept private. Aggregation of individual subscribers' ratings is used in certain cases (particularly for new subscribers), but all such uses are both internal and not traceable to you. I never share any individual's personal information, nor even the specific number of active subscribers. I occasionally make high-level data available (usually via interesting charts on my blog), but this is always both aggregated and summarized.

How Do I Make Money?

I don't. Zero, zip, zilch, nada. Joining is free, there are no ads, and your information is kept strictly private. I do it because the clustering and selection algorithms are fascinating, but only with a sufficiently large data set. By providing this service for free, I can increase the size of my data set without incurring any of the costs associated with "normal" data collection practices. Sometime around the beginning of April 2008, I crossed the one million record mark and the count continues to climb, without a single cent spent.

The Nitty-Gritty

Building the Collection

Everything starts with the collection of images. I have an automated spider that searches across the Web looking for suitable images. Suitability is determined by a number of factors: source, dimensions, filename, etc. Desirable images are downloaded, stored in Amazon's S3 service, and then added to the collection of available images.
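For the curious, a suitability check along these lines might look like the following Python sketch. It is purely illustrative (the real spider is Bash and wget), and every threshold, host name, and filename pattern here is a hypothetical placeholder rather than a rule the service actually uses.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# All values below are hypothetical placeholders, not the service's real rules.
TRUSTED_HOSTS = {"example.com"}
MIN_WIDTH, MIN_HEIGHT = 400, 400
REJECTED_NAME_PARTS = ("thumb", "banner", "logo")


@dataclass
class Candidate:
    url: str
    width: int
    height: int


def is_suitable(c: Candidate) -> bool:
    """Apply the three kinds of checks named above: source, dimensions, filename."""
    parsed = urlparse(c.url)
    filename = parsed.path.rsplit("/", 1)[-1].lower()
    if parsed.hostname not in TRUSTED_HOSTS:
        return False
    if c.width < MIN_WIDTH or c.height < MIN_HEIGHT:
        return False
    return not any(part in filename for part in REJECTED_NAME_PARTS)
```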

Images are run through a couple of different deduplication algorithms. The first is a simple file checksum which weeds out identical images that were double-spidered (or more likely double-posted). Such images are spot-checked to ensure they really are duplicates (i.e. have the same pixels, not just the same checksum), and are then automatically deleted.
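As a rough illustration of checksum-based deduplication, here is a minimal Python sketch. The choice of SHA-256 and the function name are assumptions, and the verification step compares raw bytes as a stand-in for the pixel-level spot check described above.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(images: dict[str, bytes]) -> list[list[str]]:
    """Group image names whose file contents are identical."""
    by_digest = defaultdict(list)
    for name, data in images.items():
        by_digest[hashlib.sha256(data).hexdigest()].append(name)

    groups = []
    for names in by_digest.values():
        if len(names) < 2:
            continue
        # Verify byte-for-byte so a checksum collision is never treated as a duplicate.
        first = images[names[0]]
        confirmed = [n for n in names if images[n] == first]
        if len(confirmed) > 1:
            groups.append(sorted(confirmed))
    return groups
```

Everything but the first name in each group would be a candidate for deletion.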

Next, images are run through an equivalence algorithm that tries to find images that aren't equal at a binary level, but are the same picture. Usually this is caused by downloading the same image at two different resolutions, or perhaps the addition of a watermark. This algorithm is incredibly weak at this point; the number of false positives brings it almost to the point of irrelevance. But it's an interesting problem.
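One common way to approach this kind of problem is an "average hash": shrink each image to a tiny grid, threshold each cell against the grid's mean, and compare the resulting bit patterns. The Python sketch below is a swapped-in textbook technique, not the algorithm the service uses, and it assumes the image has already been decoded to a 2D list of grayscale values.

```python
def average_hash(pixels, size=8):
    """Return size*size bits describing a grayscale image's coarse brightness layout.

    pixels: 2D list of grayscale values, at least size x size, rows of equal length.
    """
    h, w = len(pixels), len(pixels[0])
    cells = []
    for gy in range(size):
        y0, y1 = gy * h // size, (gy + 1) * h // size
        for gx in range(size):
            x0, x1 = gx * w // size, (gx + 1) * w // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    return [1 if c > mean else 0 for c in cells]


def hamming(a, b):
    """Count the positions at which two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))
```

Two images would be flagged as probable matches when the Hamming distance falls below some threshold. Picking that threshold is where false positives come from.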

Almost done. Now it's time to strip off borders and anything else that isn't part of the image. This is a simple algorithm that looks for clearly demarcated bands of color around the edges of an image and then crops them away.
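A minimal Python sketch of that idea follows. It assumes the image is a 2D list of pixel values and that a "band" is a row or column made of exactly one color; a real implementation would presumably tolerate some noise, so treat the exact-match rule as a simplifying assumption.

```python
def crop_uniform_border(pixels):
    """Trim single-color rows and columns from the edges of a 2D pixel grid."""

    def uniform(line):
        # A single pixel is not evidence of a band, so never crop on it.
        return len(line) > 1 and all(p == line[0] for p in line)

    top, bottom = 0, len(pixels)
    while bottom - top > 1 and uniform(pixels[top]):
        top += 1
    while bottom - top > 1 and uniform(pixels[bottom - 1]):
        bottom -= 1

    def column(x):
        return [row[x] for row in pixels[top:bottom]]

    left, right = 0, len(pixels[0])
    while right - left > 1 and uniform(column(left)):
        left += 1
    while right - left > 1 and uniform(column(right - 1)):
        right -= 1

    return [row[left:right] for row in pixels[top:bottom]]
```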

The last cleanup step is resizing. No emailed image is more than 1024 pixels wide or 850 pixels tall. This ensures that images can be viewed in your email with minimal scrolling. Some images are spidered at two or three times those dimensions, which would make viewing very difficult.
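The arithmetic behind that limit can be sketched in a few lines of Python. Preserving the aspect ratio and never upscaling are assumptions here; only the 1024 by 850 bounds come from the description above.

```python
MAX_WIDTH, MAX_HEIGHT = 1024, 850


def fit_within(width, height, max_w=MAX_WIDTH, max_h=MAX_HEIGHT):
    """Return target dimensions that fit the bounds while keeping the aspect ratio."""
    scale = min(max_w / width, max_h / height, 1.0)  # 1.0 cap: never upscale
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 2048x1536 image is limited by its width and comes out at 1024x768, while an 800x600 image is left alone.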

Once the images are all ready to go, they run through a face detection and recognition flow that identifies the people in them. Like the content-based deduplication, this is still really rough and doesn't do a very good job. But it's improving.

Indexing the Collection

Every night, the available images (and their ratings) are run through Weka to cluster related images together. This clustering is done entirely based on the rating data from subscribers. This is the one part of the process where I'm in WAY over my head, code-wise. But it's also the fun part. ;) Trying to synthesize what Weka does would take me a decade; even learning to use it effectively took a fair amount of effort, and I've still a long way to go.
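Weka reads its input from ARFF files, so handing it rating data means writing one. The Python sketch below shows one plausible layout (one row per image, one numeric column per subscriber). That layout, the attribute names, and the function itself are assumptions for illustration, not the service's actual export.

```python
def ratings_to_arff(ratings, subscribers, relation="image_ratings"):
    """Serialize ratings as ARFF text.

    ratings: {image_id: {subscriber_id: stars}}
    subscribers: ordered list of subscriber ids, one column each
    """
    lines = [f"@RELATION {relation}", ""]
    lines += [f"@ATTRIBUTE sub_{s} NUMERIC" for s in subscribers]
    lines += ["", "@DATA"]
    for image_id in sorted(ratings):
        row = ratings[image_id]
        # ARFF marks a missing value (here: no rating yet) with "?".
        lines.append(",".join(str(row.get(s, "?")) for s in subscribers))
    return "\n".join(lines) + "\n"
```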

The basic idea is to use ratings to figure out which images are similar. "Similarity" here has nothing to do with the actual content of the image, just that there is some unknown factor that makes them similar. This same similarity can be deduced between subscribers (John and Sam like similar things, but Paul likes something else). Finally, there's the subject/model of the pictures which can be used to relate them (though interestingly, this does not correlate with the rating-based similarities very much).
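To make "similar by ratings" concrete, here is a small Python sketch using cosine similarity over the subscribers who rated both images. This is a standard textbook measure chosen for illustration. The service delegates the real work to Weka, and the source doesn't say which measure is used.

```python
from math import sqrt

MIDPOINT = 3  # middle of the five-star scale


def rating_similarity(a, b):
    """Cosine similarity of two images' ratings, over subscribers who rated both.

    a, b: {subscriber_id: stars}. Returns a value in [-1, 1]; 0.0 if undefined.
    """
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    # Center on the midpoint so likes and dislikes pull in opposite directions.
    dot = sum((a[s] - MIDPOINT) * (b[s] - MIDPOINT) for s in shared)
    norm_a = sqrt(sum((a[s] - MIDPOINT) ** 2 for s in shared))
    norm_b = sqrt(sum((b[s] - MIDPOINT) ** 2 for s in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Swapping the roles (one dictionary per subscriber, keyed by image) gives the subscriber-to-subscriber version of the same measure.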

Selecting and Sending the Pics

After clustering is complete, each cluster is weighted based on each subscriber's individual ratings to arrive at per-cluster rankings for the subscriber. This is basically saying "subscriber X likes images of cluster Y more than cluster Z." Armed with that information and a touch of aggregate ratings, the available images are assigned an "expected desirability" for each subscriber that hasn't been sent the image yet. The end result is entirely automated predictions of which images in the pool you'll probably like best.
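A hedged Python sketch of that scoring step: average the subscriber's own ratings per cluster, then blend in a little of each image's overall average. The blend weight, the fallback for clusters the subscriber has never rated, and all names are assumptions made for illustration.

```python
def expected_desirability(image_cluster, image_avg, subscriber_ratings,
                          already_sent, aggregate_weight=0.2):
    """Score every not-yet-sent image for one subscriber.

    image_cluster: {image_id: cluster_id}
    image_avg: {image_id: mean stars across all subscribers}
    subscriber_ratings: {image_id: stars} for this subscriber
    already_sent: set of image ids this subscriber has received
    """
    # Per-cluster mean of this subscriber's ratings ("X likes Y more than Z").
    totals = {}
    for image_id, stars in subscriber_ratings.items():
        cluster = image_cluster[image_id]
        total, count = totals.get(cluster, (0, 0))
        totals[cluster] = (total + stars, count + 1)
    cluster_score = {c: total / count for c, (total, count) in totals.items()}

    scores = {}
    for image_id, cluster in image_cluster.items():
        if image_id in already_sent:
            continue
        # Fall back to the aggregate when the subscriber has no history in this cluster.
        personal = cluster_score.get(cluster, image_avg[image_id])
        scores[image_id] = ((1 - aggregate_weight) * personal
                            + aggregate_weight * image_avg[image_id])
    return scores
```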

At the top of every hour, the system looks for subscribers that haven't received their Pic of the Day yet. Which subscribers get their Pic of the Day in a given hour is randomized, but a one-Pic-per-day rule (not zero, not two) is enforced. Note that "day" means a calendar day in Pacific (US West Coast) time, not a rolling 24-hour period, and is not aligned to your local timezone.
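The calendar-day rule is easy to get subtly wrong, so here is a Python sketch of one way to express it. The even spreading across the remaining hours is an assumption (the source only says the choice is randomized), and the timezone is passed in as a parameter; for real Pacific time with daylight saving, one would pass `zoneinfo.ZoneInfo("America/Los_Angeles")`.

```python
import random
from datetime import datetime, timedelta, timezone


def due_subscribers(last_sent, now, tz):
    """Return subscribers who have not yet received a Pic on today's calendar day in tz.

    last_sent: {subscriber_id: timezone-aware datetime of the last Pic, or None}
    """
    today = now.astimezone(tz).date()
    return [s for s, sent in last_sent.items()
            if sent is None or sent.astimezone(tz).date() < today]


def pick_hourly_batch(due, hours_left_today, rng):
    """Randomly pick this hour's share of the subscribers still due today."""
    if hours_left_today <= 1:
        return list(due)  # last hour of the day: nobody may end up with zero
    count = -(-len(due) // hours_left_today)  # ceiling division
    return rng.sample(due, count)
```

Because `due_subscribers` filters on the calendar date, a subscriber who already got today's Pic can never be picked twice, which covers the "not two" half of the rule.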

As you'd expect, the system sends the pics it thinks you'd like best most of the time. However, the system can only suggest images that have been ranked by other subscribers. Since new images must be added to the pool, a small fraction of images are chosen at random from all unranked images. Like everything else, the system tracks its own behaviour and increases the influx of new images when it needs them, and decreases it when it doesn't.
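This trade-off resembles what is usually called an epsilon-greedy strategy: mostly pick the best-known option, occasionally try something new. The Python sketch below is an illustration under that assumption; the rate, step size, bounds, and the "pool running low" signal are all hypothetical.

```python
import random


def choose_pic(predicted, unranked, explore_rate, rng):
    """Pick the top predicted image, or occasionally a random unranked one.

    predicted: {image_id: expected desirability}
    unranked: list of image ids nobody has rated yet
    """
    if unranked and (not predicted or rng.random() < explore_rate):
        return rng.choice(unranked)
    if not predicted:
        return None  # nothing left to send
    return max(predicted, key=predicted.get)


def adjust_explore_rate(rate, ranked_unsent, target, step=0.01, lo=0.01, hi=0.5):
    """Nudge the rate up when the ranked pool runs low, down otherwise."""
    rate += step if ranked_unsent < target else -step
    return min(hi, max(lo, rate))
```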

Cleaning Up

The last major step is cleaning up. If you don't rate a pic for whatever reason, you'll continue to get your Pic of the Day each day. However, after a couple of weeks, the system will send you what's called a BackPic with the image you didn't rate in it to give you another chance.
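Selecting BackPic candidates boils down to a filter on age and rating status. In this Python sketch, the 14-day threshold is a reading of "a couple of weeks" and the oldest-first, one-at-a-time limit is a guess at what "carefully managed" might mean; both are assumptions.

```python
from datetime import date, timedelta


def backpic_candidates(sent, rated, today, min_age_days=14, limit=1):
    """Return unrated images sent at least min_age_days ago, oldest first.

    sent: {image_id: date the image was emailed}
    rated: set of image ids the subscriber has rated
    """
    cutoff = today - timedelta(days=min_age_days)
    overdue = [(sent_on, image_id) for image_id, sent_on in sent.items()
               if image_id not in rated and sent_on <= cutoff]
    return [image_id for _, image_id in sorted(overdue)[:limit]]
```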

The end goal is to ensure you get the pics you want. That requires the system to have knowledge about what you like/dislike (i.e. ratings). I know as well as the next guy that sending a large volume of email is a quick way to get ignored, so BackPics are carefully managed to ensure you don't get overwhelmed. However, BackPics have proven to be of significant benefit towards the goal of maximizing ratings.

Support

While all this might sound fairly straightforward, it requires a huge amount of support code. In particular, tuning the algorithms requires easy access to all the data underlying the system in an easy-to-interpret way. The amount of code directly responsible for the process outlined above is dwarfed by the code for tracking the system's behaviour and monitoring its internals. Tracking performance over time is also crucial, as many types of tweaks take a while to manifest their results. And, of course, the proper metrics to monitor change over time, based on the algorithms in play.

The most interesting subsystem is probably the monitoring bits: a pluggable system for tracking basically any metric you can express in code (SQL, CFML, whatever). All that data is time-aligned and then aggregated into structures that can be used to generate either snapshot reports or charts over time. These reports are used not only by me to see how well I'm doing as the developer, but also by the system itself to monitor and then tune/self-correct its performance.
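The shape of such a pluggable system can be sketched in Python as a registry of named callables whose samples are snapped to a common time bucket. The real thing is CFML and SQL, so the class, its method names, and the hourly bucket size are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime


class MetricRegistry:
    """Collect arbitrary metrics and align their samples to hourly buckets."""

    def __init__(self):
        self._metrics = {}                  # name -> zero-argument callable
        self._samples = defaultdict(dict)   # name -> {bucket start: value}

    def register(self, name, fn):
        """Plug in a metric; fn can wrap a SQL query or any other code."""
        self._metrics[name] = fn

    def collect(self, now):
        """Sample every metric, storing the value under the hour containing now."""
        bucket = now.replace(minute=0, second=0, microsecond=0)
        for name, fn in self._metrics.items():
            self._samples[name][bucket] = fn()  # latest sample in a bucket wins

    def series(self, name):
        """Return (bucket, value) pairs in time order, ready for charting."""
        return sorted(self._samples[name].items())
```

Because every metric shares the same buckets, any two series can be lined up on one chart or compared in a snapshot report.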

Infrastructure is also important. As I said above, all images are stored on Amazon's S3 platform. This provides both high reliability and virtually infinite scalability for pennies a month. All the precious data is stored in a couple of MySQL databases that are also backed up to S3. Similarly, the code that drives the entire system is meticulously version controlled with Subversion and, you guessed it, backed up to S3.

The server is Apache 2.2 fronting Railo 3 (on Tomcat and Sun JVM) with MySQL 5 behind it, all running on CentOS 5 Linux.  The hardware itself is a single Pentium 4 with 1GB of RAM.  And that hardware is shared with a pile of other CFML apps (all running in a single Railo instance), eight WordPress blogs, and two Magnolia CMS sites.

The spider is written almost entirely in Bash using wget. The business tier is CFCs managed by ColdSpring with hand-coded SQL throughout.  The clustering and learning algorithms are mostly Weka via ARFF transfer.  The UI is mostly static HTML through FB3Lite, though numerous performance-sensitive pieces are Ajax-driven by jQuery. mod_rewrite is used extensively for caching of generated thumbnails and virtualizing the URL space, as well as ensuring backwards compatibility.  All the visualization is done with the Google Chart API, which has replaced an SVG-based charting engine I'd built back in 2003-2004.

April 2, 2011


This site is intentionally not tagged with content ratings labels. Protecting children by building unmanaged walls does a disservice to both child and society. Yes, I have kids. Two of them.
