2012-08-31

learn distortion models for common cameras

Our astronomical image-recognition software Astrometry.net (see Lang et al. 2010) does a very good job on professional-grade astronomical images. It is less reliable on snapshots taken with normal, wide-angle, and fish-eye lenses. This is ironic, because from an information-theory point of view, such images are much easier to recognize: They contain obvious, familiar constellations. The problem is that these wide-field shots have substantial geometric distortions in the field. These distortions foil Astrometry.net because they make the mapping from sky to focal plane non-conformal (squares on the sky don't go to squares on the image).

A further irony is that these distortions are common across cameras and highly predictable. They should be extremely predictable using the EXIF meta-data in image headers, and even without that I bet the distortions live in a small family of possible choices. The project is: Find out what these standard distortions are, and fix Astrometry.net so it knows about them in advance, either by de-distorting the star list that it eats (the easy option) or by making the star-figure-matching step invariant to those kinds of distortions (the hard option). Actually, even easier would be to have Astrometry.net automatically lower its tolerances as the asterisms get close to the field edges!
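For concreteness, here is a minimal sketch of the easy option, assuming a standard radial (Brown-Conrady-style) polynomial distortion model. The coefficients k1 and k2 and the distortion center are placeholders, not anything Astrometry.net knows about today; in practice they would come from an EXIF-keyed lookup table or from fits to solved fields:

```python
import numpy as np

def undistort(xy, center, k1, k2):
    """Map detected (distorted) pixel positions toward idealized positions
    with a radial polynomial correction. k1 and k2 are hypothetical
    per-camera coefficients; a real pipeline would fit or look them up,
    and would invert the camera's forward model properly."""
    r = xy - center                              # offsets from distortion center
    rho2 = np.sum(r**2, axis=1, keepdims=True)   # squared radius per star
    scale = 1.0 + k1 * rho2 + k2 * rho2**2       # radial correction factor
    return center + r * scale

# toy star list (pixels) for a hypothetical 3000x2000 wide-angle shot
stars = np.array([[100.0, 150.0], [1500.0, 1000.0], [2900.0, 1900.0]])
fixed = undistort(stars, center=np.array([1500.0, 1000.0]),
                  k1=-3e-9, k2=0.0)  # made-up barrel-distortion strength
print(fixed)
```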

2012-08-30

find non-galaxies in the SDSS

The SDSS has issues with bright stars and galaxies: The pipelines were made for 17th-magnitude objects, and therefore (a) the point-spread function doesn't include features like diffraction spikes or large-angle wings, and (b) de-blending of much brighter galaxies is overly aggressive. These are known problems with the SDSS, but they haven't been analyzed (in the literature) in complete detail (meaning: many SDSS investigators know a lot about these things, but they don't appear in papers in one coherent place).

In experiments conducted by Dustin Lang (CMU) and myself, we found that you can find rings of blue galaxies around red galaxies (really just blue spiral arms around red bulges) and very elongated galaxies pointing at bright stars (really just unmodeled diffraction spikes in the PSF being fit as galaxies). Both of these kinds of anomalies in the survey data are generally flagged: The SDSS pipelines are very clever about figuring out where they are going wrong or out-of-spec; most SDSS data analyses have used aggressive flag cuts to remove possibly problematic galaxies. The most interesting objectives from my perspective are (1) building a full catalog or list or annotation of all (all in some category, like diffraction-spike) anomalous (that is, wrong) galaxies in the SDSS, (2) identifying anomalous galaxies that wouldn't have been caught by one of the standard, aggressive flag cuts, and (3) figuring out what fraction of aggressively cut-out galaxies are in fact real and trustworthy.
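As a minimal sketch of what the hunt for the diffraction-spike category in objective (1) might look like, one could flag elongated "galaxies" whose major axes point at a nearby bright star. Everything below (array names, tolerances) is a hypothetical stand-in, not real SDSS catalog columns:

```python
import numpy as np

def spike_suspects(gal_ra, gal_dec, gal_phi, star_ra, star_dec,
                   max_sep_deg=0.05, align_tol_deg=10.0):
    """Return indices of 'galaxies' whose major axis (position angle
    gal_phi, degrees E of N) points at a bright star within max_sep_deg.
    Inputs are plain numpy arrays of coordinates in degrees."""
    suspects = []
    for i in range(len(gal_ra)):
        dra = (star_ra - gal_ra[i]) * np.cos(np.deg2rad(gal_dec[i]))
        ddec = star_dec - gal_dec[i]
        near = np.hypot(dra, ddec) < max_sep_deg
        if not np.any(near):
            continue
        # position angle from galaxy toward each nearby star, E of N
        pa = np.rad2deg(np.arctan2(dra[near], ddec[near]))
        # axis alignment: compare angles modulo 180 degrees
        dpa = np.abs(((pa - gal_phi[i]) + 90.0) % 180.0 - 90.0)
        if np.any(dpa < align_tol_deg):
            suspects.append(i)
    return suspects
```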

I have made this post about the SDSS but of course it applies to any large imaging survey with an automatically produced catalog. Indeed, we did very similar things for USNO-B and there are more to do there. One amusing thing about the project is that the bright stars in the SDSS (the ones that show diffraction spikes) are themselves often classified as galaxies, because they become extended when they saturate. But that's a detail!

2012-08-29

build a paragraph-level index for arXiv

There is plenty of ready-to-use technology for author-topic modeling, in part because so many of the machine-learning community's successes have been in text processing and classification. Some of this technology has been run on the arXiv at the level of abstracts, titles, and meta-data, but rarely at the full-text level. If we ran it at full text, we could build an author-topic model over every paragraph in the full corpus of the arXiv. This would tell us what every paragraph is about (and, amusingly, who wrote every paragraph), with (if we do it right) probabilistic output. That is, it would give probability distributions over what every paragraph is about.

If done correctly, and with a little hand-labeling of classes (and this is easy because each class would have characteristic words and phrases and authors), this could lead to a complete and exceedingly useful paragraph-level index into the arXiv. But even without the hand-labeling it would be incredibly useful: It could be used to find paragraphs in the literature that are useful and relevant to every paragraph you have written in one of your own papers, thus locating related work you might not know about. It could drive search services that find paragraphs relevant to your search terms even when they don't, themselves, contain those terms. And so on!
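As a toy illustration of the modeling step, here is plain LDA over paragraphs with scikit-learn. A real run would use the full-text paragraphs and a proper author-topic model rather than vanilla LDA, but the output has the same shape: a probability distribution over topics for every paragraph:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

paragraphs = [
    "We measure the two-point correlation function of luminous red galaxies.",
    "The posterior distribution is sampled with Markov chain Monte Carlo.",
    "Weak lensing shear calibration requires an accurate point-spread function.",
]  # stand-ins for the millions of paragraphs of arXiv full text

# bag-of-words counts per paragraph, then a topic model over them
counts = CountVectorizer(stop_words="english").fit_transform(paragraphs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

theta = lda.transform(counts)  # per-paragraph probabilities over topics
print(theta)
```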

Dan Foreman-Mackey (NYU) first made it clear to me that this would be possible and he also took some steps towards making it happen. David Blei (Princeton) suggested to me that even from a machine-learning perspective the outcomes could be very interesting. The plagiarism paper from the arXiv people suggests working at PDF level rather than LaTeX source level; I am not sure myself which to do.

2012-08-28

will chemical tagging work?

Chemical tagging is the name given to the idea that we can match up stars in abundance space (detailed chemical properties) as well as kinematic space to figure out the origin and common orbits of stars in the Milky Way. Because it would be so valuable to figure out that different stars shared a common origin at formation (for things like orbit inference), chemical tagging could enormously improve the precision of any dynamical or galaxy-formation information coming from next-generation surveys.

In the many conversations about chemical tagging I have witnessed, arguments break out about whether it is possible to measure the chemical abundances of stars of different temperatures and surface gravities comparably. That is: Can we figure out that this F star has the same abundances as this other K star? Or this red giant and this main-sequence star? And it is certainly not clear: Chemical abundances are not measured at enormous precision, and there are many possible biases, sources of variance, and systematic errors.

My proposal is that we ask these questions not in the space of the outputs of chemical-abundance models but rather in the space of stellar spectroscopy observables. The question becomes not "are the models good enough?" but rather "is there information in the data?" And there needs to be information sufficient to distinguish thousands (yes, that is the goal) of chemically distinct sub-populations.

If there is sufficient information, then in the dozens-to-hundreds-of-dimensions space of all possible absorption-line measurements (plus stellar temperature), do we see thousands of distinct families of (possibly very complex) one-dimensional loci (each locus being a birth-mass-sequence at fixed chemical abundances and age)? The idea would be to do this purely in the space of spectra but—probably necessarily—relying heavily on models to guide the eye (or really guide the code) where to look.

I have discussed this with Ken Freeman (ANU) and Mike Blanton (NYU), but as far as I know, no-one is working on it. Blanton had the great idea that we don't really need to measure spectral features before starting. The question "does the distribution of stellar spectra split up into many tiny, thin, curvy lines in spectrum space?" can be asked with just well-calibrated spectra. And we have lots of those!
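Here is a minimal sketch of a first look at Blanton's version of the question, assuming continuum-normalized spectra on a common wavelength grid. DBSCAN is just a stand-in clusterer (finding thin one-dimensional loci, rather than compact clumps, would need something more tailored), and the eps setting would have to be tuned to real data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

# stand-in: N continuum-normalized spectra on a common wavelength grid;
# replace with real, well-calibrated survey spectra
rng = np.random.default_rng(0)
spectra = rng.normal(1.0, 0.01, size=(500, 3000))

# compress to the few dozen directions that could carry the abundance
# information, then look for many small, dense groups of stars
coeffs = PCA(n_components=30).fit_transform(spectra)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(coeffs)  # eps: tune!

n_families = len(set(labels)) - (1 if -1 in labels else 0)
print("candidate chemical families found:", n_families)
```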

2012-08-27

emission-line clustering and classification

The BPT diagram has been incredibly productive in classifying galaxies into star-forming and AGN-powered classes. However, the diagram shows only two ratios of nearby lines: ratios of nearby lines so that dust and spectrograph calibration don't mess up the data, and only two because it is a single two-dimensional plot. There might be many features in emission-line space sitting undiscovered in the data; there might be many sub-classes and rich structure within the star-forming and AGN groups.

From a data perspective, times have really changed since BPT: (1) There are dozens (well, a dozen) of visible lines in hundreds of thousands of spectra. (2) We have good noise models for the line measurements, and this is especially important when they get low in signal-to-noise (as they do if you want to use many lines). (3) We have very well-calibrated spectra now, even spectrophotometrically good to a few percent in the SDSS. (4) The effects of dust attenuation are pretty well understood in the optical. So let's go high dimensional and find all the complex structure that must be there!

The first step is to measure all the lines in a long list, and measure them even when the signal-to-noise is low. We don't care about detections; we care about measurements with well-understood noise. The second step is to develop dust-insensitive metrics: What is the distance in data space between two sets of line measurements as a function of noise, marginalizing out the dust affecting each spectrum? Now, in that space, let's do some clustering.
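Here is a minimal sketch of one such metric, under the assumption that attenuation multiplies each line flux by exp(-t k(lambda)) for a fixed attenuation law, so that in log-flux space dust moves a spectrum along a known direction d. The distance then profiles out (minimizes over) the unknown differential dust amplitude. Working in log fluxes is itself an approximation that gets dicey at low signal-to-noise:

```python
import numpy as np

def dust_marginalized_dist2(x, y, C, d):
    """Chi-squared distance between two vectors of log line fluxes x and y
    with combined covariance C, minimizing over an unknown differential
    dust amplitude t along the attenuation direction d (the vector of
    attenuation-law values k(lambda) at the line wavelengths)."""
    Cinv = np.linalg.inv(C)
    r = x - y
    t = d @ Cinv @ r / (d @ Cinv @ d)  # best-fit differential dust
    r = r - t * d                      # remove the dust component
    return r @ Cinv @ r

# toy example with three lines and diagonal noise; all numbers made up
x = np.log(np.array([10.0, 3.0, 1.2]))
y = np.log(np.array([8.0, 2.6, 1.0]))
C = np.diag([0.01, 0.02, 0.05])
d = np.array([1.3, 1.0, 0.8])  # stand-in attenuation-law values
print(dust_marginalized_dist2(x, y, C, d))
```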

I have done nothing on this except discuss it, years ago, with John Moustakas (Siena College). At that time, we were thinking in terms of generating archetypes with an integer program (with my now-deceased guru Sam Roweis). You could use things like support vector machines (great for these kinds of tasks), but we have no labels to classify on; the idea is to find classes not yet discovered! Also, SVMs are not sensitive to the uncertainties in the data. I would recommend something like extreme deconvolution, which does density estimation of the noise-deconvolved distribution. It can deal with very low signal-to-noise data gracefully. It would have to be modified, however, to project out (marginalize out) the dust-extinction direction in line space. Not impossible, but not trivial either.

2012-08-26

what is the spectrum of dust attenuation?

The SDSS has taken spectra of thousands of F-type stars, at different distances and through different amounts of interstellar dust. These stars were chosen for calibration purposes; they were chosen because they have very well-understood and consistent spectra. These have been used to calibrate the SDSS telescope, but they can also be used to calibrate interstellar dust.

The general procedure would be to start by measuring the equivalent widths of a few absorption lines (preferably a couple of Balmer lines and a couple of metal lines) consistently for all F-stars. These line EWs would provide dust-independent temperature and metallicity indicators for all the stars. Then compare the spectra of the F-stars at different reddenings but fixed absorption-line equivalent widths (and therefore fixed temperature and metallicity) to get the dust attenuation at a spectral resolution of a few thousand. There probably isn't anything interesting there, but if there is, it would be a valuable discovery.

The easiest way to do this project is by spectral stacking, but there might be methods that build a non-linear model of the stellar spectrum with three controlling parameters: Balmer EW, metal EW, and SFD-dust-map amplitude. I started discussing this project many years ago with Karl Gordon (STScI); if you want to give it a shot, send us both email for ideas (if you want to; otherwise do it and surprise us!).
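Here is a minimal sketch of the stacking version, with all inputs as placeholders. A real analysis would also bin on the metal-line EW and propagate the measurement noise:

```python
import numpy as np

def attenuation_curve(spectra, ebv, balmer_ew, n_bins=5):
    """Within bins of fixed Balmer EW (a dust-independent temperature
    proxy), ratio the mean spectrum of high-reddening stars to that of
    low-reddening stars. `spectra` is (N, n_pix) flux on a common grid;
    `ebv` is an SFD-style reddening per star. Returns the attenuation in
    mag per unit E(B-V) as a function of wavelength pixel."""
    curves = []
    edges = np.quantile(balmer_ew, np.linspace(0.0, 1.0, n_bins + 1))
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = (balmer_ew >= lo) & (balmer_ew < hi)
        if sel.sum() < 20:          # skip poorly populated bins
            continue
        e, s = ebv[sel], spectra[sel]
        high, low = e > np.median(e), e <= np.median(e)
        de = e[high].mean() - e[low].mean()
        curves.append(-2.5 * np.log10(s[high].mean(0) / s[low].mean(0)) / de)
    return np.mean(curves, axis=0)  # average over temperature bins
```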

2012-08-25

javascript radio interferometry simulator

This isn't quite a full-blown research project, but it could evolve into one if done correctly. I want a multi-panel browser view, with one panel being an input panel in which I can arrange and set the brightnesses of point sources in an astronomical scene or a sky patch, and then a few panels showing the real part, imaginary part, amplitude, and phase of the Fourier transform of the scene. It should all run in the browser for speed and flexibility. There could also be panels in which you set down antennae, get baselines in the uv plane (possibly as a function of wavelength and time as the Earth rotates), and see the dirty-beam reconstructed scene (and maybe also the clean-beam reconstructed scene). This could be used to develop intuition about radio astronomy and the Fourier transform. If done right, it could also be used to plan observations (indeed, it could have an ALMA mode where it knows about the ALMA antennae). If done really right, it could be used to aid in data analysis.
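The app itself would be JavaScript, but the math the panels would display is simple. Here is a sketch in Python of the visibilities of a point-source scene, V(u,v) as a sum of complex exponentials, with made-up source positions and baselines:

```python
import numpy as np
import matplotlib.pyplot as plt

# point sources: (x, y) on-sky positions in radians, plus fluxes (made up)
xy = np.array([[0.0, 0.0], [2e-5, 1e-5]])
flux = np.array([1.0, 0.5])

# grid of (u, v) baselines in wavelengths
u = np.linspace(-5e4, 5e4, 256)
U, V = np.meshgrid(u, u)

# V(u,v) = sum_k f_k exp(-2*pi*i*(u*x_k + v*y_k))
phase = -2j * np.pi * (U[..., None] * xy[:, 0] + V[..., None] * xy[:, 1])
vis = np.sum(flux * np.exp(phase), axis=-1)

# the four panels the browser view would show
panels = [vis.real, vis.imag, np.abs(vis), np.angle(vis)]
titles = ["real", "imag", "amplitude", "phase"]
for ax, img, title in zip(plt.subplots(1, 4, figsize=(14, 3))[1], panels, titles):
    ax.imshow(img, origin="lower")
    ax.set_title(title)
plt.show()
```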

2012-08-24

get SDSS colors and magnitudes for very bright stars

The SDSS saturates around 14th magnitude. However, (a) the gains are set such that the CCD pixels saturate before the analog-to-digital read-out saturates, and (b) the bleeding of charge on the CCD is essentially charge-conserving. Also, when very bright stars cross the readout register in the CCD, they leave a thin 2048-pixel line across the full camera column. And also also, the stars have well-defined diffraction spikes that are visible to large angular radii.

No-one says this is easy; this is a blog of good ideas, not easy ideas. For one, the detector may become weakly nonlinear shortly before CCD pixel saturation; that is, the effective gain may be lower at brighter magnitudes. Any project would have to look carefully into this, and you don't have variable exposure times to use (all of SDSS was taken at 55-second exposure time for very important reasons). For another, the shape and size of the diffraction spikes might be a strong function of position in the focal plane. However, I have hope, because the charge bleeds are so very, very beautiful when inspected in detail.
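To make the charge-conservation idea concrete, here is a rough sketch that totals the counts in a generous box around a saturated star, bleed trails included. The zeropoint and box size are placeholders, and real SDSS work would need the per-run photometric calibration and the nonlinearity check described above:

```python
import numpy as np

def bleed_photometry(img, center, half=200, zeropoint=25.0):
    """Sum the counts in a (2*half)^2 box around a saturated star,
    subtract a background estimated from the box edges, and convert to
    a magnitude. `zeropoint` is a hypothetical counts-to-mag zeropoint
    for the fixed 55-second SDSS exposure, not a real calibration."""
    y, x = center
    box = img[max(0, y - half):y + half, max(0, x - half):x + half]
    # crude sky level from the box perimeter
    sky = np.median(np.concatenate([box[0], box[-1], box[:, 0], box[:, -1]]))
    total = np.sum(box - sky)  # total charge, bleed trails included
    return zeropoint - 2.5 * np.log10(total)
```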

Some prior work on this has been done by myself and Doug Finkbeiner (Harvard). It would be worth checking in with Fink before embarking.

2012-08-23

bimodality search or kurtosis components analysis

Take the SDSS spectra (which are beautifully calibrated spectrophotometrically) and interpolate them onto a common rest-frame (de-redshifted) wavelength grid. Do clever things to interpolate over missing and corrupted data where necessary; this might involve performing a PCA and using the PCA to patch and then re-doing PCA and so on. Then re-normalize the data so that the amplitudes of all the spectra are the same; I am being vague here because I don't know the best choice for definition of amplitude. This is all pre-conditioning for the data; in principle the recommendation here could be applied to any data set; I am just proposing the SDSS spectra.

Now search for a unit-norm (or otherwise normalized) eigenspectrum such that, when you dot all pre-conditioned SDSS spectra onto the eigenspectrum, you obtain a distribution of coefficients (dot products) that has minimum kurtosis. That is, instead of finding the principal components—the components with maximum variance—we will look for the platykurtic components—the components with minimum kurtosis. If you are stoked, search the orthogonal subspace for the next-to-minimum kurtosis direction, and so on.

Why, you ask? Because low-kurtosis distributions are bimodal. Indeed, early experiments (performed by Vivi Tsalmantza (MPIA) and myself back in 2008) indicate that this will identify the eigenspectra that best separate the red-sequence galaxies from the blue cloud. If you really want to go to town, invent a bimodality scalar that is better than kurtosis.

One note: Optimization is a challenge; this sure ain't convex. My approach back in the day was to throw down randomly generated spectra, keep the ones that happened to hit fairly low kurtosis, and optimize locally from those.
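Here is a sketch of that strategy. For real spectra you would first project onto a modest-dimensional subspace (for example, the leading principal components), since a local optimizer over thousands of raw pixels is hopeless:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import kurtosis

def min_kurtosis_direction(X, n_restarts=32, seed=0):
    """Random restarts plus local optimization for the unit vector w that
    minimizes the kurtosis of the projections X @ w. X is (n_spectra,
    n_dim), already pre-conditioned and normalized as described above."""
    rng = np.random.default_rng(seed)

    def objective(w):
        w = w / np.linalg.norm(w)   # stay on the unit sphere
        return kurtosis(X @ w)      # excess kurtosis of the coefficients

    best, best_val = None, np.inf
    for _ in range(n_restarts):
        w0 = rng.normal(size=X.shape[1])       # random starting spectrum
        res = minimize(objective, w0, method="Nelder-Mead")
        if res.fun < best_val:
            best = res.x / np.linalg.norm(res.x)
            best_val = res.fun
    return best, best_val
```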

2012-08-22

cosmic-ray identification

Take a set of HST data from one filter and exposure time (to start; later we will generalize) that have been CR-split (meaning: two images at each pointing). Shift-and-difference these split images to confidently identify a large number of cosmic rays. Pull out 5x5 image patches centered on cosmic-ray-corrupted pixels and 5x5 image patches not centered on cosmic-ray-corrupted pixels. Use these labeled data as training data for a supervised method that finds cosmic rays in single-image (not-CR-split) data. Improve the value of HST data for all, and obtain an enormous financial gift from NASA in thanks (well, not really).
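Here is a sketch of the supervised step, assuming the labeled pixel lists have already been produced by the shift-and-difference step. The image and pixel lists below are placeholders, not a real HST reduction, and the random forest is just one reasonable choice of classifier:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def extract_patches(img, centers, half=2):
    """Pull out flattened (2*half+1)^2 patches centered on (y, x) pixels."""
    return np.array([img[y - half:y + half + 1, x - half:x + half + 1].ravel()
                     for y, x in centers])

# cr_pix / clean_pix would come from the shift-and-difference step;
# here they (and the image) are placeholders
rng = np.random.default_rng(0)
img = rng.normal(100.0, 5.0, size=(1024, 1024))
cr_pix = [(10, 10), (500, 700)]
clean_pix = [(300, 300), (800, 200)]

X = np.vstack([extract_patches(img, cr_pix), extract_patches(img, clean_pix)])
y = np.array([1] * len(cr_pix) + [0] * len(clean_pix))  # 1 = cosmic ray

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# now apply clf to patches from single (not-CR-split) exposures
```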

Notes: The most informative pixel patches will be those with faint cosmic ray pixels and those with bright stars that mimic cosmic rays. Some of this work has been started with (now graduated) NYU undergraduate Andrew Flockhart.