The SDSS spectra can be thought of as "labels" for objects detected in the imaging, each of which has ugriz photometry and some shape and position parameters. Can we train a model with this enormous amount of data to predict the spectra using the photometry? One thing that says "yes" is that photometric redshifts (for galaxies and quasars), photometric distances (for stars), and photometric temperatures and metallicities (for stars) all work well. One thing that says "no" is that there is far more information (in a technical sense) in the spectra than in the photometry. All this said, it is an absolutely great "Data Science" demonstration project, and it might create some new ideas for LSST-era astrophysics projects. In principle, it will also get us predictions about the spectral types and redshifts of many objects that lack spectra!
Is the Universe transparent? This issue has been interesting me for many years; the answer is "yes" of course, but just how transparent? There are many ways to look at transparency, but a talk by Corrales last week got me thinking about it again: Take some point sources, regress their images against brightness, color, and line-of-sight dust amplitude. Do you see a scattering halo that is correlated with dust? In the optical and UV, I think you won't, based on unpublished work I have done previously. However, there might be a tiny signal. Also, there is more likely to be an effect in the X-ray, which (with Chandra) is accessible. And there is abundant archival data for this project in every waveband from infrared to X-ray.
In imaging from a telescope with a secondary on a spider (for example, in HST imaging), bright stars show diffraction spikes. More generally, the outer parts of the point-spread function are related to the Fourier Transform of the small-scale features in the entrance aperture. The scale at which this Fourier Transform imprints on the focal plane is linearly related to wavelength (just as the angular size of the diffraction-limited PSF goes as wavelength over aperture).
This means that the diffraction spikes coming from stars contain low-resolution spectra of those stars! That is, you ought to be able to extract spectral information from the spikes. It won't be good, but it should permit measurements of colors or temperatures or SED slopes with even single-band imaging, and aid in star–quasar classification. Indeed, in HST press-release images, you can see that the diffraction spikes are little "rainbows" (see below).
The project is to take wide-band imaging from HST, in fields where stars have been measured either in multiple bands or else spectroscopically, and show that some of the scientific results could have been extracted from the single, wide band directly using the diffraction features.
Okay this idea is dumb but I would love to see it done: As Kepler goes around the Sun (no, not the Earth, the Sun), it is sometimes flying towards its field and sometimes away. This leads to classical stellar aberration (discovered by Bradley in the 1700s; Bradley was a genius, IMHO), which leads to a beaming effect, in which the field-of-view (or plate scale) changes with the projection of the velocity vector onto the field-center pointing vector. A measurement of this would only take a day or two of hard work, and would provide a measure of the speed of light in units of the velocity of Kepler in its orbit.
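To get a feel for the size of the effect, here is a minimal sketch. It assumes the first-order aberration result that angular separations near the field center change fractionally by about -(v/c) cos(angle), where the angle is between the velocity vector and the field-center direction. The 30 km/s orbital speed used in the example is my round-number assumption, and the function names are hypothetical.

```python
import math

C_KM_S = 299_792.458  # speed of light in km/s


def fractional_scale_change(v_km_s, angle_deg):
    """First-order fractional change in angular separations near field center.

    angle_deg is the angle between the spacecraft velocity and the
    field-center pointing direction (0 means flying straight at the field).
    """
    beta = v_km_s / C_KM_S
    return -beta * math.cos(math.radians(angle_deg))


def speed_of_light_from_scale(v_km_s, angle_deg, measured_change):
    """Invert the relation above: infer c (km/s) from a measured scale change."""
    return -v_km_s * math.cos(math.radians(angle_deg)) / measured_change


if __name__ == "__main__":
    # Assumed orbital speed of 30 km/s, flying directly towards the field.
    print(fractional_scale_change(30.0, 0.0))
```

Under these assumptions the plate-scale change is of order one part in ten thousand, modulated over the orbit by the cosine factor.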
The Kepler spacecraft is taking incredibly precise photometric data on tens of thousands of stars for the purpose of detecting exoplanets. For many reasons, the lightcurves it returns are sensitive to the temperature of the spacecraft: The focus and astrometric map (camera calibration) of the camera change with temperature, and the detector noise properties might be evolving too. This wouldn't be a problem (it's a space mission) but the spacecraft changes its sun angle abruptly to perform high-gain data downlink about once per month, and the temperature recovery profile depends on the orientation of the spacecraft post-downlink. As a result, there are sub-percent-level traces of the temperature history imprinted on every lightcurve. Each lightcurve responds to temperature differently, but each is sensitive.
Of course the spacecraft keeps housekeeping data with temperature information, but it hasn't been extremely useful for calibration purposes. Why not? The onboard temperature sensors are low in signal-to-noise or dynamic range, whereas the lightcurves are good (sometimes) at the part-in-hundred-thousand level. That is, there is far more temperature information in the lightcurves than in the direct temperature data! Here's the project:
Treat the housekeeping data about temperature as providing noisy labels on the lightcurve data. Find the properties of each lightcurve that best predict those labels. Combine information from many lightcurves to produce an extremely high signal-to-noise and precise temperature history for the spacecraft. Bonus points for constraining not just the temperature history but a thermal model too.
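Here is a toy version of that project on fake data, as a sketch only. It makes one simple choice that is mine, not a claim about what would work on real Kepler data: take the leading principal component of the mean-subtracted lightcurves as the shared time series, then calibrate it onto the noisy housekeeping labels with a straight-line fit. All the numbers (sensitivities, noise levels, the shape of the temperature history) are invented for illustration.

```python
import numpy as np


def recover_temperature(lightcurves, noisy_temps):
    """Estimate a temperature history from many lightcurves.

    lightcurves: array (n_stars, n_times) of relative fluxes.
    noisy_temps: array (n_times,) of noisy housekeeping temperatures.
    """
    # Remove each star's mean flux so only the variations remain.
    resid = lightcurves - lightcurves.mean(axis=1, keepdims=True)
    # Leading right-singular vector: the time series shared by the lightcurves.
    _, _, vt = np.linalg.svd(resid, full_matrices=False)
    shared = vt[0]
    # Calibrate the shared mode onto the noisy labels (slope and offset).
    design = np.vstack([shared, np.ones_like(shared)]).T
    coeffs, *_ = np.linalg.lstsq(design, noisy_temps, rcond=None)
    return design @ coeffs


# Fake data: a repeating thermal recovery after each (assumed) downlink.
rng = np.random.default_rng(42)
n_stars, n_times = 200, 1000
t = np.linspace(0, 30, n_times)
true_temp = 10.0 + 2.0 * np.exp(-(t % 10) / 3)
# Each star responds linearly to temperature with its own sensitivity.
sens = rng.normal(0, 1e-3, n_stars)
lightcurves = (1 + sens[:, None] * (true_temp - true_temp.mean())[None, :]
               + rng.normal(0, 1e-4, (n_stars, n_times)))
# Housekeeping sensor: right on average, but very noisy.
noisy_temps = true_temp + rng.normal(0, 1.0, n_times)

est = recover_temperature(lightcurves, noisy_temps)
rms_labels = np.sqrt(np.mean((noisy_temps - true_temp) ** 2))
rms_est = np.sqrt(np.mean((est - true_temp) ** 2))
print(rms_labels, rms_est)
```

In this toy the lightcurve-based estimate is far more precise than the labels it was calibrated against, which is the point of the project. Real lightcurves respond nonlinearly and contain stellar variability, so a single linear component is surely too simple.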
A conversation with Nick Suntzeff (TAMU) in Lawrence, KS, brought up the great idea (Nick's, not mine) to figure out why ground-based photometry of stars never gets better than a few milli-mags in precision. Seriously people, Kepler is at the part-per-million or better level. Why can't we do the same from the ground? Why not at least part-per-hundred-thousand? Is it something about the scintillation, the transparency, the point-spread function, the detector temperature, scattered light, sky emission, sky lines, what? Not sure how to proceed, but the project could make the next generation of projects orders of magnitude less expensive. I guess I would start by taking images of a star field with many different (very different) exposure times and at different twilight levels (Suntzeff's idea again). Could it be that all we need is better software?
Questions from Kilian Walsh (NYU) today reminded me of an old, abandoned idea: Look for evidence of a periodic universe (topological non-triviality) in the large-scale structure of galaxies. Papers by Starkman (CWRU) and collaborators (one of several examples is here) claim to rule out most interesting topologies using the CMB alone. I don't doubt these papers but (a) they effectively make very strong predictions for the large-scale structure and (b) if CMB (or topology) theory is messed up, maybe the constraints are over-interpreted.
The idea would be to take pairs of finite patches of the observed large-scale structure and look to see if there are shifts, rotations, and linear amplifications (to account for growth and bias evolution) that make their long-wavelength (low-pass filtered) density fields match. Density field tracers include the LRGs, the Lyman-alpha forest, and quasars. You need to use (relatively) high-redshift tracers if you want to test conceivably relevant topologies.
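As a starting point, here is a sketch of the shift-only part of that matching on a periodic two-dimensional grid. It low-pass filters a field, then uses an FFT cross-correlation to find the cyclic shift that best aligns two patches. The score is a normalized correlation, so it is insensitive to a positive linear amplification. Rotations, three dimensions, survey masks, and non-periodic patches are all left out, and the function names are hypothetical.

```python
import numpy as np


def low_pass(field, keep):
    """Zero all Fourier modes whose integer wavenumber exceeds `keep` on either axis."""
    fk = np.fft.fft2(field)
    kx = np.fft.fftfreq(field.shape[0]) * field.shape[0]
    ky = np.fft.fftfreq(field.shape[1]) * field.shape[1]
    mask = (np.abs(kx)[:, None] <= keep) & (np.abs(ky)[None, :] <= keep)
    return np.fft.ifft2(fk * mask).real


def best_shift(a, b):
    """Find the cyclic shift of `a` that best matches `b`.

    Returns (shift, score), where np.roll(a, shift, axis=(0, 1)) is the
    best match to b and score is the normalized correlation at that shift.
    """
    a0 = a - a.mean()
    b0 = b - b.mean()
    # Cross-correlation via the convolution theorem.
    xcorr = np.fft.ifft2(np.fft.fft2(b0) * np.conj(np.fft.fft2(a0))).real
    shift = np.unravel_index(np.argmax(xcorr), xcorr.shape)
    score = xcorr[shift] / (np.linalg.norm(a0) * np.linalg.norm(b0))
    return shift, score


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    a = low_pass(rng.normal(size=(64, 64)), keep=4)
    # A shifted, amplified, slightly noisy copy of the same field.
    b = 1.7 * np.roll(a, (5, 12), axis=(0, 1)) + 0.01 * rng.normal(size=a.shape)
    print(best_shift(a, b))
```

For unrelated patches the score should scatter around zero, so the distribution of scores over many random pairs gives a null against which any candidate match can be judged.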
Presumably all results would be negative; that's fine. But one nice side effect would be to find structures (for example clusters of galaxies) residing in very similar environments, and by similar I mean in terms of full three dimensional structure, not just mean density on some scale. That could be useful for testing non-linear growth of structure.
Vivi Tsalmantza and I have found many double redshifts in the SDSS spectroscopy (a few examples are published here but we have many others) by modeling quasars and galaxies with a data-driven model and then fitting new data with a mixture of two things at different redshifts. We have found that finding such things is straightforward. We have also found that among all galaxies, luminous red galaxies are the easiest to model (that's no breakthrough; it has been known for a long time).
Put these two ideas together and what have you got? An incredibly simple way to find double-redshifts of massive galaxies in spectroscopy. And the objects you find would be interesting: Rarely have double redshifts been found without emission lines (LRG spectra are almost purely stellar with no nebular lines), and because the LRGs sometimes host radio sources you might even get a Hubble-constant-measuring golden lens. For someone who knows what a spectrum is, this project is one week of coding and three weeks of CPU crushing. For someone who doesn't, it is a great learning project. If you get started, email me, because I would love to facilitate this one! I will happily provide consultation and CPU time.
We know a lot about the scalar properties of galaxies as a function of clustocentric distance: Galaxies near cluster centers tend to be redder and older and more massive and more dense than galaxies far from cluster centers. We also know a lot about the tensor properties of galaxies as a function of clustocentric distance: Background galaxies tend to be tangentially sheared and galaxies in or near the cluster have some fairly well-studied but extremely weak alignment effects. What about vector properties?
Way back in the day, star NYU undergrad Alex Quintero (now at Scripps doing oceanography, I think) and I looked at the morphologies of galaxies as a function of clustocentric position, with the hopes of finding offsets between blue and red light (say) in the direction of the cluster center. These are generically predicted if ram-pressure stripping or any other pressure effects are acting in the cluster or infall-region environments. We developed some incredibly sensitive tests, found nothing, and failed to publish (yes I know, I know).
This is worth finishing and publishing, and I would be happy to share all our secrets. It would also be worth doing some theory or simulations or interrogating some existing simulations to see more precisely what is expected. I think you can probably rule out ram-pressure stripping as a generic influence on cluster members, although maybe the simulations would say you don't expect a thing. By the way, offsets between 21-cm and optical are even more interesting, because they are seen in some cases, and are more directly relevant to the question. However, it is a bit harder to assemble the unbiased data you need to perform a sensitive experiment.
Although the Nobel Prize last year went for the accelerated expansion of the Universe, in fact acceleration is not a many-sigma result. What is a many-sigma result is that the expansion is not decelerating by as much as it should be given the mass density. This raises the question: Could gravity be weaker than expected on cosmological scales? Models with, say, an exponential cutoff of the gravitational force law at long distances are theoretically ugly (they are like massive graviton theories and usually associated with various pathologies) but as empirical objects they are nice: A model with an exponentially suppressed force law at large distance is predictive and simple.
The idea is to compute the detailed expansion history and linear growth factor (for structure formation) for a homogeneous and isotropic universe and compare to existing data. By how much is this ruled out relative to a cosmological-constant model? The answer may be "a lot" but if it is only by a few sigma, then I think it would be an interesting straw-man. For one, it has the same number of free parameters (one length scale instead of one cosmological constant). For two, it would sharpen up the empirical basis for acceleration. For three, it would exercise an idea I would like to promote: Let's choose models on the joint basis of theoretical reasonableness and computability, not theoretical reasonableness alone! If we had spent the history of physics with theoretical niceness as our top priority, we would never have got the Bohr atom or quantum mechanics!
One amusing note is that if gravity does cut off at large scales, then in the very distant future, the Universe will evolve into an inhomogeneous fractal. Fractal-like inhomogeneity is something I have argued against for the present-day Universe.
After a talk by Matias Zaldarriaga (IAS) about making simulations faster, I had the following possibly stupid idea: It is possible to speed up simulations of cosmological structure formation by simulating not the full growth of structure, but just the departures away from a linear or quadratic approximation to that growth. As structure grows, smooth initial conditions condense into very high-resolution and informative structure. First observation: That growth looks like some kind of deconvolution. Second: The better you can approximate it with fast tools, the faster you can simulate (in principle) the departures or errors in the approximation. So let's fire up some machine learning!
The idea is to take the initial conditions, the result of linear perturbation theory, the result of second-order perturbation theory, and a full-up simulation, and try to infer each thing from the other (with some flexible model, like a huge, sparse linear model, or some mixture of linear models or somesuch). Train up and see if we can beat other kinds of approximations in speed or accuracy. Then see if we can use it as a basis for speeding full-precision simulations. Warning: If you don't do this carefully, you might end up learning something about gravitational collapse in the Universe! My advice, if you want to get started, is to ask Zaldarriaga for the inputs and outputs he used, because he is sitting on the ideal training sets for this, and may be willing to share.
For many problems, the computer scientists tell us to use expectation maximization. For example, in fitting a distribution with a mixture of Gaussians, EM is the bee's knees, apparently. This surprises me, because the EM optimization is so slow and predictable; I am guessing that a more aggressive optimization might beat it. Of course a more aggressive optimization might not be protected by the same guarantees as EM (which is super stable, even in high dimensions). It would be a service to humanity to investigate this and report places where EM can be beaten. Of course this may all have been done; I would ask my local experts before embarking.
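Any such comparison needs a baseline, so here is a minimal sketch of plain EM for a one-dimensional mixture of Gaussians. It records the log-likelihood at every iteration, which is the curve a more aggressive optimizer would have to beat. The quantile initialization and the fixed iteration count are my assumptions for the sake of a short example.

```python
import numpy as np


def em_gmm_1d(x, n_components=2, n_iter=200):
    """Fit a 1-D Gaussian mixture by EM.

    Returns (weights, means, variances, log_likes), where log_likes[i] is
    the log-likelihood of the parameters at the start of iteration i.
    """
    k = n_components
    # Deterministic initialization: means at evenly spaced data quantiles.
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    log_likes = []
    for _ in range(n_iter):
        # E step: responsibility of each component for each data point.
        dens = (weights * np.exp(-0.5 * (x[:, None] - means) ** 2 / variances)
                / np.sqrt(2 * np.pi * variances))
        total = dens.sum(axis=1)
        log_likes.append(np.log(total).sum())
        resp = dens / total[:, None]
        # M step: responsibility-weighted moments.
        nk = resp.sum(axis=0)
        weights = nk / x.size
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
    return weights, means, variances, np.array(log_likes)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = np.concatenate([rng.normal(-2, 0.5, 1000), rng.normal(3, 1.0, 1000)])
    w, m, v, ll = em_gmm_1d(data)
    print(w, m, v, ll[-1])
```

The log-likelihood sequence never decreases (up to rounding), which is the guarantee mentioned above. A fair test of an alternative would compare log-likelihood reached against wall-clock time, from the same initialization.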
The Gaia mission needs to centroid stars with accuracies at the 10^-3-pixel level. At the same time, the detector will be affected by charge-transfer inefficiency degradation as the instrument is battered by cosmic radiation; this causes significant magnitude-dependent centroid shifts. The team has been showing that with reasonable models of charge-transfer inefficiency, they can reach their scientific goals. One question I am interested in (a boring but very important question) is whether it is possible to figure out and fix the CTI issues without a good model up-front. (I am anticipating that the model won't be accurate, although the team is analyzing lab CCDs subject to sensible, realistic damage.) The shape and magnitude of the effects on the point-spread function and positional offsets will be a function of stellar magnitude (brightness) and position on the chip. They might also have something to do with what stars have crossed the chip in advance of the current star. The idea is to build a non-trivial fake data stream and then analyze it without knowing what was put in: Can you recover and model all the effects at sufficient precision after learning the time-evolving non-trivial model on the science data themselves? The answer, which I expect to be "yes", has implications for Gaia and every precision experiment to follow.
In order to work on such subjects I built a one-dimensional (yes the sky is a circle, not a 2-sphere) Gaia simulator. It currently doesn't do what is needed, so fork it and start coding! Or build your own. Or get serious and make a full mission simulator. But my point is not "Will Gaia work?"; it is "Can we make Gaia analysis less dependent on mechanistic CCD models?" In the process we might make it more precise overall. Enhanced goal: Analyze all of Gaia's mission choices with the model.
At coffee this morning, Christopher Stumm (Etsy), Dan Foreman-Mackey (NYU), and I worked up the following idea of Stumm's: Every week, on a blog or (I prefer) in a short arXiv-only white paper, one refereed paper is taken from the scientific literature and its results are reproduced, as well as possible, given the content of the paper and the available data. I expect almost every paper to fail (that is, not be reproducible), of course, because almost every paper contains proprietary code or data or else is too vague to specify what was done. The astronomical literature is particularly interesting for this because many papers are based on public data; for those it comes down only to code and procedures; indeed I remember Bob Hanisch (STScI) giving a talk at ADASS showing that it is very hard to reproduce the results of typical papers based on HST data, despite the fact that all the data and almost all the code people use on them are public.
Stumm, Foreman-Mackey, and I discussed economic models and incentive models to make this happen. I think whoever did this would succeed scientifically, if he or she did it well, both because it would have huge impact and because it would create many new insights. But on the other hand it would take significant guts and a hell of a lot of time. If you want to do it, sign me up as one of your reproducibility agents! I think anyone involved would learn a huge amount about the science (more than they learn about reproducibility). In the end, it is the community that would benefit most, though. Radical!
When we share astronomical images, we expect the images to have standards-compliant descriptions of their astrometric calibration (the mapping between image position and sky position) in their headers. Naturally, it is just as important to have descriptions of the point-spread function, for almost any astronomical activity (like photometry, source matching, or color measurement). And yet we have no standards. (Even the WCS standard for astrometry is seriously out of date.) Develop a PSF standard!
Requirements include: It should be very flexible. It should permit variations of the PSF with position in the image. It should have a specified relationship between the stellar position and the position of the mean, median, or mode of the PSF itself. That latter point relates to the fact that astrometric distortions can be sucked up into PSF variations if you permit the mode of the PSF to drift relative to the star position. I like that freedom, but whether you permit it or not it should be explicit.
Let me say at the outset that I don't think that imputing missing data is a good idea in general. However, missing-data imputation is a form of cross-validation that provides a very good test of models or methods. My suggestion would be to take a large number of spectra (say stars or galaxies in SDSS), censor patches (multi-pixel segments) of them randomly, saving the censored patches. Build data-driven models using the uncensored data by means of PCA, HMF, mixture-of-Gaussians EM, and XD, at different levels of complexity (different numbers of components). Compare in their ability to reconstruct the censored data. Then use the best of the methods as your spectral models for, for example, redshift identification! Now that I type that I realize the best target data are the LRGs in SDSS-III BOSS, where the (low) redshift failure rate could be pushed lower with a better model. Advanced goal: Go hierarchical and infer/understand priors too.
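Here is a sketch of the censored-patch test for the PCA case only, on fake spectra. It simplifies the setup in one important way, which is my assumption: the model is trained on a separate set of complete spectra, and patches are censored only in the test spectra. Training on spectra that themselves have holes needs a method that handles missing data, which plain PCA does not. The fake spectra are three sinusoidal components plus noise.

```python
import numpy as np


def pca_basis(train, n_comp):
    """Mean spectrum and the leading n_comp principal components."""
    mean = train.mean(axis=0)
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:n_comp]


def reconstruct_censored(spectrum, observed, mean, basis):
    """Fit coefficients on observed pixels only, then predict every pixel."""
    coeffs, *_ = np.linalg.lstsq(basis[:, observed].T,
                                 (spectrum - mean)[observed], rcond=None)
    return mean + coeffs @ basis


def censored_rms(test, masks, mean, basis):
    """RMS prediction error on the censored pixels of all test spectra."""
    errs = []
    for spec, observed in zip(test, masks):
        model = reconstruct_censored(spec, observed, mean, basis)
        errs.append((model - spec)[~observed])
    return np.sqrt(np.mean(np.concatenate(errs) ** 2))


# Fake spectra: three smooth components plus pixel noise of 0.05.
rng = np.random.default_rng(3)
n_pix = 200
grid = np.linspace(0, 1, n_pix)
true_basis = np.array([np.sin(2 * np.pi * k * grid) for k in (1, 2, 3)])


def fake_spectra(n):
    return rng.normal(size=(n, 3)) @ true_basis + 0.05 * rng.normal(size=(n, n_pix))


train, test = fake_spectra(300), fake_spectra(100)
# Censor one random 20-pixel patch per test spectrum (True means observed).
masks = np.ones(test.shape, dtype=bool)
for row in masks:
    start = rng.integers(0, n_pix - 20)
    row[start:start + 20] = False

scores = {k: censored_rms(test, masks, *pca_basis(train, k)) for k in (1, 3, 10)}
print(scores)
```

The same `censored_rms` scoring could be applied to any other model that can predict censored pixels from observed ones, which is what makes the comparison across methods and complexities fair.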
Data-driven models tend to be very naive about noise. Jo Bovy (IAS) built a great data-driven model of the quasar population that makes use of our highly vetted photometric noise model, to produce the best-performing photometric redshift system for quasars (that I know of). This has been a great success of Bovy's extreme deconvolution (XD) hierarchical distribution modeling code. Let's do this again but for galaxies!
We know more about galaxies than we do quasars—so maybe a data-driven model doesn't make much sense—but we also know that data-driven models (even ones that don't take account of the noise) perform comparably well to theory-driven models, when it comes to galaxy photometric redshift prediction. So a data-driven model that takes account of the noise might kick ass. This was strongly recommended to me by Emmanuel Bertin (IAP). In other news, Bernhard Schölkopf (MPI-IS) opined to me that it might be the causal nature of the XD model that makes it so effective. I guess that's a non-sequitur.
Here at Astrometry.net headquarters we get a lot of images of the night sky where the exposure is long and the stars have trailed into partial circular arcs. If we could "de-blur" these into images of the sky, this would be great: Every one of these trailed images would provide a photometric measurement of every star. Advanced goal: Every one of these trailed images would provide a photometric light curve of every star. That would be sweet! Not sure if this is really research, but it would be cool.
The problem is easy, because every star traverses the same angle in a circle with the same center. Easy! But the problem is hard because the images are generally taken with cameras that have substantial field distortions (distortions in the focal plane away from a pure tangent-plane projection of the sky). Still, it seems totally do-able!
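The easy half of the problem can be sketched as a linear forward model. Assuming no field distortion, a delta-function PSF, and known star positions at the start of the exposure (all my simplifying assumptions; the positions might come from a catalog once the image is calibrated), each star contributes a unit-flux arc template, and the fluxes follow from linear least squares. The function names are hypothetical.

```python
import numpy as np


def arc_template(shape, start_xy, center_xy, arc_rad, n_samples=2000):
    """Unit-flux image of a point source swept about center_xy through arc_rad."""
    dx = start_xy[0] - center_xy[0]
    dy = start_xy[1] - center_xy[1]
    theta = np.linspace(0, arc_rad, n_samples)
    # Rotate the starting offset about the center by each angle.
    x = center_xy[0] + dx * np.cos(theta) - dy * np.sin(theta)
    y = center_xy[1] + dx * np.sin(theta) + dy * np.cos(theta)
    ix = np.rint(x).astype(int)
    iy = np.rint(y).astype(int)
    inside = (ix >= 0) & (ix < shape[1]) & (iy >= 0) & (iy < shape[0])
    img = np.zeros(shape)
    # Deposit equal flux at each sample (nearest pixel); np.add.at accumulates repeats.
    np.add.at(img, (iy[inside], ix[inside]), 1.0 / n_samples)
    return img


def fit_fluxes(image, starts, center_xy, arc_rad):
    """Least-squares flux for each star, given the shared center and arc angle."""
    templates = np.array([arc_template(image.shape, s, center_xy, arc_rad).ravel()
                          for s in starts])
    fluxes, *_ = np.linalg.lstsq(templates.T, image.ravel(), rcond=None)
    return fluxes


if __name__ == "__main__":
    rng = np.random.default_rng(5)
    shape, center, arc = (128, 128), (64.0, 64.0), 0.3
    starts = [(100.0, 64.0), (64.0, 20.0), (30.0, 40.0)]
    truth = np.array([100.0, 250.0, 40.0])
    image = sum(f * arc_template(shape, s, center, arc) for f, s in zip(truth, starts))
    image = image + 0.01 * rng.normal(size=shape)
    print(fit_fluxes(image, starts, center, arc))
```

The hard half is exactly what this leaves out: with real field distortions the arcs are not circles in pixel space, so the template generator would need the distortion model (and the PSF) folded in.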
Pedants beware: Of course I know that it is the Earth rotating and not the sky rotating! But yes, I have made that pedantic point on occasion too.
In Holmes et al 2012 (new version coming soon) we showed practical methods for designing an imaging survey for high-quality photometric calibration: You don't need a separate calibration program (separate from the main science program) if you design it our way. This is like a scalar calibration: We are asking "What is the sensitivity at every location in the focal plane?" We could have asked "What is the astrometric distortion away from a tangent-plane at every location in the focal plane?", which is a vector calibration question, or we could have asked "What is the point-spread function at every location in the focal plane?", which is a tensor calibration question. Of course the astrometry and PSF vary with time in ground-based surveys, but for space-based surveys these are relevant self-calibration questions. We learned in the above-cited paper that certain kinds of redundancy and non-redundancy make scalar calibration work, but the requirements will go up as the rank of the calibration goes up too. So repeat for these higher-order calibrations! Whatever you do might be highly relevant for Euclid or WFIRST, which both depend crucially on the ability to calibrate precisely. Even ground-based surveys, though dominated by atmospheric effects, might have fixed distortions in the WCS and PSF that a good survey strategy could uncover better than any separate calibration program.
The Astrometry.net system sees a huge amount of heterogeneous data, from wide-field snapshots to very narrow-field professional images, to all-sky fish-eye cloud cameras. Any image that is successfully calibrated by the system has been matched to a database of four-star figures (quads) and then verified probabilistically using all the stars in the image and in the USNO-B1.0 Catalog in that region (down to some effective magnitude cut). Of course the quad index and the catalog are both suspect, in the sense that they both contain stars that are either non-existent or else have wrong properties. The amusing thing is that we could construct a graph in which the nodes are catalog entries and the edges are instances in which pairs of stars have been observed in the same image.
This graph would contain an enormous amount of information about the sky. For example, the network could be used to create a brightness ordering of stars on the sky, which would be amusing. But more importantly for us, the covisibility information would tell us what pairs of stars we should be using together in quads, and what pairs we shouldn't. That analysis would take account not just of their relative magnitudes, but also the typical angular scales of the images in which stars of that magnitude tend to be detected. It would also identify (as nodes with few or no edges) catalog entries that don't correspond to stars, and groups of catalog entries that are created by certain kinds of artifacts (like handwriting on the photographic plates, etc) that generate certain kinds of false positive matches in our calibrations.
This idea was first suggested to Dustin Lang (CMU) and me by Sven Dickinson (Toronto) at Lang's PhD defense. Advanced goal: Make a directed graph, with arrows going from brighter to fainter. Then use statistics of edge directions to do a better job on brightness ranking and also classify images by bandpass, etc. Even more advanced goal: Evolve away from star catalogs to covisible-asterism catalogs! At the bright end (first or second magnitude), we might be able to propose a better set of constellations.
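The undirected version of the graph is simple enough to sketch with the standard library. The input format here (one list of matched catalog IDs per calibrated image) is a hypothetical stand-in for whatever the real system records, and counting all pairs is quadratic in the number of stars per image, so this is only for getting started.

```python
from collections import Counter
from itertools import combinations


def covisibility_graph(images):
    """Count, for every pair of catalog entries, the images containing both.

    images: iterable of iterables of catalog IDs matched in each image.
    Returns a Counter mapping a sorted (id_a, id_b) pair to its image count.
    """
    edges = Counter()
    for stars in images:
        # De-duplicate and sort so each pair has one canonical key.
        for pair in combinations(sorted(set(stars)), 2):
            edges[pair] += 1
    return edges


def isolated_entries(catalog_ids, edges):
    """Catalog entries with no edges: candidates for non-existent stars."""
    connected = {star for pair in edges for star in pair}
    return sorted(set(catalog_ids) - connected)


if __name__ == "__main__":
    images = [["a", "b", "c"], ["b", "c"], ["c", "d"]]
    edges = covisibility_graph(images)
    print(edges.most_common(3))
    print(isolated_entries(["a", "b", "c", "d", "e"], edges))
```

An entry is only suspicious if it has no edges despite lying in well-imaged sky, so a real analysis would need to normalize by how many images covered each catalog position.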
Here's an old one from the vault: Plot the surface brightness of early-type galaxies (red, dead) as a function of ellipticity and show that surface brightness rises with ellipticity. This is what is expected if early-type galaxies are transparent and oblate. I know from nearly completing this project many years ago that this will work well for lower-luminosity early types and badly for higher-luminosity early types. The cool thing is that, under the oblate assumption, the true three-dimensional axis-ratio and three-dimensional central stellar density distribution function can be inferred from the observed two-dimensional distributions under the (weak) assumption of isotropy of the observations. That assumption isn't perfectly true but it is close. You can use high signal-to-noise imaging and SDSS spectroscopy to do the object selection, so observational noise in selection and measurement won't pose big problems.
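The expected trend can be sketched in a few lines. For a transparent oblate spheroid with intrinsic axis ratio q viewed at inclination i (with i = 0 face-on), the standard projection result is an apparent axis ratio of sqrt(cos^2 i + q^2 sin^2 i). The luminosity is fixed while the projected area scales with the apparent axis ratio, so the mean surface brightness scales as its inverse. The single intrinsic axis ratio of 0.6 in the example is an arbitrary illustrative choice.

```python
import numpy as np


def apparent_axis_ratio(q_intrinsic, inclination_rad):
    """Projected axis ratio of an oblate spheroid (inclination 0 is face-on)."""
    return np.sqrt(np.cos(inclination_rad) ** 2
                   + q_intrinsic ** 2 * np.sin(inclination_rad) ** 2)


def relative_surface_brightness(q_intrinsic, inclination_rad):
    """Mean surface brightness relative to the face-on view.

    Transparent galaxy: same luminosity, projected area proportional to
    the apparent axis ratio.
    """
    return 1.0 / apparent_axis_ratio(q_intrinsic, inclination_rad)


if __name__ == "__main__":
    rng = np.random.default_rng(7)
    # Isotropic viewing angles: cos(i) is uniform on [0, 1].
    incl = np.arccos(rng.uniform(0, 1, 5000))
    ellipticity = 1 - apparent_axis_ratio(0.6, incl)
    sb = relative_surface_brightness(0.6, incl)
    print(np.corrcoef(ellipticity, sb)[0, 1])
```

A real population has a distribution of intrinsic axis ratios and central densities, and inverting the observed two-dimensional distribution for those is the actual inference problem described above.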
This is another Scott Tremaine (IAS) project. Mike Blanton (NYU) and I basically did this many years ago with SDSS data, but we never took it through the last mile to publication, so it is wide open. Actually, it seems likely that someone has done this previously, so start with a literature search! Bonus points: Figure out what's up with the high-luminosity early types. They are either triaxial or a mix of oblate and prolate.