2012-11-25

limits to ground-based photometry?

A conversation with Nick Suntzeff (TAMU) in Lawrence, KS, brought up the great idea (Nick's, not mine) to figure out why ground-based photometry of stars never gets better than a few milli-mags in precision. Seriously people, Kepler is at the part-per-million or better level. Why can't we do the same from the ground? Why not at least part-per-hundred-thousand? Is it something about the scintillation, the transparency, the point-spread function, the detector temperature, scattered light, sky emission, sky lines, what? Not sure how to proceed, but the project could make the next generation of projects orders of magnitude less expensive. I guess I would start by taking images of a star field with many different (very different) exposure times and at different twilight levels (Suntzeff's idea again). Could it be that all we need is better software?

2012-10-24

find or rule out a periodic universe (via structure)

Questions from Kilian Walsh (NYU) today reminded me of an old, abandoned idea: Look for evidence of a periodic universe (topological non-triviality) in the large-scale structure of galaxies. Papers by Starkman (CWRU) and collaborators (one of several examples is here) claim to rule out most interesting topologies using the CMB alone. I don't doubt these papers but (a) they effectively make very strong predictions for the large-scale structure and (b) if CMB (or topology) theory is messed up, maybe the constraints are over-interpreted.

The idea would be to take pairs of finite patches of the observed large-scale structure and look to see if there are shifts, rotations, and linear amplifications (to account for growth and bias evolution) that make their long-wavelength (low-pass filtered) density fields match. Density field tracers include the LRGs, the Lyman-alpha forest, and quasars. You need to use (relatively) high-redshift tracers if you want to test conceivably relevant topologies.
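
To make that concrete, here is a minimal sketch (Python, with placeholder arrays) of the patch-comparison step: low-pass filter two over-density grids, search over relative shifts with an FFT cross-correlation, and solve for the best linear amplitude. Rotations would be layered on by brute force; `patch_a` and `patch_b` are hypothetical gridded density patches, not real data.

```python
# Sketch: compare two low-pass-filtered density patches over all cyclic shifts
# and a free linear amplitude, via FFT cross-correlation.
# `patch_a` and `patch_b` are hypothetical (N, N, N) over-density grids.
import numpy as np
from scipy.ndimage import gaussian_filter

def best_shift_match(patch_a, patch_b, smooth_pix=4.0):
    """Return (residual, best shift, best amplitude) after low-pass filtering."""
    a = gaussian_filter(patch_a, smooth_pix)
    b = gaussian_filter(patch_b, smooth_pix)
    a, b = a - a.mean(), b - b.mean()
    # cross-correlation over all cyclic shifts via FFTs
    xcorr = np.fft.ifftn(np.fft.fftn(a) * np.conj(np.fft.fftn(b))).real
    shift = np.unravel_index(np.argmax(xcorr), a.shape)
    # best-fit amplitude c minimizing |a - c * shifted(b)|^2
    b_shift = np.roll(b, shift, axis=(0, 1, 2))
    c = (a * b_shift).sum() / (b_shift * b_shift).sum()
    resid = ((a - c * b_shift) ** 2).mean()
    return resid, shift, c
```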

Presumably all results would be negative; that's fine. But one nice side effect would be to find structures (for example clusters of galaxies) residing in very similar environments, and by similar I mean in terms of full three dimensional structure, not just mean density on some scale. That could be useful for testing non-linear growth of structure.

2012-10-21

find LRG-LRG double redshifts

Vivi Tsalmantza and I have found many double redshifts in the SDSS spectroscopy (a few examples are published here but we have many others) by modeling quasars and galaxies with a data-driven model and then fitting new data with a mixture of two things at different redshifts. We have found that finding such things is straightforward. We have also found that among all galaxies, luminous red galaxies are the easiest to model (that's no breakthrough; it has been known for a long time).

Put these two ideas together and what have you got? An incredibly simple way to find double-redshifts of massive galaxies in spectroscopy. And the objects you find would be interesting: Rarely have double redshifts been found without emission lines (LRG spectra are almost purely stellar with no nebular lines), and because the LRGs sometimes host radio sources you might even get a Hubble-constant-measuring golden lens. For someone who knows what a spectrum is, this project is one week of coding and three weeks of CPU crushing. For someone who doesn't, it is a great learning project. If you get started, email me, because I would love to facilitate this one! I will happily provide consultation and CPU time.
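
For concreteness, here is a toy sketch of the two-redshift fit I have in mind: fit the observed spectrum as a linear combination of one template placed at two different trial redshifts, scanning a grid of redshift pairs. The `template` function and all numbers are placeholders, not our actual data-driven model.

```python
# Sketch: fit an observed spectrum as a mixture of one template placed at two
# different trial redshifts, scanning a grid of redshift pairs.
# `template(z, wave)` and the data arrays are hypothetical placeholders.
import numpy as np

def template(z, wave):
    """Placeholder: a rest-frame LRG-like template evaluated at observed-frame
    wavelengths `wave` for redshift z (here just a toy continuum bump)."""
    rest = wave / (1.0 + z)
    return np.exp(-0.5 * ((rest - 4000.0) / 500.0) ** 2)

def two_redshift_chi2(wave, flux, ivar, z1, z2):
    A = np.vstack([template(z1, wave), template(z2, wave)]).T   # (npix, 2) design matrix
    amps, *_ = np.linalg.lstsq(A * np.sqrt(ivar)[:, None],
                               flux * np.sqrt(ivar), rcond=None)
    model = A @ amps
    return np.sum(ivar * (flux - model) ** 2), amps

# scan a coarse grid of (z1, z2) pairs and keep the best chi-squared
wave = np.linspace(3800.0, 9200.0, 3000)
flux = template(0.3, wave) + 0.4 * template(0.55, wave) + 0.01 * np.random.randn(wave.size)
ivar = np.full(wave.size, 1.0 / 0.01 ** 2)
zgrid = np.arange(0.1, 0.8, 0.01)
best = min(((two_redshift_chi2(wave, flux, ivar, z1, z2)[0], z1, z2)
            for i, z1 in enumerate(zgrid) for z2 in zgrid[i + 1:]))
print("best (chi2, z1, z2):", best)
```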

2012-10-10

find or rule out ram pressure stripping in galaxy clusters

We know a lot about the scalar properties of galaxies as a function of clustocentric distance: Galaxies near cluster centers tend to be redder and older and more massive and more dense than galaxies far from cluster centers. We also know a lot about the tensor properties of galaxies as a function of clustocentric distance: Background galaxies tend to be tangentially sheared and galaxies in or near the cluster have some fairly well-studied but extremely weak alignment effects. What about vector properties?

Way back in the day, star NYU undergrad Alex Quintero (now at Scripps doing oceanography, I think) and I looked at the morphologies of galaxies as a function of clustocentric position, with the hopes of finding offsets between blue and red light (say) in the direction of the cluster center. These are generically predicted if ram-pressure stripping or any other pressure effects are acting in the cluster or infall-region environments. We developed some incredibly sensitive tests, found nothing, and failed to publish (yes I know, I know).

This is worth finishing and publishing, and I would be happy to share all our secrets. It would also be worth doing some theory or simulations or interrogating some existing simulations to see more precisely what is expected. I think you can probably rule out ram-pressure stripping as a generic influence on cluster members, although maybe the simulations would say you don't expect a thing. By the way, offsets between 21-cm and optical are even more interesting, because they are seen in some cases, and are more directly relevant to the question. However, it is a bit harder to assemble the unbiased data you need to perform a sensitive experiment.

2012-10-09

cosmology with finite-range gravity

Although the Nobel Prize last year went for the accelerated expansion of the Universe, in fact acceleration is not a many-sigma result. What is a many-sigma result is that the expansion is not decelerating by as much as it should be given the mass density. This raises the question: Could gravity be weaker than expected on cosmological scales? Models with, say, an exponential cutoff of the gravitational force law at long distances are theoretically ugly (they are like massive graviton theories and usually associated with various pathologies) but as empirical objects they are nice: A model with an exponentially suppressed force law at large distance is predictive and simple.

The idea is to compute the detailed expansion history and linear growth factor (for structure formation) for a homogeneous and isotropic universe and compare to existing data. By how much is this ruled out relative to a cosmological-constant model? The answer may be a lot but if it is only by a few sigma, then I think it would be an interesting straw-man. For one, it has the same number of free parameters (one length scale instead of one cosmological constant). For two, it would sharpen up the empirical basis for acceleration. For three, it would exercise an idea I would like to promote: Let's choose models on the joint basis of theoretical reasonableness and computability, not theoretical reasonableness alone! If we had spent the history of physics with theoretical niceness as our top priority, we would never have got the Bohr atom or quantum mechanics!
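
Here is a toy sketch of the growth-factor half of the calculation, under assumptions that are entirely mine: a Yukawa-suppressed force gives each Fourier mode an effective Newton constant G_eff(k)/G = (k lambda)^2 / (1 + (k lambda)^2), and I integrate the linear growth equation in a matter-only background just to show the machinery. A serious version needs the modified expansion history too.

```python
# Toy sketch (my assumptions, not a worked-out theory): linear growth of one
# Fourier mode when gravity has a Yukawa cutoff at comoving scale lam.
import numpy as np
from scipy.integrate import solve_ivp

def growth_factor(k, lam):
    """Growth of one mode (wavenumber k) to a=1 in a toy matter-only background,
    with G_eff/G = 1 / (1 + 1/(k*lam)^2). Same (comoving Mpc) units for k, lam."""
    geff = 1.0 / (1.0 + 1.0 / (k * lam) ** 2)
    def rhs(a, y):
        delta, ddelta = y
        # Einstein-de Sitter form: delta'' + (3/2a) delta' = (3/2a^2) geff delta
        return [ddelta, -1.5 / a * ddelta + 1.5 / a ** 2 * geff * delta]
    a0 = 1e-3
    sol = solve_ivp(rhs, (a0, 1.0), [a0, 1.0], rtol=1e-8)
    return sol.y[0, -1]

lam = 100.0   # force-law cutoff scale; 100 comoving Mpc is an arbitrary choice
for k in [0.001, 0.01, 0.1, 1.0]:
    print(k, growth_factor(k, lam) / growth_factor(k, np.inf))
```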

One amusing note is that if gravity does cut off at large scales, then in the very distant future, the Universe will evolve into an inhomogeneous fractal. Fractal-like inhomogeneity is something I have argued against for the present-day Universe.

2012-10-06

cosmological simulation as deconvolution

After a talk by Matias Zaldarriaga (IAS) about making simulations faster, I had the following possibly stupid idea: It is possible to speed up simulations of cosmological structure formation by simulating not the full growth of structure, but just the departures away from a linear or quadratic approximation to that growth. As structure grows, smooth initial conditions condense into very high-resolution and informative structure. First observation: That growth looks like some kind of deconvolution. Second: The better you can approximate it with fast tools, the faster you can simulate (in principle) the departures or errors in the approximation. So let's fire up some machine learning!

The idea is to take the initial conditions, the result of linear perturbation theory, the result of second-order perturbation theory, and a full-up simulation, and try to infer each thing from the other (with some flexible model, like a huge, sparse linear model, or some mixture of linear models or somesuch). Train up and see if we can beat other kinds of approximations in speed or accuracy. Then see if we can use it as a basis for speeding full-precision simulations. Warning: If you don't do this carefully, you might end up learning something about gravitational collapse in the Universe! My advice, if you want to get started, is to ask Zaldarriaga for the inputs and outputs he used, because he is sitting on the ideal training sets for this, and may be willing to share.
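
As a minimal sketch of the machine-learning step (placeholder arrays, and a plain ridge regression standing in for whatever flexible model you prefer): learn a map from local patches of the approximate field to the corresponding voxel of the full simulation.

```python
# Sketch: learn a linear map from local patches of an approximate (e.g.,
# linear-theory) density field to the corresponding voxel of the full
# simulation. `rho_approx` and `rho_full` are hypothetical (N, N, N) arrays
# that would come from matched initial conditions.
import numpy as np
from sklearn.linear_model import Ridge

def make_patches(field, half=2):
    """Extract flattened (2*half+1)^3 patches around every interior voxel."""
    n = field.shape[0]
    X, idx = [], []
    for i in range(half, n - half):
        for j in range(half, n - half):
            for k in range(half, n - half):
                X.append(field[i-half:i+half+1, j-half:j+half+1, k-half:k+half+1].ravel())
                idx.append((i, j, k))
    return np.array(X), idx

rng = np.random.default_rng(0)
rho_approx = rng.normal(size=(32, 32, 32))       # placeholder for 2LPT output
rho_full = rho_approx + 0.1 * rho_approx ** 2    # placeholder for N-body output

X, idx = make_patches(rho_approx)
y = np.array([rho_full[i, j, k] for i, j, k in idx])
model = Ridge(alpha=1.0).fit(X, y)
print("training R^2:", model.score(X, y))   # a nonlinear regressor would do better
```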

2012-10-03

compare EM to other optimization algorithms

For many problems, the computer scientists tell us to use expectation maximization. For example, in fitting a distribution with a mixture of Gaussians, EM is the bee's knees, apparently. This surprises me, because the EM optimization is so slow and predictable; I am guessing that a more aggressive optimization might beat it. Of course a more aggressive optimization might not be protected by the same guarantees as EM (which is super stable, even in high dimensions). It would be a service to humanity to investigate this and report places where EM can be beat. Of course this may all have been done; I would ask my local experts before embarking.
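
Here is a toy version of the comparison, on a one-dimensional two-Gaussian mixture: scikit-learn's EM-based GaussianMixture against a generic optimizer (BFGS) maximizing the same log-likelihood. This is an illustration of the setup, not a benchmark.

```python
# Toy comparison: EM (scikit-learn GaussianMixture) vs. a generic optimizer
# (scipy BFGS) maximizing the same 1-D two-Gaussian mixture log-likelihood.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 700), rng.normal(3.0, 0.5, 300)])

def negloglike(p):
    # p = [logit of weight, mu1, mu2, log sigma1, log sigma2]
    w = 1.0 / (1.0 + np.exp(-p[0]))
    like = w * norm.pdf(x, p[1], np.exp(p[3])) + (1 - w) * norm.pdf(x, p[2], np.exp(p[4]))
    return -np.sum(np.log(like + 1e-300))

em = GaussianMixture(n_components=2, tol=1e-8, max_iter=1000).fit(x[:, None])
opt = minimize(negloglike, x0=[0.0, -1.0, 1.0, 0.0, 0.0], method="BFGS")
print("EM   log-likelihood:", em.score(x[:, None]) * x.size)
print("BFGS log-likelihood:", -opt.fun)
```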

2012-09-19

Can you fix charge-transfer inefficiency without a theory-driven model?

The Gaia mission needs to centroid stars with accuracies at the 10^-3-pixel level. At the same time, the detector will be affected by charge-transfer inefficiency degradation as the instrument is battered by cosmic radiation; this causes significant magnitude-dependent centroid shifts. The team has been showing that with reasonable models of charge-transfer inefficiency, they can reach their scientific goals. One question I am interested in—a boring but very important question—is whether it is possible to figure out and fix the CTI issues without a good model up-front. (I am anticipating that the model won't be accurate, although the team is analyzing lab CCDs subject to sensible, realistic damage.) The shape and magnitude of the effects on the point-spread function and positional offsets will be a function of stellar magnitude (brightness) and position on the chip. They might also have something to do with what stars have crossed the chip in advance of the current star. The idea is to build a non-trivial fake data stream and then analyze it without knowing what was put in: Can you recover and model all the effects at sufficient precision after learning the time-evolving non-trivial model on the science data themselves? The answer—which I expect to be yes—has implications for Gaia and every precision experiment to follow.
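
Here is a toy, all numbers invented, of what I mean by learning the model on the science data: the observed one-dimensional centroids are true position plus (radiation dose at the observation time) times (a magnitude-dependent CTI shift), and we solve jointly for the star positions and a per-magnitude-bin shift by linear least squares, without assuming any functional form for the shift.

```python
# Toy self-calibration sketch (all numbers invented): observed 1-D centroids
# are true position + dose(t) * magnitude-dependent CTI shift. Solve jointly
# for star positions and per-magnitude-bin shifts by linear least squares.
import numpy as np

rng = np.random.default_rng(2)
n_star, n_obs = 200, 40
mag = rng.uniform(13.0, 20.0, n_star)
x_true = rng.uniform(0.0, 1000.0, n_star)
true_shift = 0.02 * (mag - 13.0)             # hidden "truth": pixels per unit dose

t = rng.uniform(0.0, 1.0, (n_star, n_obs))   # mission phase of each observation
dose = t                                     # dose grows linearly with time
x_obs = (x_true[:, None] + dose * true_shift[:, None]
         + 0.01 * rng.normal(size=(n_star, n_obs)))

# design matrix: one column per star position, one per magnitude bin (shift)
n_bin = 10
bins = np.digitize(mag, np.linspace(13.0, 20.0, n_bin + 1)[1:-1])
A = np.zeros((n_star * n_obs, n_star + n_bin))
b = x_obs.ravel()
for s in range(n_star):
    sl = slice(s * n_obs, (s + 1) * n_obs)
    A[sl, s] = 1.0
    A[sl, n_star + bins[s]] = dose[s]
fit, *_ = np.linalg.lstsq(A, b, rcond=None)
print("recovered shift per bin:  ", np.round(fit[n_star:], 3))
print("true shift at bin centers:", np.round(0.02 * (np.linspace(13.35, 19.65, n_bin) - 13.0), 3))
```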

In order to work on such subjects I built a one-dimensional (yes the sky is a circle, not a 2-sphere) Gaia simulator. It currently doesn't do what is needed, so fork it and start coding! Or build your own. Or get serious and make a full mission simulator. But my point is not "Will Gaia work?"; it is "Can we make Gaia analysis less dependent on mechanistic CCD models?" In the process we might make it more precise overall. Enhanced goal: Analyze all of Gaia's mission choices with the model.

2012-09-16

scientific reproducibility police

At coffee this morning, Christopher Stumm (Etsy), Dan Foreman-Mackey (NYU), and I worked up the following idea of Stumm's: Every week, on a blog or (I prefer) in a short arXiv-only white paper, one refereed paper is taken from the scientific literature and its results are reproduced, as well as possible, given the content of the paper and the available data. I expect almost every paper to fail (that is, not be reproducible), of course, because almost every paper contains proprietary code or data or else is too vague to specify what was done. The astronomical literature is particularly interesting for this because many papers are based on public data; for those it comes down only to code and procedures; indeed I remember Bob Hanisch (STScI) giving a talk at ADASS showing that it is very hard to reproduce the results of typical papers based on HST data, despite the fact that all the data and almost all the code people use on them are public.

Stumm, Foreman-Mackey, and I discussed economic models and incentive models to make this happen. I think whoever did this would succeed scientifically, if he or she did it well, both because it would have huge impact and because it would create many new insights. But on the other hand it would take significant guts and a hell of a lot of time. If you want to do it, sign me up as one of your reproducibility agents! I think anyone involved would learn a huge amount about the science (more than they learn about reproducibility). In the end, it is the community that would benefit most, though. Radical!

2012-09-15

standards for point-spread-function meta data

When we share astronomical images, we expect the images to have standards-compliant descriptions of their astrometric calibration—the mapping between image position and sky position—in their headers. Naturally, it is just as important to have descriptions of the point-spread-function, for almost any astronomical activity (like photometry, source matching, or color measurement). And yet we have no standards. (Even the WCS standard for astrometry is seriously out of date). Develop a PSF standard!

Requirements include: It should be very flexible. It should permit variations of the PSF with position in the image. It should have a specified relationship between the stellar position and the position of the mean, median, or mode of the PSF itself. That latter point relates to the fact that astrometric distortions can be sucked up into PSF variations if you permit the mode of the PSF to drift relative to the star position. I like that freedom, but whether or not you permit it, it should be explicit.
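
To be concrete, here is a purely hypothetical sketch of what the payload of such a standard could contain; none of these field names exist anywhere, they just illustrate the requirements above (pixel-basis PSF images, polynomial spatial variation, and an explicit centering convention).

```python
# Purely hypothetical sketch of a minimal PSF meta-data payload; none of these
# field names are an existing standard. The PSF at (x, y) is a sum of pixel-basis
# images weighted by polynomials in image position, with the centering
# convention stated explicitly rather than left implicit.
psf_metadata = {
    "basis": "pixel",                   # pixel-basis images (could be shapelets, etc.)
    "oversampling": 4,                  # basis images sampled at 4x the detector pixel
    "n_basis": 3,                       # number of basis images shipped alongside
    "spatial_model": "polynomial",      # weights vary as polynomials in (x, y)
    "spatial_order": 2,                 # quadratic variation across the image
    "coefficients": "HDU 2",            # where the (n_basis, n_poly_terms) array lives
    "centering": "mode-at-origin",      # or "centroid-at-origin": MUST be explicit,
                                        # since astrometric distortion can hide in a
                                        # drifting PSF mode if this is left free
    "valid_region": [0, 0, 2048, 4096], # pixel bounding box where the model applies
}

def psf_weight(coeffs, x, y):
    """Evaluate quadratic spatial weights for one basis image at position (x, y)."""
    c0, cx, cy, cxx, cxy, cyy = coeffs
    return c0 + cx * x + cy * y + cxx * x * x + cxy * x * y + cyy * y * y
```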

2012-09-12

impute missing data in spectra

Let me say at the outset that I don't think that imputing missing data is a good idea in general. However, missing-data imputation is a form of cross-validation that provides a very good test of models or methods. My suggestion would be to take a large number of spectra (say stars or galaxies in SDSS), censor patches (multi-pixel segments) of them randomly, saving the censored patches. Build data-driven models using the uncensored data by means of PCA, HMF, mixture-of-Gaussians EM, and XD, at different levels of complexity (different numbers of components). Compare in their ability to reconstruct the censored data. Then use the best of the methods as your spectral models for, for example, redshift identification! Now that I type that I realize the best target data are the LRGs in SDSS-III BOSS, where the (low) redshift failure rate could be pushed lower with a better model. Advanced goal: Go hierarchical and infer/understand priors too.
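
Here is a minimal sketch of the censor-and-reconstruct test, using PCA at several ranks as the stand-in model; HMF, mixture-of-Gaussians, or XD would slot into the same loop. The `spectra` array is a placeholder for real spectra on a common grid.

```python
# Sketch of the censor-and-reconstruct test, here with PCA at several ranks.
# `spectra` is a hypothetical (n_spectra, n_pixels) array on a common grid.
import numpy as np

rng = np.random.default_rng(3)
spectra = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 200)) * 0.1 + 1.0  # placeholder

# censor one random multi-pixel patch per spectrum, remembering the truth
mask = np.ones_like(spectra, dtype=bool)
for i in range(spectra.shape[0]):
    start = rng.integers(0, spectra.shape[1] - 20)
    mask[i, start:start + 20] = False
censored = np.where(mask, spectra, np.nan)

for rank in (2, 5, 10, 20):
    work = np.where(mask, spectra, np.nanmean(censored, axis=0))  # crude initial fill
    for _ in range(20):   # alternate PCA fit and refill of the censored pixels
        mean = work.mean(axis=0)
        u, s, vt = np.linalg.svd(work - mean, full_matrices=False)
        model = mean + (u[:, :rank] * s[:rank]) @ vt[:rank]
        work = np.where(mask, spectra, model)
    err = np.sqrt(np.mean((model - spectra)[~mask] ** 2))
    print(f"rank {rank}: rms error on censored pixels = {err:.3f}")
```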

2012-09-11

galaxy photometric redshifts with XD

Data-driven models tend to be very naive about noise. Jo Bovy (IAS) built a great data-driven model of the quasar population that makes use of our highly vetted photometric noise model, to produce the best-performing photometric redshift system for quasars (that I know). This has been a great success of Bovy's extreme deconvolution (XD) hierarchical distribution modeling code. Let's do this again but for galaxies!

We know more about galaxies than we do quasars—so maybe a data-driven model doesn't make much sense—but we also know that data-driven models (even ones that don't take account of the noise) perform comparably well to theory-driven models, when it comes to galaxy photometric redshift prediction. So a data-driven model that takes account of the noise might kick ass. This was strongly recommended to me by Emmanuel Bertin (IAP). In other news, Bernhard Schölkopf (MPI-IS) opined to me that it might be the causal nature of the XD model that makes it so effective. I guess that's a non-sequitur.
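
As a sketch of the structure of such a model (with a plain Gaussian mixture standing in for XD, and toy data standing in for a training set with spectroscopic redshifts): fit the joint (colors, redshift) density, then condition on an object's observed colors to get p(z | colors). XD would replace the fit step with one that deconvolves each object's photometric errors.

```python
# Sketch: fit a Gaussian mixture to the joint (colors, spec-z) distribution of
# a training set, then condition on a new object's colors to get p(z | colors).
# XD would replace the plain GMM fit with one that deconvolves the photometric
# errors; this sketch ignores that step. The training arrays are toys.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
z = rng.uniform(0.0, 1.0, 5000)
colors = np.column_stack([z + 0.05 * rng.normal(size=z.size) for _ in range(4)])

gmm = GaussianMixture(n_components=20, covariance_type="full").fit(
    np.column_stack([colors, z]))

def p_z_given_colors(c, zgrid):
    """Evaluate the GMM conditional density p(z | colors=c) on a redshift grid."""
    d = c.size
    weights = np.zeros(gmm.n_components)
    cond_mu = np.zeros(gmm.n_components)
    cond_var = np.zeros(gmm.n_components)
    for k in range(gmm.n_components):
        mu, cov = gmm.means_[k], gmm.covariances_[k]
        Ccc, Ccz, Czz = cov[:d, :d], cov[:d, d], cov[d, d]
        r = np.linalg.solve(Ccc, c - mu[:d])
        # responsibility of component k for these colors
        weights[k] = gmm.weights_[k] * np.exp(-0.5 * (c - mu[:d]) @ r) \
                     / np.sqrt(np.linalg.det(2 * np.pi * Ccc))
        cond_mu[k] = mu[d] + Ccz @ r
        cond_var[k] = Czz - Ccz @ np.linalg.solve(Ccc, Ccz)
    weights /= weights.sum()
    pdf = np.zeros_like(zgrid)
    for k in range(gmm.n_components):
        pdf += weights[k] * np.exp(-0.5 * (zgrid - cond_mu[k]) ** 2 / cond_var[k]) \
               / np.sqrt(2 * np.pi * cond_var[k])
    return pdf

zgrid = np.linspace(0.0, 1.0, 200)
print("posterior peak:", zgrid[np.argmax(p_z_given_colors(colors[0], zgrid))], "truth:", z[0])
```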

2012-09-10

de-blur long exposures that show the rotation of the sky

Here at Astrometry.net headquarters we get a lot of images of the night sky where the exposure is long and the stars have trailed into partial circular arcs. If we could de-blur these into images of the sky, this would be great: Every one of these trailed images would provide a photometric measurement of every star. Advanced goal: Every one of these trailed images would provide a photometric light curve of every star. That would be sweet! Not sure if this is really research, but it would be cool.

The problem is easy, because every star traverses the same angle in a circle with the same center. Easy! But the problem is hard because the images are generally taken with cameras that have substantial field distortions (distortions in the focal plane away from a pure tangent-plane projection of the sky). Still, it seems totally do-able!
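
Here is a toy Richardson-Lucy-style sketch of the de-blurring, under assumptions that are mine alone: the blur is an average over rotations about the image center through a known trail angle. Real data would need the true rotation center and the field distortions folded into that operator.

```python
# Toy Richardson-Lucy-style deconvolution where the blur is an average over
# rotations about an assumed center (here the image center) through a known
# total trail angle. Real data need the true rotation center and the camera's
# field distortion folded into this operator.
import numpy as np
from scipy.ndimage import rotate

ANGLES = np.linspace(0.0, 5.0, 21)   # degrees spanned during the exposure

def rotation_blur(img, sign=+1.0):
    """Average the image over the trail angles (sign=-1 gives the adjoint)."""
    return np.mean([rotate(img, sign * a, reshape=False, order=1, mode="nearest")
                    for a in ANGLES], axis=0)

def deblur(trailed, n_iter=30):
    estimate = np.clip(trailed, 1e-6, None)
    for _ in range(n_iter):    # multiplicative Richardson-Lucy-style updates
        ratio = trailed / np.clip(rotation_blur(estimate), 1e-6, None)
        estimate = estimate * rotation_blur(ratio, sign=-1.0)
    return estimate

# quick demo on a fake star field
rng = np.random.default_rng(14)
sky = np.zeros((128, 128))
sky[rng.integers(10, 118, 30), rng.integers(10, 118, 30)] = rng.uniform(1, 10, 30)
recovered = deblur(rotation_blur(sky))
```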

Pedants beware: Of course I know that it is the Earth rotating and not the sky rotating! But yes, I have made that pedantic point on occasion too.

2012-09-07

design strategy for vector and tensor calibration

In Holmes et al 2012 (new version coming soon) we showed practical methods for designing an imaging survey for high-quality photometric calibration: You don't need a separate calibration program (separate from the main science program) if you design it our way. This is like a scalar calibration: We are asking What is the sensitivity at every location in the focal plane? We could have asked What is the astrometric distortion away from a tangent-plane at every location in the focal plane?, which is a vector calibration question, or we could have asked What is the point-spread function at every location in the focal plane?, which is a tensor calibration question. Of course the astrometry and PSF vary with time in ground-based surveys, but for space-based surveys these are relevant self-calibration questions. We learned in the above-cited paper that certain kinds of redundancy and non-redundancy make scalar calibration work, but the requirements will go up as the rank of the calibration goes up too. So repeat for these higher-order calibrations! Whatever you do might be highly relevant for Euclid or WFIRST, which both depend crucially on the ability to calibrate precisely. Even ground-based surveys, though dominated by atmospheric effects, might have fixed distortions in the WCS and PSF that a good survey strategy could uncover better than any separate calibration program.
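
For orientation, here is a toy of the scalar version, the machinery that would have to be generalized to the vector and tensor cases: solve jointly for star magnitudes and focal-plane-patch zero-points from repeated, dithered observations by linear least squares. All numbers are invented.

```python
# Toy scalar self-calibration: solve jointly for star magnitudes and
# focal-plane-patch zero-points from repeated, dithered observations.
# All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(5)
n_star, n_patch, n_obs = 300, 16, 8
true_mag = rng.uniform(14.0, 18.0, n_star)
true_zp = 0.05 * rng.normal(size=n_patch)    # per-patch zero-point errors (mag)

star_idx = np.repeat(np.arange(n_star), n_obs)
patch_idx = rng.integers(0, n_patch, star_idx.size)   # dithering: random patches
obs = true_mag[star_idx] + true_zp[patch_idx] + 0.01 * rng.normal(size=star_idx.size)

# design matrix over [star magnitudes, patch zero-points]; one extra row pins
# the mean zero-point to zero and breaks the overall degeneracy
A = np.zeros((star_idx.size + 1, n_star + n_patch))
A[np.arange(star_idx.size), star_idx] = 1.0
A[np.arange(star_idx.size), n_star + patch_idx] = 1.0
A[-1, n_star:] = 1.0
b = np.append(obs, 0.0)
fit, *_ = np.linalg.lstsq(A, b, rcond=None)
print("zero-point rms error:", np.std(fit[n_star:] - (true_zp - true_zp.mean())))
```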

2012-09-06

track covisibility of stars

The Astrometry.net system sees a huge amount of heterogeneous data, from wide-field snapshots to very narrow-field professional images, to all-sky fish-eye cloud cameras. Any image that is successfully calibrated by the system has been matched to a database of four-star figures (quads) and then verified probabilistically using all the stars in the image and in the USNO-B1.0 Catalog in that region (down to some effective magnitude cut). Of course the quad index and the catalog are both suspect, in the sense that they both contain stars that are either non-existent or else have wrong properties. The amusing thing is that we could construct a graph in which the nodes are catalog entries and the edges are instances in which pairs of stars have been observed in the same image.

This graph would contain an enormous amount of information about the sky. For example, the network could be used to create a brightness ordering of stars on the sky, which would be amusing. But more importantly for us, the covisibility information would tell us what pairs of stars we should be using together in quads, and what pairs we shouldn't. That analysis would take account not just of their relative magnitudes, but also the typical angular scales of the images in which stars of that magnitude tend to be detected. It would also identify (as nodes with few or no edges) catalog entries that don't correspond to stars, and groups of catalog entries that are created by certain kinds of artifacts (like handwriting on the photographic plates, etc) that generate certain kinds of false positive matches in our calibrations.

This idea was first suggested to Dustin Lang (CMU) and me by Sven Dickinson (Toronto) at Lang's PhD defense. Advanced goal: Make a directed graph, with arrows going from brighter to fainter. Then use statistics of edge directions to do a better job on brightness ranking and also classify images by bandpass, etc. Even more advanced goal: Evolve away from star catalogs to covisible-asterism catalogs! At the bright end (first or second magnitude), we might be able to propose a better set of constellations.
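
Here is a minimal sketch of the graph-building step (with placeholder calibration results in place of real Astrometry.net output): every solved image contributes edges between each pair of matched catalog stars, isolated entries become suspects, and a directed brighter-to-fainter version gives a crude brightness ranking.

```python
# Sketch of the covisibility graph, with placeholder calibration results: each
# solved image contributes edges between every pair of catalog stars it matched.
# The directed version (brighter -> fainter) gives a crude brightness ranking.
import itertools
import networkx as nx

# placeholder: list of solved images, each a list of (catalog_id, instrumental_flux)
solved_images = [
    [(101, 9.0), (102, 5.0), (103, 1.0)],
    [(102, 6.0), (103, 1.2), (104, 0.4)],
    [(105, 3.0)],   # a lonely detection; maybe a spurious catalog entry
]

G, D = nx.Graph(), nx.DiGraph()
for image in solved_images:
    for s, _ in image:
        G.add_node(s)
    for (a, fa), (b, fb) in itertools.combinations(image, 2):
        G.add_edge(a, b)
        bright, faint = (a, b) if fa > fb else (b, a)
        D.add_edge(bright, faint)

# catalog entries that are never covisible with anything are suspects
print("low-covisibility suspects:", [n for n in G.nodes if G.degree(n) == 0])

# crude brightness ranking: order by (in-degree - out-degree) in the directed graph
ranking = sorted(D.nodes, key=lambda n: D.in_degree(n) - D.out_degree(n))
print("brightest-to-faintest guess:", ranking)
```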

2012-09-05

show that low-luminosity early-type galaxies are oblate

Here's an old one from the vault: Plot the surface brightness of early-type galaxies (red, dead) as a function of ellipticity and show that surface brightness rises with ellipticity. This is what is expected if early-type galaxies are transparent and oblate. I know from nearly completing this project many years ago that this will work well for lower-luminosity early types and badly for higher-luminosity early types. The cool thing is that, under the oblate assumption, the true three-dimensional axis-ratio and three-dimensional central stellar density distribution function can be inferred from the observed two-dimensional distributions under the (weak) assumption of isotropy of the observations. That assumption isn't perfectly true but it is close. You can use high signal-to-noise imaging and SDSS spectroscopy to do the object selection, so observational noise in selection and measurement won't pose big problems.
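
The key geometric ingredient, as a sketch: for a transparent oblate spheroid with intrinsic axis ratio beta viewed at inclination i, the projected axis ratio obeys q^2 = cos^2(i) + beta^2 sin^2(i), so with isotropic viewing angles you can forward-model the observed ellipticity distribution from any assumed intrinsic distribution (and the real project is the inversion).

```python
# Forward model of observed axis ratios for transparent oblate spheroids:
# q^2 = cos^2(i) + beta^2 sin^2(i) with isotropic viewing angles (cos i uniform).
# Inverting the observed q histogram for p(beta) is the actual project.
import numpy as np

def observed_axis_ratios(beta_samples, rng=np.random.default_rng(6)):
    cosi = rng.uniform(0.0, 1.0, beta_samples.size)   # isotropic orientations
    return np.sqrt(cosi ** 2 + beta_samples ** 2 * (1.0 - cosi ** 2))

# e.g. a toy intrinsic distribution peaked at beta ~ 0.3 (flattened systems)
beta = np.clip(np.random.default_rng(7).normal(0.3, 0.05, 100000), 0.05, 0.95)
q = observed_axis_ratios(beta)
hist, edges = np.histogram(q, bins=20, range=(0.0, 1.0), density=True)
print(np.round(hist, 2))
```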

This is another Scott Tremaine (IAS) project. Mike Blanton (NYU) and I basically did this many years ago with SDSS data, but we never took it through the last mile to publication, so it is wide open. Actually, it seems likely that someone has done this previously, so start with a literature search! Bonus points: Figure out what's up with the high-luminosity early types. They are either triaxial or a mix of oblate and prolate.

2012-09-04

find catastrophes in the stellar distribution

In Zolotov et al (2011) we asked the question: Might tiny dwarf galaxy Willman 1 be just a cusp in the stellar distribution of the Milky Way? If you generically have lines and sheets in phase space—and we very strongly believe that the Milky Way does—then generically you will have folds in those (in non-trivial projections they are required), and those folds generically produce catastrophes (localized regions of very high density) of various kinds (folds, cusps, swallowtails, and so on), which could mimic gravitationally bound or recently disrupted overdensities in the stellar distribution. The cool thing is that the catastrophes have quantitative two-dimensional morphologies that are very strongly constrained by mathematics (not just physics). The likelihood test we did in the Zolotov paper could easily be expanded into a search technique, maybe with some color-magnitude-diagram filtering mixed in. The catastrophes pretty much have to be there so get ready to get rich and famous! If you go there, send email to Scott Tremaine (IAS), who first proposed this idea to me.

2012-09-03

analyze quadratic star centroiding

Inside the core SDSS pipelines and inside the Astrometry.net source-detection code simplexy, centroiding—measurement of star positions in the image—is performed by fitting two-dimensional second-order polynomials to the central 3×3 pixel patch centered on the brightest pixel of each star. This is known to work far better than taking first moments of the light distribution (integrals of x and y times the brightness above background) for the (possibly obvious) reason that it is a quasi-justified (in terms of likelihood) fit.
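
Here is a minimal version of that centroiding step, a least-squares fit of a two-dimensional second-order polynomial to the 3×3 patch, returning its stationary point. The conventions inside the SDSS and simplexy code may differ in detail.

```python
# Minimal 3x3 quadratic centroid: least-squares fit of a 2-D second-order
# polynomial to the patch around the peak pixel, then return its stationary
# point. Conventions inside the SDSS / simplexy code may differ in detail.
import numpy as np

def quadratic_centroid(patch):
    """patch: 3x3 array centered on the brightest pixel; returns (dx, dy) offset."""
    y, x = np.mgrid[-1:2, -1:2]
    A = np.column_stack([np.ones(9), x.ravel(), y.ravel(),
                         x.ravel() ** 2, (x * y).ravel(), y.ravel() ** 2])
    c = np.linalg.lstsq(A, patch.ravel(), rcond=None)[0]   # [c0, cx, cy, cxx, cxy, cyy]
    H = np.array([[2 * c[3], c[4]], [c[4], 2 * c[5]]])     # Hessian of the quadratic
    dx, dy = np.linalg.solve(H, [-c[1], -c[2]])            # set the gradient to zero
    return dx, dy

# quick check on a sampled Gaussian offset by (0.2, -0.1) pixels
yy, xx = np.mgrid[-1:2, -1:2]
patch = np.exp(-0.5 * ((xx - 0.2) ** 2 + (yy + 0.1) ** 2))
print(quadratic_centroid(patch))   # roughly (0.2, -0.1)
```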

Of course not all of the information about a star's position is contained in that central 3×3 pixel patch (and the method doesn't make use of any point-spread function information to boot). For this reason, Jo Bovy (IAS) and I did some work a few years ago to test it. Things intervened and we never finished, but our preliminary results were really surprising: For well-behaved point-spread functions, the two-dimensional quadratic fit in the 3×3 patch performed almost indistinguishably from fits that made use of the true point-spread function and larger patches. That is, it appeared in our early tests that the 3×3 patch does contain most of the centroiding information! A good research project would see how the 3×3 patch inference degrades relative to the point-spread-function inference, as a function of PSF properties and the signal-to-noise, with an eye to analyzing when we need to be thinking about doing better. I will call out Adrian Price-Whelan (Columbia) here, because he is all set up with the machinery to do this!

2012-09-02

Find more of the GD-1 stream

The GD-1 stream spans many tens of degrees in the SDSS data. The stellar density in the stream is inhomogeneous, but the stream appears to be terminated by the survey boundary, and not before. So we should be able to find much more of it! And more stream means better constraints on the mass model for the Milky Way and the formation of cold streams. A few years back we made a model of GD-1, so we can predict where the stream will be on the unobserved parts of the sky and at what heliocentric distance. These properties of the stream set the parameters for a simple (say) three-color ground-based imaging survey to recover the stream in the Southern Hemisphere. Before you go get the observing time, I would recommend looking in the various data archives; there might already be sufficient data out there to map parts of the stream right now.

2012-09-01

remove satellite trails from arbitrary astronomical imaging

Satellite trails appear as long lines in astronomical imaging, often nearly unresolved or slightly resolved. They are easy to find, fit, and subtract away, at least in principle. I have had several undergraduate researchers, however, who got close but couldn't deliver a robust, reliable piece of code.

The code I imagine takes an image (and an optional inverse variance image). It identifies if the image contains a satellite trail (possibly using the Hough Transform and some heuristics). If it does, it fits the trail using robust fitting techniques. If that all works, it returns to the user an updated image and an updated inverse variance map. Not hard! The only hard parts are making it robust and making it fast. I have a lot of good ideas on both parts of that; I think this is very do-able, and it is only a few weeks work for the right person. It would be hella useful too, especially for the human-viewable image projects I am working on. Enhanced goal: Fit for satellite tumbling or blinking (both things are common in the data I have).
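
Here is a sketch of the detect-and-mask flow I have in mind, using scikit-image's probabilistic Hough transform on a thresholded image; the robust trail-profile fit is left out, and the numbers are placeholders.

```python
# Sketch of the detect-and-mask flow: threshold, probabilistic Hough transform
# to find long straight trails, then inflate the detected segments into a mask
# and zero the inverse variance there. The robust profile fit is not shown.
import numpy as np
from skimage.transform import probabilistic_hough_line
from skimage.morphology import dilation, disk
from skimage.draw import line as draw_line

def mask_satellite_trails(image, invvar, nsigma=3.0, min_length=200):
    sky = np.median(image)
    noise = 1.4826 * np.median(np.abs(image - sky))   # robust sigma estimate
    binary = image > sky + nsigma * noise
    segments = probabilistic_hough_line(binary, threshold=10,
                                        line_length=min_length, line_gap=5)
    mask = np.zeros(image.shape, dtype=bool)
    for (x0, y0), (x1, y1) in segments:
        rr, cc = draw_line(y0, x0, y1, x1)
        mask[rr, cc] = True
    mask = dilation(mask, disk(3))            # fatten to cover the trail width
    cleaned_invvar = np.where(mask, 0.0, invvar)
    return mask, cleaned_invvar
```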

2012-08-31

learn distortion models for common cameras

Our astronomical image-recognition software Astrometry.net (see Lang et al 2010) does a very good job on professional-grade astronomical images. It is less reliable on snapshots taken with normal, wide-angle lenses and fish-eye lenses. This is ironic, because from an information-theory point of view, they are much easier to recognize: They contain obvious, familiar constellations. The problem is that these wide-field shots have substantial geometric distortions in the field. These distortions foil Astrometry.net because they make the mapping from sky to focal plane non-conformal (squares on the sky don't go to squares on the image).

Another ironic thing about all this is that in fact these distortions are very common among cameras and very predictable. They should be extremely predictable using the EXIF meta-data in image headers, and even without that I bet the distortions live in a small family of possible choices. The project is: Find out what these standard distortions are, and fix Astrometry.net so it knows about them in advance, either by de-distorting the star list that it eats (this would be the easy option) or else by making the star-figure-matching step invariant to those kinds of distortions (that would be the hard option). Actually even easier would be to just have Astrometry.net automatically lower its tolerances as the asterisms get close to the field edges!
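
A sketch of the easy option, assuming a simple Brown-Conrady-style radial distortion (my choice of model, and made-up coefficients): undo the distortion on the detected star list before quad matching, trying a small family of plausible coefficient sets, perhaps indexed by the EXIF lens identifier.

```python
# Sketch of the "easy option": undo a simple radial distortion on the detected
# star list before quad matching, trying a small family of plausible (k1, k2)
# coefficients (made-up numbers; a real set would be indexed by EXIF lens data).
import numpy as np

def undistort(xy, center, k1, k2, half_diag):
    """Map detected (distorted) positions back toward ideal tangent-plane positions.
    Model: r_distorted = r_ideal * (1 + k1 r_ideal^2 + k2 r_ideal^4), inverted
    by a few fixed-point iterations; radii are in units of the half-diagonal."""
    d = (xy - center) / half_diag
    r_d = np.hypot(d[:, 0], d[:, 1])
    r = r_d.copy()
    for _ in range(5):                      # fixed-point inversion of the radial model
        r = r_d / (1.0 + k1 * r ** 2 + k2 * r ** 4)
    scale = np.where(r_d > 0, r / r_d, 1.0)
    return center + d * scale[:, None] * half_diag

# candidate coefficient sets one might try per lens family (made-up numbers)
CANDIDATES = [(0.0, 0.0), (-0.1, 0.0), (-0.2, 0.02), (0.05, 0.0)]
```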

2012-08-30

find non-galaxies in the SDSS

The SDSS has issues with bright stars and galaxies: The pipelines were made for 17th-magnitude objects and therefore (a) the point-spread function doesn't include features like diffraction spikes or large-angle wings and (b) de-blending of much brighter galaxies is overly aggressive. These are known problems with the SDSS but they haven't been analyzed (in the literature) in complete detail (meaning: many SDSS investigators know a lot about these things but they don't appear in papers in one coherent place).

In experiments conducted by Dustin Lang (CMU) and myself, we found that you can find rings of blue galaxies around red galaxies (really just blue spiral arms around red bulges) and you can find very elongated galaxies pointing at bright stars (really just unmodeled diffraction spikes in the PSF being modeled as galaxies). Both of these kinds of anomalies in the survey data are generally flagged: The SDSS pipelines are very clever about figuring out where they are going wrong or out-of-spec; most SDSS data analyses have used aggressive flag cuts to remove possibly problematic galaxies. The most interesting objectives from my perspective are (1) building a full catalog or list or annotation of all (all in some category, like diffraction-spike) anomalous (that is, wrong) galaxies in the SDSS, (2) identifying anomalous galaxies that wouldn't have been caught by one of the standard, aggressive flag cuts, and (3) figuring out what fraction of aggressively cut-out galaxies are in fact real and trustworthy.

I have made this post about the SDSS but of course it applies to any large imaging survey with an automatically produced catalog. Indeed, we did very similar things for USNO-B and there are more to do there. One amusing thing about the project is that the bright stars in the SDSS (the ones that show diffraction spikes) are themselves often classified as galaxies, because they become extended when they saturate. But that's a detail!

2012-08-29

build a paragraph-level index for arXiv

There is plenty of technology ready-to-use that does author-topic modeling, in part because so many of the machine-learning community's successes have been in text processing and classification. Some of this technology has been run on the arXiv at the abstract, title, meta-data level, but rarely at the full-text level. If we ran it there, we could build an author-topic model over every paragraph in the full corpus of the arXiv. This would tell us what every paragraph is about (and, amusingly, who wrote every paragraph), with (if we do it right) probabilistic output. That is, it would give probability distributions over what every paragraph is about.

If done correctly, and with a little hand-labeling of classes (and this is easy because each class would have characteristic words and phrases and authors), this could lead to a complete and exceedingly useful paragraph-level index into the arXiv. But even without the hand-labeling it would be incredibly useful: It could be used to find paragraphs in the literature that are useful and relevant to every paragraph you have written in one of your own papers, thus locating related work you might not know about. It could drive search services that find paragraphs relevant to your search terms even when they don't, themselves, contain those terms. And so on!
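
Here is a minimal sketch of the modeling step with plain LDA in scikit-learn; it is a topic model only, not the full author-topic model, and the `paragraphs` list is a stand-in for the real full-text corpus.

```python
# Sketch: plain LDA over paragraphs with scikit-learn. This is a topic model
# only, not the full author-topic model described above, and `paragraphs` is a
# placeholder for the real corpus of arXiv full-text paragraphs.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

paragraphs = [
    "we measure the two point correlation function of luminous red galaxies",
    "the damped random walk model describes quasar variability in the time domain",
    "we marginalize over the point spread function when fitting galaxy shapes",
]   # placeholder; the real corpus is millions of paragraphs

vectorizer = CountVectorizer(stop_words="english", max_features=50000)
counts = vectorizer.fit_transform(paragraphs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# per-paragraph probability distribution over topics
print(lda.transform(counts).round(2))

# top words per topic, for hand-labeling the classes
words = vectorizer.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    print("topic", k, ":", [words[i] for i in comp.argsort()[-5:][::-1]])
```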

Dan Foreman-Mackey (NYU) first made it clear to me that this would be possible and he also took some steps towards making it happen. David Blei (Princeton) suggested to me that even from a machine-learning perspective the outcomes could be very interesting. The plagiarism paper from the arXiv people suggests working at PDF level rather than LaTeX source level; I am not sure myself which to do.

2012-08-28

will chemical tagging work?

Chemical tagging is the name given to the idea that we can match up stars in abundance space (detailed chemical properties) as well as kinematic space to figure out the origin and common orbits of stars in the Milky Way. Because it would be so valuable to figure out that different stars shared a common origin at formation (for things like orbit inference), chemical tagging could enormously improve the precision of any dynamical or galaxy-formation information coming from next-generation surveys.

In the many conversations I have seen about chemical tagging, arguments break out about whether it is possible to measure the chemical abundances of stars of different temperatures and surface gravities comparably. That is: Can we figure out that this F star has the same abundances as this other K star? Or this red giant and this main-sequence star? And it is certainly not clear: Chemical abundances are not measured at enormous precision and there are many possible biases, sources of variance, and systematic error.

My proposal is that we ask these questions not in the space of the outputs of chemical-abundance models but rather in the space of stellar spectroscopy observables. The question becomes not are the models good enough? but rather is there information in the data? And there needs to be information sufficient to distinguish thousands (yes that is the goal) of chemically distinct sub-populations.

If there is sufficient information, then in the dozens-to-hundreds-of-dimensions space of all possible absorption-line measurements (plus stellar temperature), do we see thousands of distinct families of (possibly very complex) one-dimensional loci (each locus being a birth-mass-sequence at fixed chemical abundances and age)? The idea would be to do this purely in the space of spectra but—probably necessarily—relying heavily on models to guide the eye (or really guide the code) where to look.

I have discussed this with Ken Freeman (ANU) and Mike Blanton (NYU), but as far as I know, no-one is working on it. Blanton had the great idea that we don't really need to make spectral features before starting. The question does the distribution of stellar spectra split up into many tiny, thin, curvy lines in spectrum space? can be asked with just well-calibrated spectra. And we have lots of those!
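
As a crude sketch of the is-there-structure test (density-based clustering in the space of line equivalent widths plus temperature): real data would need careful per-feature noise scaling, and a method that looks for thin one-dimensional loci rather than blobs would be better than DBSCAN, but this shows the shape of the question.

```python
# Sketch of the "is there structure in the observables?" test: density-based
# clustering in the space of line equivalent widths plus temperature. Real data
# need careful per-feature noise scaling, and a method that finds thin
# one-dimensional loci (not blobs) would be better than DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
# placeholder: 3 chemically distinct "populations", 30 line EWs plus Teff each
X = np.vstack([rng.normal(loc=rng.uniform(0, 1, 31), scale=0.02, size=(300, 31))
               for _ in range(3)])

labels = DBSCAN(eps=1.0, min_samples=10).fit_predict(StandardScaler().fit_transform(X))
n_groups = len(set(labels)) - (1 if -1 in labels else 0)
print("distinct groups found:", n_groups, "; unassigned stars:", np.sum(labels == -1))
```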

2012-08-27

emission-line clustering and classification

The BPT diagram has been incredibly productive in classifying galaxies into star-forming and AGN-powered classes. However, the diagram shows only two ratios of nearby lines: ratios of nearby lines so that dust and spectrograph calibration don't mess up the data, and only two because it is a single two-dimensional plot. There might be many features in emission-line space sitting undiscovered in the data; there might be many sub-classes and rich structure within the star-forming and AGN groups.

From a data perspective, times have really changed since BPT: (1) There are dozens (well, a dozen) of visible lines in hundreds of thousands of spectra. (2) We have good noise models for the line measurements and this is especially important when they get low in signal-to-noise (as they do if you want to use many lines). (3) We have very well-calibrated spectra now, even spectrophotometrically good to a few percent in the SDSS. (4) The effects of dust attenuation are pretty well understood in the optical. So let's go high dimensional and find all the complex structure that must be there!

The first step is to measure all the lines in a long list, and measure them even when the signal-to-noise is low. We don't care about detections; we care about measurements with well-understood noise. The second step is to develop dust-insensitive metrics: What is the distance in data space between two sets of line measurements as a function of noise but marginalizing out the dust affecting each spectrum? Now in that space, let's do some clustering.
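
Here is a sketch of the dust-insensitive part: in log-line-flux space, attenuation moves each galaxy approximately along a single direction set by the extinction curve evaluated at the line wavelengths, so project that direction out and cluster in the orthogonal subspace. The extinction-curve values and the Gaussian-mixture clustering below are placeholders (a noise-aware XD-style fit is what you really want).

```python
# Sketch of a dust-insensitive metric: in log-line-flux space, attenuation moves
# each galaxy approximately along one direction set by the extinction curve at
# the line wavelengths. Project that direction out, then cluster. The extinction
# curve and the plain Gaussian-mixture clustering are placeholders.
import numpy as np
from sklearn.mixture import GaussianMixture

line_waves = np.array([3727.0, 4861.0, 5007.0, 6563.0, 6584.0, 6717.0])
k_lambda = 1.0 / (line_waves / 5500.0)     # placeholder ~1/lambda attenuation law
dust_dir = k_lambda / np.linalg.norm(k_lambda)

def project_out_dust(log_fluxes):
    """Remove the component of each log-line-flux vector along the dust direction."""
    return log_fluxes - np.outer(log_fluxes @ dust_dir, dust_dir)

# placeholder data: two intrinsic classes, each reddened by a random amount
rng = np.random.default_rng(9)
base = np.where(rng.random(2000)[:, None] < 0.5, 0.0, 1.0) * np.array([0, .5, 1., .2, .8, .3])
log_fluxes = base + rng.uniform(0, 2, 2000)[:, None] * (-0.4 * k_lambda) \
             + 0.05 * rng.normal(size=(2000, 6))

clean = project_out_dust(log_fluxes)
labels = GaussianMixture(n_components=2).fit_predict(clean)
print("class sizes:", np.bincount(labels))
```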

I have done nothing on this except discuss it, years ago, with John Moustakas (Siena College). At that time, we were thinking in terms of generating archetypes with an integer program (with my now-deceased guru Sam Roweis). You could use things like support vector machines (great for these kinds of tasks) but we have no labels to classify on. The idea is to find classes not yet discovered! Also SVMs are not sensitive to the uncertainties in the data. I would recommend something like extreme deconvolution which does density estimation of the noise-deconvolved distribution. It can deal with very low signal-to-noise data gracefully. It would have to be modified, however, to project out (marginalize out) the dust-extinction direction in line space. Not impossible but not trivial either.

2012-08-26

what is the spectrum of dust attenuation?

The SDSS has taken spectra of thousands of F-type stars, at different distances and through different amounts of interstellar dust. These stars were chosen for calibration purposes; they were chosen because they have very well-understood and consistent spectra. These have been used to calibrate the SDSS telescope, but they can also be used to calibrate interstellar dust.

The general procedure would be to start by measuring the equivalent widths of a few absorption lines—preferably a couple of Balmer lines and a couple of metal lines—consistently for all F-stars. These line EWs would provide a dust-independent temperature and metallicity indicator for all the stars. Compare the spectra of the F-stars at different reddening but fixed absorption-line equivalent widths (and therefore fixed temperature and metallicity) to get the dust attenuation at a resolution of a few thousand. There probably isn't anything interesting there, but if there is it would be a valuable discovery.

The easiest way to do this project is by spectral stacking, but there might be methods that build a non-linear model of the stellar spectrum with three controlling parameters: Balmer EW, metal EW, and SFD-dust-map amplitude. I started discussing this project many years ago with Karl Gordon (STScI); if you want to give it a shot, send us both email for ideas (if you want to; otherwise do it and surprise us!).
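
Here is a sketch of the stacking version (all arrays are placeholders): within bins of (Balmer EW, metal EW), difference the mean log spectrum of low- and high-reddening stars and divide by the E(B-V) difference to get an attenuation-curve estimate.

```python
# Sketch of the stacking version: within bins of (Balmer EW, metal EW), compare
# the mean log spectrum of low- vs high-reddening stars; the difference divided
# by the E(B-V) difference estimates the attenuation curve (in log-flux units
# per unit E(B-V)). All input arrays are hypothetical placeholders.
import numpy as np

def attenuation_curve(log_spectra, balmer_ew, metal_ew, ebv, n_bins=5):
    """log_spectra: (n_star, n_pix); returns an averaged attenuation spectrum."""
    curves, weights = [], []
    b_edges = np.quantile(balmer_ew, np.linspace(0, 1, n_bins + 1))
    m_edges = np.quantile(metal_ew, np.linspace(0, 1, n_bins + 1))
    for i in range(n_bins):
        for j in range(n_bins):
            sel = ((balmer_ew >= b_edges[i]) & (balmer_ew <= b_edges[i + 1]) &
                   (metal_ew >= m_edges[j]) & (metal_ew <= m_edges[j + 1]))
            if sel.sum() < 20:
                continue
            hi = ebv[sel] > np.median(ebv[sel])
            d_ebv = ebv[sel][hi].mean() - ebv[sel][~hi].mean()
            if d_ebv <= 0:
                continue
            diff = log_spectra[sel][~hi].mean(axis=0) - log_spectra[sel][hi].mean(axis=0)
            curves.append(diff / d_ebv)
            weights.append(sel.sum())
    return np.average(curves, axis=0, weights=weights)
```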

2012-08-25

javascript radio interferometry simulator

This isn't quite a full-blown research project, but it could evolve into one if done correctly. I want a multi-panel browser view, with one panel being an input panel in which I can arrange and set the brightnesses of point sources in an astronomical scene or a sky patch, and then a few panels showing the real part, imaginary part, amplitude, and phase of the fourier transform of the scene. It should all run in the browser for speed and flexibility. Also, there could be panels in which you set down antennae, get baselines in the uv plane (possibly as a function of wavelength and time as the Earth rotates), and show also the dirty-beam reconstructed scene (and maybe also the clean-beam reconstructed scene). This could be used to develop intuition about radio astronomy and the fourier transform. If done right it could also be used to plan observations (indeed, it could have an ALMA mode where it knows about the ALMA antennae). If done really right it could be used to aid in data analysis.
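
Here is a numpy prototype of the core computation (the project proper wants this interactive, in the browser): point-source scene, Fourier transform, sample the visibilities at baseline (u, v) points, and make a dirty image from the sampled visibilities. Units and gridding are deliberately crude.

```python
# Numpy prototype of the core computation (the actual project wants this live
# in a browser): point-source scene -> FFT -> sample at (u, v) baseline points
# -> dirty image from the sampled visibilities. Units and gridding are crude.
import numpy as np

n = 256
scene = np.zeros((n, n))
scene[100, 120] = 1.0          # a couple of point sources
scene[140, 130] = 0.5

vis = np.fft.fftshift(np.fft.fft2(scene))          # full visibility plane

# crude uv sampling: keep only the visibility cells hit by the baselines
rng = np.random.default_rng(10)
u = rng.integers(-60, 60, 200)                     # placeholder baseline tracks
v = rng.integers(-60, 60, 200)
mask = np.zeros((n, n), dtype=bool)
mask[n // 2 + v, n // 2 + u] = True
mask[n // 2 - v, n // 2 - u] = True                # conjugate baselines (Hermitian symmetry)

dirty = np.fft.ifft2(np.fft.ifftshift(np.where(mask, vis, 0.0))).real
dirty_beam = np.fft.ifft2(np.fft.ifftshift(mask.astype(float))).real
print("peak of dirty image:", dirty.max(), "at", np.unravel_index(dirty.argmax(), dirty.shape))
```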

2012-08-24

get SDSS colors and magnitudes for very bright stars

The SDSS saturates around 14th magnitude. However, (a) the gains are set such that the CCD pixels saturate before the analog-to-digital read-out saturates, and (b) the bleeding of charge on the CCD is essentially charge-conserving. Also, when very bright stars cross the readout register in the CCD, they leave a thin 2048-pixel line across the full camera column. And also also, the stars have well-defined diffraction spikes that are visible to large angular radii.
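
Here is a tiny illustration of the charge-conservation point, with a fake bleed model of my own invention: if bleeding only moves charge along the column, a big background-subtracted aperture over the star plus its trail still recovers the total counts. The real problems (nonlinearity near saturation, diffraction-spike modeling) are ignored here.

```python
# Tiny illustration of the charge-conservation point: if bleeding only moves
# charge along the column, a big background-subtracted aperture over the star
# plus its bleed trail still recovers the total counts. Nonlinearity near
# saturation and diffraction-spike modeling are the real (ignored) problems.
import numpy as np

rng = np.random.default_rng(11)
img = 100.0 + rng.normal(0.0, 5.0, (500, 500))     # sky plus noise
total_counts = 5.0e6
star = np.zeros_like(img)
star[250, 250] = total_counts

# crude fake bleed: clip the column at "full well" and spread the excess charge
full_well = 6.0e4
col = star[:, 250].copy()
excess = np.clip(col - full_well, 0, None).sum()
col = np.clip(col, 0, full_well)
n_bleed = int(np.ceil(excess / full_well))
col[250 - n_bleed // 2: 250 - n_bleed // 2 + n_bleed] += excess / n_bleed
star[:, 250] = col
img += star

# charge-conserving photometry: big box around the trail, subtract the sky level
box = img[250 - 60:250 + 60, 250 - 10:250 + 10]
sky = np.median(img)
print("recovered counts:", box.sum() - sky * box.size, "truth:", total_counts)
```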

No-one says this is easy; this is a blog of good ideas not easy ideas: For one, the detector may become weakly nonlinear shortly before CCD pixel saturation; that is, the effective gain may be lower at brighter magnitudes; any project would have to look carefully into this, and you don't have variable exposure times to use (all of SDSS was taken at 55-second exposure time for very important reasons). For another, the shape and size of the diffraction spikes might be a strong function of position in the focal plane. However, I have hope, because the charge bleeds are so very very beautiful when inspected in detail.

Some prior work on this has been done by myself and Doug Finkbeiner (Harvard). It would be worth checking in with Fink before embarking.

2012-08-23

bimodality search or kurtosis components analysis

Take the SDSS spectra (which are beautifully calibrated spectrophotometrically) and interpolate them onto a common rest-frame (de-redshifted) wavelength grid. Do clever things to interpolate over missing and corrupted data where necessary; this might involve performing a PCA and using the PCA to patch and then re-doing PCA and so on. Then re-normalize the data so that the amplitudes of all the spectra are the same; I am being vague here because I don't know the best choice for definition of amplitude. This is all pre-conditioning for the data; in principle the recommendation here could be applied to any data set; I am just proposing the SDSS spectra.

Now search for a unit-norm (or otherwise normalized) eigenspectrum such that when you dot all pre-conditioned SDSS spectra onto the eigenspectrum, you obtain a distribution of coefficients (dot products) that has minimum kurtosis. That is, instead of finding the principal components—the components with maximum variance—we will look for the platykurtic components—the components with minimum kurtosis. If you are stoked, search the orthogonal subspace for the next-to-minimum kurtosis direction and so on.

Why, you ask? Because low-kurtosis distributions are bi-modal. Indeed, early experiments (performed by Vivi Tsalmantza (MPIA) and myself back in 2008) indicate that this will identify the eigenspectra that best separate the red sequence galaxies from the blue cloud. If you really want to go to town, invent a bimodality scalar that is better than kurtosis.

One note: Optimization is a challenge. This sure ain't convex. My approach back in the day was to throw down randomly generated spectra, choose ones that happened to hit fairly low kurtosis, and optimize locally from those.
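
Here is a sketch of exactly that: random restarts plus a local optimizer over unit-norm directions, minimizing the kurtosis of the dot products; the `spectra` array is a placeholder for the preconditioned spectra described above.

```python
# Sketch of the search described above: random restarts plus local optimization
# over unit-norm directions, minimizing the kurtosis of the dot products.
# `spectra` is a placeholder for the already preconditioned spectra.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import kurtosis

rng = np.random.default_rng(12)
# placeholder data with a bimodal direction hidden in dimension 0
spectra = rng.normal(size=(2000, 10))
spectra[:, 0] += np.where(rng.random(2000) < 0.5, -3.0, 3.0)

def proj_kurtosis(w):
    w = w / np.linalg.norm(w)
    return kurtosis(spectra @ w)          # Fisher (excess) kurtosis; minimize it

best = None
for _ in range(20):                       # random restarts; this is not convex
    w0 = rng.normal(size=spectra.shape[1])
    res = minimize(proj_kurtosis, w0 / np.linalg.norm(w0), method="Nelder-Mead",
                   options={"maxiter": 5000, "fatol": 1e-8})
    if best is None or res.fun < best.fun:
        best = res
w = best.x / np.linalg.norm(best.x)
print("minimum kurtosis found:", best.fun, "; weight on the bimodal pixel:", abs(w[0]))
```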

2012-08-22

cosmic-ray identification

Take a set of HST data from one filter and exposure time (to start; later we will generalize) that have been CR-split (meaning: two images at each pointing). Shift-and-difference these split images to confidently identify a large number of cosmic rays. Pull out 5x5 image patches centered on cosmic-ray-corrupted pixels and 5x5 image patches not centered on cosmic-ray-corrupted pixels. Use these labeled data as training data for a supervised method that finds cosmic rays in single-image (not-CR-split) data. Improve value of HST data for all and obtain enormous financial gift from NASA in thanks (well, not really).
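
Here is a sketch of the supervised step only (patch extraction and the shift-and-difference labeling are assumed already done, and the patches below are fakes): train a random forest on labeled 5x5 patches and check it by cross-validation.

```python
# Sketch of the supervised step: given 5x5 patches labeled by the CR-split
# shift-and-difference procedure, train a classifier and apply it to patches
# from single exposures. Patch extraction and labeling are assumed done; the
# patches below are fakes (sharp spikes for CRs, blurry blobs for stars).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(13)
n_each = 2000
yy, xx = np.mgrid[-2:3, -2:3]
stars = np.exp(-0.5 * (xx ** 2 + yy ** 2) / 1.2 ** 2)[None] * rng.uniform(50, 500, (n_each, 1, 1))
crs = np.zeros((n_each, 5, 5))
crs[np.arange(n_each), rng.integers(0, 5, n_each), rng.integers(0, 5, n_each)] = \
    rng.uniform(50, 500, n_each)
X = np.vstack([stars, crs]).reshape(2 * n_each, 25) + rng.normal(0, 5, (2 * n_each, 25))
y = np.repeat([0, 1], n_each)             # 0 = star/blank, 1 = cosmic ray

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```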

Notes: The most informative pixel patches will be those with faint cosmic ray pixels and those with bright stars that mimic cosmic rays. Some of this work has been started with (now graduated) NYU undergraduate Andrew Flockhart.