Homework #10: Legal & Policy Issues

Who owns research data?

When research is funded by the federal government (which dominates the field of research funding to universities in the form of grants) the university that receives the funding owns the data, not the researcher. Data can’t be copyrighted, at least in the U.S., though the articles that result are, and those copyrights are retained by the writers/researchers themselves who submit for publication, subject to the university/funder limitations. No matter the fairness of the result of the clash between student Jessica Banks and Professor Brian Hayward, the professor is legally in the right.

How could this problem of access to the research notebooks and manuals have been avoided?

Through pre-clarification of who owns what data before the project commenced.

Under what conditions should copying of data have been done?

Ideally, in an atmosphere of collegial, professional courtesy. A careerist sensitivity toward not burning bridges would also have gone a long way in this case.

Consider the Wall Street Journal Alzheimer’s research story. Who are the stakeholders in this story? What legal and ethical claims can each make with respect to data ownership and access?

The stakeholders:

  • Alzheimer’s researcher Paul Aisen
  • The University of California San Diego (UCSD)
  • Funder Eli Lilly
  • The University of Southern California (USC).

Intellectual property issues in the age of the web remain fuzzily understood, but the law as applied in this case is fairly clear: Researcher Aisen and his new university home at USC have no claim to the data. Federal grants go to institutions (like research universities) not individual researchers at those universities, although if the researcher moves, the institution has the choice to let the funding follow the researcher.  Clearly this didn’t happen with UCSD and USC.

Identify at least 3 issues which may limit your ability to share data. Then, in your blog, create a list of all potential data sharing restrictions.

  • Pre-publishing embargoes by journals.
  • National security issues.
  • Privacy and confidentiality issues involving human subjects, such as those that might trigger concerns from an Institutional Review Board.
  • Personal health data (HIPAA) and some other forms of data about individuals are protected by federal law. Student records as well.
  • Data on endangered species, sometimes.
  • Export control issues when potentially sharing data with other countries.

Homework #8: Data Citation

Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.

1. Importance: What professional norms and practices, both of individuals and of institutions or organizations, support or undermine the idea that data are legitimate and citable products of research? How?

Studies indicate that despite talking a good game about citation in general terms, scholarly journals tend not to actually cite data sets very effectively, perhaps with title in the note field, sometimes with author, but rarely location for electronic retrieval. Persistent identifiers for data sets remain very rare.

2. Credit and attribution: Credit and attribution of more traditional types of research products is an established norm and practice; is extending this practice to include data a simple and natural thing to do? Why or why not?

Evidently not, based on the sparse examples in the field. It may take some consciousness raising among the academic disciplines. Also, whether to offer co-authorship to data set creators is an emerging question as open science and the web make it easier to detach data sets from the papers they support and re-use them as independent resources, as Duke and Porter indicate.

3. Evidence: Citing literature to support claims is also an established practice; is extending this practice to include data a simple and natural thing to do? Why or why not?

Although things are changing, some data sets are embedded in PDFs or otherwise hard to access, and thus hard to cite. A tradition of data citation has yet to be broadly established, and use of persistent identifiers remains rare.

4. Unique identification: Is it always possible for a data creator to obtain a persistent identifier for their data set? Why or why not?

There are a couple of dozen services that provide “minting” of new and unique DOIs for one’s data set (like EZID in the California Digital Library system), though they aren’t free. The identifiers must also be maintained by the data owner if the resource jumps around from server to server over time, as organizations go under or otherwise undergo changes in access.

5. Access: In practice, do data citations always provide direct access to the dataset? Why or why not?

Meta-repositories like Dryad function as union catalogs for data sets, providing only metadata. Thus, the data set itself may not be accessible to every comer.

6. Persistence: In practice, do data citations (and metadata) persist beyond the lifespan of the data set? Should they? Why or why not?

It’s not an automatic process; data sets held by organizations must still be transferred if the hosting organization terminates. Also, not all data sets stand the test of time and may be considered not worth the effort of preservation. Some datasets may be superseded by newer superior versions.

7. Specificity and verifiability: Why is it important to be able to create and maintain specific and verifiable references to data sets, portions of data sets, or versions of data sets? What are some potential challenges to doing so?

Maintaining references to data sets makes their results more likely to be potentially reproducible (perhaps mitigating a major embarrassment in academic disciplines of late, with scientific studies unable to be replicated). Current technology often doesn’t allow sufficient “granularity” in the formal citation of subsets of larger data sets.

8. Interoperability and flexibility: What are some of the different stakeholder groups whose practices may influence the ability to support interoperability across citation standards and styles?

Publishers, universities, repositories, the writers/researchers themselves all have an interest.

 

What are three factors when considering whether acknowledgement, formal citation, or co-authorship is the most appropriate way to provide attribution to the creator of a data set used in a publication?

Does the data set creator want to be acknowledged in the paper? For example, if he disagrees with the conclusions of the paper’s author?

If the creator does want acknowledgement, will the journal do so? Many journals don’t have a formalized rule in citing the authors of data sets.

Is the data set vital to the paper – would it be unpublishable if the data set was removed? Is it a unique interpretation? Is it the sole source of the data of the paper? Or is the data set one of many similar used by the paper? There may be handicaps in the ability to cite the data, like a lack of a persistent identifier or lack of accessibility.

 

Data Citation

Ellison, Aaron; Bennett, Katherine (2009): Sarracenia Purpurea Prey Capture at Harvard Forest 2008. Long Term Ecological Research Network. http://dx.doi.org/10.6073/pasta/9a6105374adb15486b75cf621a2702dd

Authors (last name/first name); Publication date; Publication title (which in this case includes geographic location of project, description of purpose, and date); Organization; Digital Object Identifier.

Nepstad, D.C., E.A. Davidson, D. Markewitz, E.J.M. Carvalho, J.Q. Chambers, D. Ray, J.B. Guerrero, P. Lefebvre, L. Sternberg, M. Moreira, L. Barros, F.Y. Ishida, I. Tohlver, E.L. Belk, K. Kalif, and K. Schwalbe. 2012. LBA-ECO ND-30 Water Chemistry, Rainfall Exclusion, km 67, Tapajos National Forest. Data set. Available on-line [http://daac.ornl.gov] from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. http://dx.doi.org/10.3334/ORNLDAAC/1131

Authors (last name/first and middle name initials for first one, with the order reversed for the rest, perhaps listed in descending order of contribution?); temporal data on project; description of project; scope limiters; geographic location of project; mention of “data set”; digital location for data access; geographical location for data access; Digital Object Identifier.

 

Creating Data Citations

Data Set #1

Chaneton, E.J., Tognetti, P.M. (2014): Community disassembly and invasion of remnant native grasslands under fluctuating resource supply. Inland Pampa, Buenos Aires. Dryad Digital Repository. doi:10.5061/dryad.46181

Data Set #2

Bret-Harte, M.S., Laundre, J., Mack, M.C., Shaver, G. (2011): Soil properties and nutrient concentrations by depth from the Anaktuvuk River Fire site in 2011. Latitude 68.99 Longitude -150.28, Alaska. Advanced Cooperative Arctic Data and Information Service. https://www.aoncadis.org/dataset/2011ARF_SoilCN_byDepth.2.html
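
The pattern I followed for both citations above (authors, year, title, location, repository, identifier) can be sketched in a few lines of Python. This is only an illustration of the pattern, not an official citation style; the function name `format_data_citation` and its parameter names are my own invention.

```python
def format_data_citation(authors, year, title, location, publisher, identifier):
    """Assemble a data citation in the pattern used above:
    Authors (Year): Title. Location. Publisher. Identifier
    """
    author_part = ", ".join(authors)
    # Strip trailing periods so each field ends with exactly one period.
    fields = [title, location, publisher]
    body = ". ".join(field.strip().rstrip(".") for field in fields if field)
    return f"{author_part} ({year}): {body}. {identifier}"


# Values taken from Data Set #1 above.
citation = format_data_citation(
    authors=["Chaneton, E.J.", "Tognetti, P.M."],
    year=2014,
    title="Community disassembly and invasion of remnant native grasslands "
          "under fluctuating resource supply",
    location="Inland Pampa, Buenos Aires",
    publisher="Dryad Digital Repository",
    identifier="doi:10.5061/dryad.46181",
)
print(citation)
```

Keeping the pieces as separate fields also makes it easier to notice when one of them (usually the persistent identifier) is missing.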

Homework Activity #7: Metadata

Title of the Data Set: Effect of Temperature and Depth on Density of Conochilus unicornis and Conochilus hippocrepis in Littlevick Pond, Surrey UK in June 2010.

Time Period of Content: Date range (in YYYYMMDD format): 20100605-20100618

Keywords (theme): Zooplankton; Plankton; Rotifers; Conochilus unicornis; Conochilus hippocrepis; Population; Survey; Temperature; Density

Describe where you found the information that is needed to populate the metadata record.

For the title, I checked the slides, checked the Names file, confirmed the dates of the research, and tried to cram in all the “W” questions I could into a sentence, without the length getting out of hand or becoming overly technical and indecipherable.

For the dates, I went down the first column of dates in the pond2010 file and put them in the suggested date format YYYYMMDD.
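
I did the date conversion by hand, but it could be scripted. Here is a minimal sketch, assuming the spreadsheet dates are typed as month/day/year (I did not record the actual cell format, so `input_format` is an assumption, and the sample dates are made up to match the range above).

```python
from datetime import datetime


def to_yyyymmdd(text, input_format="%m/%d/%Y"):
    """Convert one date string to the YYYYMMDD form the metadata record wants."""
    return datetime.strptime(text, input_format).strftime("%Y%m%d")


def date_range(date_strings, input_format="%m/%d/%Y"):
    """Return 'earliest-latest' in YYYYMMDD format."""
    # YYYYMMDD strings sort correctly as plain text.
    converted = sorted(to_yyyymmdd(d, input_format) for d in date_strings)
    return f"{converted[0]}-{converted[-1]}"


# Hypothetical sample of the date column, not copied from the file.
sample_dates = ["6/5/2010", "6/18/2010", "6/7/2010", "6/9/2010"]
print(date_range(sample_dates))
```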

For the keywords, I raided the “Names” sheet in the zoop-temp-main file, while checking the already used keywords under “place” and “temporal” to avoid repetition.

 

The metadata record notes some domain types as “unrepresentable.” What might that mean?

Attributes like temperature, colony diameter, and density are listed as “unrepresentable” because, unlike depth, which can be quantified and bounded within a range of values (i.e., 0.5 to 50 cm), temperature, colony diameter, and density theoretically have no upper or lower bounds.

 

Homework 4: Data Entry & Manipulation

1) Based on what you have learned so far about data management, what are some problems in the way the data is currently organized?

Pond2010 is confusing at first because it’s been sorted by species, though the first column left to right is “date,” which repeats when switching species.

Column B is unidentified, though it most likely measures depth, and if it is depth, the unit of measurement (meters?) should be specified.

There is missing/unavailable data in rows 5 and 25 and unexplained yellow blocks instead of figures in two sections of temperature cells.

Title of the spreadsheet “Data” is uninformative.

All the temps are taken to a decimal point except “14” which appears several times. Unless the temperature was exactly 14.0 it’s probably too much for coincidence. Also, is this Celsius or Fahrenheit?

No column for “Colony Size” or “chla” (chlorophyll).

I am personally curious what Cuni and Chippo stand for and a web search provides no help, but if researchers in the discipline understand the terms then it’s probably acceptable. UPDATE: Never mind, a sheet attached to another file defines them. But why does only this file have a separate explainer?

Zoop-Temp

 The column “density” in Pond2010 is Chippo #/L and Cuni #/L here. This way might be more informative but consistency is best.

Also, the order of the columns is inconsistent with Pond2010.

I would love to know what “chla” stands for. (Update: Chlorophyll)

The data cells set off from the main table are left unclear, though they appear to be averages from the dates of the measurements (June 7 and June 9). Perhaps they would be best located on a sum line at the end of each set of dates.

The X-axis on the scatter graph for Cuni density probably signifies depth but is not labeled. Also, why only a Cuni graph and not a Chippo graph? Perhaps the graphs could be put on a separate data sheet to avoid clutter and confusion.

Row 23 is logged in red text, but this is not explained.

Zoop-Temp Main

This is a more coherent table, though the yellow block of cells without data has an undefined asterisk, and the scatter graph has problems of its own (see below).

File naming inconsistency: “Rotifers” for the file containing Station B measurements, with Station B mentioned in the header, compared to “Station A” for the Station A file.

The scatter graph has no labels on the Y-axis and no explanation for its significance is given.

2) Suggest a new system for organization.

Change the file-naming convention for consistency and clarity.

Merge the June 2011 data into a single spreadsheet. Add a Station header to the spreadsheet.

Ask the researcher which station(s) the June 2010 data represent, and recode the data accordingly for the new Station column.

 Place the scatter graphs from the Zoop pages in two separate data sheets, and label them (unless this is a Filter function that is attached to the page?).

Place the metadata (or “key”) defining heading terms in a separate file.
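
The merge-plus-Station-column idea above could look something like this in Python. It is a rough sketch: the column names (`date`, `depth_m`, `temp_c`) and the sample values are hypothetical, chosen so that units sit in the header instead of being left to guesswork.

```python
def merge_with_station(tables):
    """Combine per-station tables into one list of rows.

    `tables` maps a station label to a list of row dictionaries;
    each output row gains a leading 'station' field.
    """
    merged = []
    for station, rows in tables.items():
        for row in rows:
            merged.append({"station": station, **row})
    return merged


# Made-up rows standing in for the two June 2011 files.
station_a = [{"date": "20110607", "depth_m": 0.5, "temp_c": 15.2}]
station_b = [{"date": "20110607", "depth_m": 0.5, "temp_c": 14.8}]

merged = merge_with_station({"A": station_a, "B": station_b})
for row in merged:
    print(row)
```

With the station recorded in every row, the file name no longer has to carry that information, which is where the “Rotifers” vs. “Station A” confusion came from.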

 

3) Create a new spreadsheet that can be used as a template for later years of data.

(Send via email).

Homework #5: Data Quality Control & Assurance

1) Open the 3 files and inspect them. Plot the data to look for anomalous values, e.g. a scatter plot (x -y), or identify the maximum and minimum values in each column. These are easy ways to get a sense for where problems in the data may have been created, for example, where there have been mistakes in data entry or migrations, or problems with equipment.

Row 29 in the Zoop temp main file shows bad temperature data, as shown in the scatter graph (I noticed this one myself!). I should have caught the negative density values in Zoop temp but did not; I was looking mainly at the scatter graph, which looked OK (and didn’t extend to negative numbers, understandably).
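
A max/min check would have caught the negative densities that I missed by eyeballing the graph. Below is a minimal sketch of that check; the numbers are invented for illustration (they are not the class data), and the bounds of 0 to 1000 are my assumption about what a plausible density looks like.

```python
def summarize_column(values):
    """Return (minimum, maximum) of a column, skipping empty cells (None)."""
    present = [v for v in values if v is not None]
    return min(present), max(present)


def out_of_range(values, low, high):
    """Return (row_index, value) pairs that fall outside [low, high]."""
    return [
        (i, v)
        for i, v in enumerate(values)
        if v is not None and not (low <= v <= high)
    ]


# Invented density column with one impossible (negative) value.
density = [12.0, 8.5, -3.0, 20.1, None, 15.4]

print(summarize_column(density))                    # (-3.0, 20.1)
print(out_of_range(density, low=0.0, high=1000.0))  # [(2, -3.0)]
```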

2) Suggest a system for flagging data as anomalous.

I would not have thought of creating a separate column to put in warning codes about dubious data – good idea! For instance, “M1” could stand for missing, “E1” for an estimate.

3) Suggest a system for flagging data as missing.

Again, the “flag” column is a good idea. One could simply leave the field blank, but my first idea would have been to fill in the missing data with a letter that stood for “missing data.”  If that’s not possible, one could use an “impossible” figure, something far outside the data range that would be obviously wrong. A “key” page to explain would also be necessary.
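
Here is roughly how the flag column might work in code. “M1” and “E1” are the codes mentioned above; “A1” for anomalous is a code I made up for this sketch, as are the column names and the 0 to 40 degree bounds.

```python
MISSING = "M1"    # code mentioned above for missing values
ESTIMATE = "E1"   # code mentioned above for estimates
ANOMALOUS = "A1"  # hypothetical code, invented for this example


def flag_value(value, low, high, estimated=False):
    """Return the flag for one cell; an empty string means no warning."""
    if value is None:
        return MISSING
    if not (low <= value <= high):
        return ANOMALOUS
    if estimated:
        return ESTIMATE
    return ""


def add_flags(rows, column, low, high):
    """Return copies of the rows with a '<column>_flag' field added."""
    flagged = []
    for row in rows:
        flag = flag_value(row.get(column), low, high, row.get("estimated", False))
        flagged.append({**row, f"{column}_flag": flag})
    return flagged


# Invented rows: one clean, one missing, one impossible, one estimated.
rows = [
    {"depth_m": 0.5, "temp_c": 15.2},
    {"depth_m": 1.0, "temp_c": None},
    {"depth_m": 1.5, "temp_c": 140.0},
    {"depth_m": 2.0, "temp_c": 14.0, "estimated": True},
]
for row in add_flags(rows, "temp_c", low=0.0, high=40.0):
    print(row)
```

The data cell itself stays empty (or keeps its original value), so nobody downstream mistakes a placeholder for a measurement; the flag column and the “key” page carry the explanation.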

4) Suggest a system for data entry that can be used in the future to prevent data entry errors.

I like the idea of employing drop-down menus whenever possible, to cut down on data entry errors by severely limiting what is allowable data in a particular field. Maybe format cells to limit them to specific entries, e.g. numerical only. One could also have two people enter data in the cells independently, and then compare their work. If the two entries for a cell don’t match, then you have an error.
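
Both ideas, the restricted list and the double entry, are simple enough to sketch. The function names here are mine, the temperature values are invented, and the allowed list just reuses the two species abbreviations from the class files.

```python
ALLOWED_SPECIES = {"Cuni", "Chippo"}  # abbreviations used in the class files


def check_species(value):
    """Mimic a drop-down menu: accept only values from a fixed list."""
    if value not in ALLOWED_SPECIES:
        raise ValueError(f"{value!r} is not one of {sorted(ALLOWED_SPECIES)}")
    return value


def double_entry_mismatches(first_pass, second_pass):
    """Return the positions where two independent entry passes disagree."""
    if len(first_pass) != len(second_pass):
        raise ValueError("both passes must cover the same cells")
    return [i for i, (a, b) in enumerate(zip(first_pass, second_pass)) if a != b]


# Invented example: the second typist transposed two digits in cell 1.
typist_one = [14.2, 14.0, 13.8, 13.5]
typist_two = [14.2, 41.0, 13.8, 13.5]
print(double_entry_mismatches(typist_one, typist_two))  # [1]
```

A mismatch only says that one of the two entries is wrong, not which one, so someone still has to go back to the field notebook.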

LS590: Chapter 4 Summary

A raw-text summary of “Practices Do Not Make Perfect — Disciplinary Data Sharing and Reuse Practices and Their Implications for Repository Data Curation,” by Faniel and Yakel.

What is disciplinary data sharing? It’s how various disciplines, or academic subjects, share and reuse data within their respective fields.

The chapter is based on the Dissemination Information Packages for Information Reuse (DIPIR) project, which studies how data can best be created and preserved to support and encourage reuse.

It was an informal process until pretty recently. Now, there are expectations in academic disciplines to formally share data, via deposit into a repository. Federal agencies have developed policies to increase public access to federally funded research outputs. But the devil’s in the details; there has been limited guidance about what to share and how to share it.

The DIPIR project investigated data sharing and reuse practices within three academic communities: quantitative social science (or just plain old social science), archaeology, and zoology.

This chapter is a kind of overview of data sharing and data reuse practices in those three disciplines. It is not as critical as the headline suggests (I think maybe they couldn’t resist the cute headline), but they do hint that one of the three disciplines is weaker than the others in data sharing and reuse, and has more work to do than the other two. I will leave you in breathless anticipation before I reveal the discipline.

The researchers focused on:

  • Disciplinary practices and traditions surrounding data sharing
  • The trust that those in the discipline have in the data they seek to reuse
  • Contextual information, in addition to the repository.

To perform the overview, the researchers interviewed staff and data reusers, and examined server logs, performed surveys, and watched zoologists interact with physical specimens.

This chapter underlines the usefulness of data sharing, why we do this in the first place. For instance, in archaeology, by reusing data from multiple sites in quantities larger than any one person could collect in a lifetime, one could examine regional social, economic, and cultural transitions between ancient civilizations. Zoologists could reuse data to address questions about extinction or migration events, and social scientists were integrating government and academic research data to study household economic trends over time.

Social science seems to be the discipline best set up for data sharing, with a half-century of data sharing and reuse behind it. Some examples are public opinion polls and social indicators. The federal government produces a ton of data. It’s a culture in the social sciences; by now they’ve got their best practices down pat. They’ve had to deal with people’s privacy issues in a way the other disciplines don’t. Also, the social sciences generally use just a select few data formats, which makes data conversion simpler.

Archaeology is trickier. For one, they’ve tightened up ethics and rules in that discipline in recent times. You can’t remove cultural property from its country of origin. There are more mandates: artifacts must be documented on site, and you can’t schlep them away and document them offsite. And publishers are more reluctant to shell out for big appendices and listings of artifacts and measurements. Sharing and reuse are hard in the field, because excavation destroys the context of field sites. In addition, the discipline lacks common data recording practices and consistent standards, which harms networking. Archaeology comes off as the weakest field as far as data sharing and reuse.

Zoology was also seen as strong in data sharing and reuse infrastructure. The data has moved from observational/taxonomic in the past to DNA today. Museums have collection managers to assist in curation. Darwin Core is mentioned as part of the successful standardization of metadata in the field, which has enabled a rich array of interconnected repositories with different metadata representations of the same specimen at various levels of granularity.

Data Reuse and Trust factors

A sense of trust for researchers toward the repository is important when encouraging reuse: trust both in the data and in the repository, whether or not it’s responsible and reliable. The researchers considered a variety of factors, which they called trust markers. These included knowing the identity of the producer of the data, and documentation (making sure data collection is systematic). They found differences among the disciplines: Prior reuse of the data was important for Social Science, but not necessarily for Archaeology or Zoology. And the reputation of the repository was important for Social Science and Zoology, though Archaeology didn’t care.

Reusers use different types of contextual information about data during the data reuse process, and they use a variety of sources to get it, like peer-reviewed publications, repository or museum records, data producer-generated records, people (Zoology), documentation (Social Science), codebooks (Social Science), and specimens/artifacts. Codebooks are more of a Social Science thing, not really a thing for Archaeology, which mostly used maps, drawings, and tables instead.

Archaeologists typically used data producer publications to discover and access data and additional contextual information, since sharing and reusing archaeological data were relatively new phenomena. Zoologists use physical specimens more than archaeologists use physical artifacts. Archaeology has fewer dedicated museum and repository staff. They concluded that archaeologists and zoologists relied on original data production, and social scientists on data reuse (how often the data was being reused). Social Science has a mature culture of data sharing, so that there was a kind of data set peer review.

The authors recommended that staff examine data producers’ practices during archaeological excavations to provide guidance on recording and managing data in ways that make things easier for the repository staff down the road and support efficient reuse of the resources.

As I noted, there were interesting differences among the disciplines, especially between Archaeology and Zoology. Social Science and Zoology relied heavily on dedicated repositories and museums. Not true for Archaeology. The authors recommended that repository staff center themselves in their designated community of users, embedding to understand their needs and align them with repository and curation activities.

Hands-on #2: Dataset Search

For this exercise, I looked up data on “Fisheries” and “climate change.” The “age of the universe” question from Exercise #1 wasn’t a good fit for these databases, which weighed toward the natural sciences.

Data Source #1: DataCite

Dataset citation: A citation was generated at the bottom of the page: re3data.org: ERDDAP; editing status 2018-01-26; re3data.org – Registry of Research Data Repositories. http://doi.org/10.17616/R35926 last accessed: 2018-06-08

Data Description: Title of data set: “California Fish Market Catch Landings, Long List, 1928-2002, Monthly.” One had the option of rendering this data set (and others listed) in graph or table form, with the options listed as links in the spreadsheet. ERDDAP has a good information page describing the methodology of data collection, under the “Background” link.

Data Source repository: DataCite’s Registry has a clean graphical interface but the entire database seems to have only 2095 articles, which is a bit sparse. One can browse by subject, content type, or country. No search by title, only keyword. Search results come with tags based on those same browse topics. There is also post-search filtering available (faceting) by Subject, Content Type, Country of origin, depository type, and other options.

Method used to locate data set: The search of the DataCite registry on “fisheries” led me offsite to ERDDAP, the Environmental Research Division’s Data Access Program. At ERDDAP, the repository itself, “fisheries climate change” again produced no results, so I stayed with “fisheries,” which opened a spreadsheet of 46 related datasets. Many were incomprehensible, so I rooted out something somewhat explicable to me, “California Fish Market Catch Landings, Long List, 1928-2002, Monthly.”

How to access the data: DataCite used icons to indicate if a particular repository it worked with offered: open access; restricted access; closed access to its data. All data sets I encountered in ERDDAP itself were listed as publicly accessible.

 

Data Source #2: DataOne

Dataset Citation: It has an easy, click-to-copy citation function: Department of Fisheries and Wildlife, Michigan State University, Peter C. Esselman, Dana M. Infante, Lizhu Wang, William W. Taylor, et al. 2011. National Fish Habitat Action Plan (NFHAP) 2010 HCI Scores and Human Disturbance Data (linked to NHDPLUSV1) for Florida. USGS Science Data Catalog. 2874b337-ee23-4581-b557-c90fbaee52e8.

Data description: Title of Data Set: “National Fish Habitat Action Plan (NFHAP) 2010 HCI Scores and Human Disturbance Data (linked to NHDPLUSV1) for Florida.”

Data Source repository: Data One’s Data Search has a cool world map basically dividing the world into quadrants, or boxes, and counting up the data points available in each. There is no Advanced Search available. There are eight filtering facets available, including data attribute, creator, year. Summary of holdings show almost 800K data sets. However, I could not find any information about access level to the data, whether open or restricted.

Method used to locate data set: I used keyword “Fisheries,” and narrowed it down via the map-based “Location” facet to Florida. The metadata gave me an abstract, keywords, geographic info, people names, and more. That was the only access point I could find for the web url itself, which brought me to the U.S. Geological Survey, and a “shapefile,” (a term new to me) which “contains landscape factors representing human disturbances summarized to local and network catchments of river reaches for the state of Florida. This dataset is the result of clipping the feature class ‘NFHAP 2010 HCI Scores and Human Disturbance Data for the Conterminous United States linked to NHDPLUSV1.gdb’ to the state boundary of Florida.”

How to access the data: In the metadata section termed “Access Control,” there is a link to “read permission,” which was listed as Public for this and every other data set that came up in the search. I found this clearinghouse/repository combination the hardest of the three to navigate.

 

Data Source #3: Data Dryad

Dataset Citation: Citation information was generated automatically and prominently displayed with the search results as shown below:

When using this data, please cite the original publication:

Pauly D, Zeller D (2016) Catch reconstructions reveal that global marine fisheries catches are higher than reported and declining. Nature Communications 7: 10244. https://doi.org/10.1038/ncomms10244

Additionally, please cite the Dryad data package:

Pauly D, Zeller D (2016) Data from: Catch reconstructions reveal that global marine fisheries catches are higher than reported and declining. Dryad Digital Repository. https://doi.org/10.5061/dryad.4s4t1

Data description: Title of data set: “Catch reconstructions reveal that global marine fisheries catches are higher than reported and declining.” The basic metadata page included spatial, temporal, keywords, and Abstract. (Incidentally, the option for the full metadata display showed that Dryad uses the Dublin Core schema.) The underlying article itself, published at Nature, under the “Methods” section, noted the difficult nature of dealing with missing or unreliable fish-catch data. Instead of giving a “zero” for the number of fish caught, they made a rough estimate and explained their reasoning.

Data source repository: The Dryad Digital Repository had several post-search filters available, including subject and publication name.

Method used to locate data set: Simple keyword search on “Fisheries.” Advanced search is available, although the bucket of advanced search terms available in Dryad were too arcane to be of use to a tyro like me.

How to access the data: The package also contained a “Files in This Item” section that led to an Excel spreadsheet of three separate files, showing various figures for fish caught by years via Industrial, Commercial, Recreational fishing, etc.

Access was public, as shown by this boilerplate language used to describe the files in the package: “To the extent possible under law, the authors have waived all copyright and related or neighboring rights to this data.” A link to Open Definition also indicated open public access. However, Dryad also gives journals the option of making data privately available during peer review and lets submitters set limited-term embargoes post-publication.

Hands-on #1: Accessing Data in the Literature

I went to the UA library physics databases to answer the question, “What is the age of the universe?” I accessed a couple of the “Best Bet” suggested databases: The American Institute of Physics; ScienceDirect; SCOAP3.

Through those repositories I discovered these articles:

1) Citation: Yu, H., Wang, F. Y. “Reconciling the cosmic age problem in the Rh=ct universe.” EPJC 74 (2014) 3090. https://repo.scoap3.org/record/4205/files/main.pdf?subformat=pdfa

Data Type/Method of Data Access: Standard tables and scatter graphs embedded within the article. No way to “limit” for data availability. No data availability statement provided. No contact info for authors.

 

2) Citation: “Precious fossils of the infant universe.” Physics Today 65, 4, 49 (2012) https://physicstoday-scitation-org.libdata.lib.ua.edu/doi/full/10.1063/PT.3.1519

Data Type/Method of Data Access: A scatter graph and a supercomputer simulation of a supernova explosion were embedded in the article. No way to “limit” for data availability. No data availability statement provided. Contact info provided for one of the authors.

 

3) Citation: “The new model of the Big Bang and the Universe expansion. A comparison with modern observational data and cosmological theories.” A. N. Kraik and Kh. F. Valiyev.  AIP Conference Proceedings 2016 1770:1. https://aip-scitation-org.libdata.lib.ua.edu/doi/abs/10.1063/1.4963925

Data Type/Method of Data Access: Figures and equations embedded in the paper. No way to “limit” for data availability. No data availability statement provided. No contact information for authors.

 

4) Citation: “A Guided Inquiry on Hubble Plots and the Big Bang.” The Physics Teacher 52, 199 (2014); https://aapt-scitation-org.libdata.lib.ua.edu/doi/full/10.1119/1.4868929

Data Type/Method of Data Access:  x/y graphs and “slope of best fit” scatter graphs embedded in the paper. Also contained a separate link for “Figures,” which gathered those graphs separately. No option to “limit” for data availability. No data availability statement provided. Contact information was provided for the author.

 

5) Citation: “Computing accurate age and distance factors in cosmology.” American Journal of Physics 80, 367 (2012). https://aapt-scitation-org.libdata.lib.ua.edu/doi/full/10.1119/1.3698352

Data Type/Method of Data Access: The usual scatter plot graph and standard tables, but also: Instructions for student projects were made available as supplementary materials in the footnotes, and also as a link at the top under “Supplemental,” linking to an example class project for students who wanted to try and measure the age of the universe.

The paper also used data from, and recommended in the footnotes, Galaxy Zoo, a repository of galaxy images that relies on astronomer contributions, I believe? https://www.zooniverse.org/projects/zookeeper/galaxy-zoo/about/research

No way to “limit” for data availability. Contact information was provided for one of the authors.

 

6) Bonus: Feeling a bit demoralized with the lack of data sets, I acted on a hunch on the PLOS database based on class readings, and found an article about random sound sequences in another science database that was more robust. Perhaps tellingly, it was also recent, from 2018:

Citation: Skerritt-Davis B, Elhilali M (2018) “Detecting change in stochastic sound sequences.” PLoS Comput Biol 14(5): e1006162. https://doi.org/10.1371/journal.pcbi.1006162

It included a data availability statement, reproduced below:

Data Availability: Model code is available online at https://engineering.jhu.edu/lcap/index.php?id=software Experimental data is available at https://engineering.jhu.edu/lcap/index.php?id=research.

It also included an actual sound file under the link “Supporting Information.” It sounds a bit like Kraftwerk. http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006162#sec041

 

I ran the same search in the AIP database, but with a date limiter of before 2010. None of the top five articles from the search contained data availability statements.

How They Do It in Texas: a brief look at Dublin Core applied to a Texas Tech vs. Texas football game action photo

[Photo: TT_vs_T] Photo courtesy of the Southwest Collection/Special Collections Library

Here’s a quick-hit survey of a college football database (Texas Tech Red Raiders, of Lubbock) which uses Dublin Core as its metadata schema. The photo above is titled [Football action shot from the TTU vs Texas game 1981-12]

Everything makes sense to me in the Title, save the “-12” part. It was the 7th game of the season, played on October 31, 1981, so it doesn’t seem to be a game number or a date. 12th photo taken from that game?

It’s interesting that it has three different data entries for the date field, with jargonish subtitles:

dc.date.accessioned 2014-09-11T18:52:12Z
dc.date.available 2014-09-11T18:52:12Z
dc.date.issued 1981-10-31

What could those mean? Via the Title, we notice that dc.date.issued is synonymous with the day of the game — a photo from the game. So perhaps issued means something like “creation date” here.

The other two may need some pondering, though one could guess 2014-09-11 is the day the photo was either digitized or put up on the web.
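
One thing that can be checked mechanically is the shape of those three values. The sketch below parses them with Python’s standard library; it shows only that two are full UTC timestamps and one is a plain calendar date, and it does not settle what the repository means by each qualifier. The helper name `parse_dc_date` is my own.

```python
from datetime import date, datetime, timezone

# Values copied from the record above.
record = {
    "dc.date.accessioned": "2014-09-11T18:52:12Z",
    "dc.date.available": "2014-09-11T18:52:12Z",
    "dc.date.issued": "1981-10-31",
}


def parse_dc_date(text):
    """Parse either a full UTC timestamp or a plain calendar date."""
    if text.endswith("Z"):
        stamp = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
        return stamp.replace(tzinfo=timezone.utc)
    return date.fromisoformat(text)


parsed = {field: parse_dc_date(value) for field, value in record.items()}

same_moment = parsed["dc.date.accessioned"] == parsed["dc.date.available"]
gap_years = parsed["dc.date.accessioned"].year - parsed["dc.date.issued"].year
print(same_moment, gap_years)
```

The two timestamps are identical down to the second, which fits the guess that they were generated by a system when the item went online, while the 1981 date was entered by a person.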

Also interesting that for “Subject,” they used Library of Congress style subject headings. Or are these official LC Subject Headings? My search of the Library of Congress website was unsuccessful in determining that.

dc.subject.lcsh Football--Texas--Lubbock--Photographs.
dc.subject.lcsh Texas Tech University--Football.
dc.subject.lcsh University of Texas at Austin--Football.

“Coverage” (the element I am “covering”) was absent, and they could have used it here, because one of the defects of this entry is a lack of geographic location – where the game took place. That’s a fairly big deal when dealing with a competitive tribal sport like this.

One can infer the game took place on the University of Texas home field (and can confirm that via several historical football scores websites online) because the Title includes the phrase “TTU vs Texas,” and knowing that by sports-writing custom, the home team is listed last. But that assumes prior experience with the subject.

Also (smaller defect), the name of the stadium at the time, Texas Memorial Stadium, is absent. (It is now called Darrell K. Royal — Texas Memorial Stadium.) Those facts were confirmed here, after a search on a website that may not be universally beloved among library studies but which rhymes with IckyPedia.