This is a cross-post from Colper Science. The original post is at http://blog.colperscience.com/2018/02/16/lariviere-talk/.
McGill celebrated Open Access Week International (October 23-29, 2017) with events designed to raise awareness of newer aspects of the scholarly communication lifecycle. Vincent Larivière, Canada Research Chair in the Transformations of Scholarly Communication and associate professor of information science at the École de bibliothéconomie et des sciences de l’information (Université de Montréal), gave a talk entitled “Scholarly communication and open access: what researchers should know”.
The complete presentation is available here and we recommend that you take a look at the slides while listening to the first half of this episode, which is the first section of Vincent Larivière’s presentation. We will discover how publishing in research became what it is today and hear an explanation of how having five big companies dominate this field affects researchers.
In the second part of this episode, we will present a short interview we had with Vincent Larivière about an interesting event which happened here, at Université de Montréal, during the summer of 2017: the university’s library refused to renew one of the big deals it had with Taylor and Francis Group and had to go through tough negotiations with the publisher. Vincent Larivière was part of these negotiations and will tell us a little bit more about the role he played there.
A quick post to share, again, an experiment I did a few months ago. This is a very “basic” test in which I pull on a PMMA “dogbone” until it breaks. The specimen is pulled at a constant loading rate of 2000 N per minute. The stress-strain plot is also presented here. An interesting fact to notice is that the DIC method was able to capture the difference between elastic and plastic deformation at the very end of the test (the very last point of the plot).
If you would like to learn more about the DIC method, please check one of my previous posts about it, like this one.
These tests were of interest to some fellow researchers, but I had no plan to publish them. I was still interested in sharing these results and getting some “credit” for them (“credit”, in research, is usually understood as citations). This is, I believe, a pretty big issue for experimental researchers, as only publications seem to matter in research; the Open Data for Experimental Mechanics project is actually focused on this issue. A solution for experimental researchers is to upload their dataset to the Zenodo platform. A DOI is then attributed to the dataset and the experimental results can be cited by other researchers, even if there is no paper attached to the experimental results.
A quick post to share an experiment I did back in 2016 in cooperation with Qinghua Wu. Qinghua’s research is mainly about 3D-printing chitosan, a natural polysaccharide which is the structural element in the exoskeleton of shrimps and other crustaceans. Chitosan has several very nice properties: it is completely natural, it is biocompatible and it is an antibacterial agent. However, it has the big disadvantage of only being able to sustain low loads, especially when it is hydrated. Qinghua is also able to 3D print films of this polymer using the solvent-cast 3D-printing method, so we decided to investigate how a crack grows in this material.
We took a rectangular film of chitosan and covered it with a mist of black paint dots using an air gun. We then precracked it using a razor blade. The material was then inserted in a tensile testing machine and a stereo camera rig was set up in front of the machine.
The black paint dots were used as a speckle pattern for the DIC algorithm after the test. It was thus possible to obtain the displacement and strain fields while the crack was growing.
This is a cross-post from opendataexpmechanics.github.io. The original post is at opendataexpmechanics.github.io. The purpose of this post is to introduce a new open source project I will be working on during the next year with Patrick Diehl.
Context and background
Patrick and I, Ilyass, met in March 2015 through Twitter. We started exchanging ideas about Peridynamics, a novel theory for modelling materials in mechanics. I was, and still am, extremely interested in that theory, as it appeared to be one upon which a model could be built to simulate some complex experiments I was working on. We started working around this theory: I went to Patrick’s lab in Germany as an invited researcher, and Patrick is now a fellow in the laboratory I am part of at Polytechnique Montreal.
In our everyday work, I mostly do experimental work: I set up experiments involving several materials being pulled and broken apart while measuring displacements and loads. Patrick, on the other hand, mostly focuses on programming and implementing models to perform the best possible simulations of the experiments I set up in the laboratory. It is through this workflow that we started realizing that other publications about similar experiments did not contain enough information to completely model the experiment or reproduce it. The complete raw experimental data is usually not available and, when it is possible to get the raw data, it is quite hard to work with since there is no standardized format to present that data. The experiments we do in the laboratory also started to appear less and less valuable: the only thing that appeared to be really valuable was the publications which could be written out of the experimental data.
A few weeks ago, we heard of the Mozilla Science Mini-Grants and decided to try to use this opportunity to set up a project addressing these issues. We worked hard to be able to send a grant application to the Mozilla Foundation and decided to share some of the answers we sent to the foundation in this blog, which will also be used to track the project’s progress.
The availability of experimental results is limited because they are scattered across publications. Access to experimental data could help mechanical engineers and computational engineers improve their research. In publications, details necessary to reproduce the experiment or design the simulation for a benchmark are often missing. A platform is therefore needed to share experimental datasets in mechanics and to rate them with respect to reproducibility and the quality of the experimental setup, so that they can serve as benchmarks for simulation results. Both communities could thus enhance their understanding of material behavior and speed up their research.
First project description
Accessible, reliable and fully described experimental data is critically lacking in the materials/mechanics community to validate accurate predictive models. Our project proposes a platform for researchers to present their experimental results as standardized datasets. Experimental researchers can then obtain a DOI for their datasets, making them citable by others.
Coming up with a solution
We used the Open Canvas designed by the Mozilla Foundation to try to clearly define the project we would work on. The canvas helps link a product, which solves a problem, to users and contributors. Contributors are critical for any open source project, which is why they are included in this canvas. The canvas summarizes the whole project’s purpose.
Experimental engineers will benefit from sharing their results by getting citations. It could be a motivation to provide their data on our platform. Computational engineers could use this data as benchmarks for their simulations and rating the data could improve the quality and make it more valuable for the community.
Description of the platform
The current form we have in mind for the project is a web platform. The platform itself will be a repository for datasets for experimental mechanics and materials. Users will be able to upload a PDF document clearly explaining the experiment and the data format. The experiment’s raw data itself will either be stored on our servers, if the total size is not too large, or stored on a university’s servers and linked from the platform. Other users will be able to log in and search through the available datasets by category, kind of test, material and other classifiers to be determined later. The data can be downloaded directly by the user.
- Phase 1 - Initial platform development
During the next months, we will first work on developing an initial version of the platform. Once we have a first functional version, we will move to the second phase.
- Phase 2 - Platform/repository launch
The platform’s code will be made public on GitHub in our repository. The platform will also be officially launched. At that point, we will start looking for contributors to help us develop the platform and include new features. We will start advertising the platform to try to get more experimental results available for the users.
- Phase 3 - Sustaining the platform and advertising it
During the last phase, we will keep working on the previously mentioned activities and will also start focusing on means to make the platform sustainable. We will also monitor the platform to measure and quantify its outputs and see whether we are reaching the objectives we defined at the beginning of the project.
Colper Science is a bi-weekly podcast about Open Science and its methods. Each episode is an interview with someone somehow related to Open Science. We believe that it is possible today for researchers to fully migrate to the world of Open Science using tools and methods already available out there, but most of these tools, methods and possibilities remain unknown to most of the research community. Colper Science’s purpose is to let everybody know about these tools by sharing success stories around Open Science.
Listen to Episode 01:
I manufactured a standard dogbone specimen similar to the one shown in the picture below sometime last year. Instead of using a plain monofiber, I decided to embed a Carbon Fiber (CF) bundle in it. I found some CF in our laboratory at Polytechnique and worked on a method to manufacture the specimen I had in mind.
The setup used here is the same as the one presented in the post Optical microscope Digital Image Correlation. In the GIF clip above, the specimen is shown first, and the bundle of carbon fiber going through it is indicated with a metallic ruler I am holding. The next scene shows the microtensile testing rig installed under the Olympus confocal laser microscope, with a specimen being tested. In the final scene, the computer screen connected to the microscope is shown. It is possible to see a PTFE fiber in an epoxy matrix while interfacial debonding is starting to happen. That was the specimen I was testing at the moment I shot this clip.
The specimen was loaded in the microtensile testing machine and the whole rig was put under the confocal laser microscope, after which the test was started. This kind of test takes about 8 hours: the specimen is loaded in displacement by pulling with the microtensile testing machine in steps smaller than \(100 \mu m\), the machine is then stopped and a picture is snapped with the confocal laser (a single picture takes about 3 minutes because of the confocal scanning process). The specimen is then pulled again and these steps are repeated until a crack is observed. The purpose of this test is to observe crack initiation and propagation inside a bundle of CF. For this specimen, I stopped the test when the crack was large enough to cover the whole field of view, but before it broke the specimen into two separate parts. I then used some photoelastic film again (similar to the one presented in this post) and glued it on both sides of the specimen. The photoelastic film reveals the strain field within a polymer: the more fringes are visible in an area, the higher the strain is there (more explanations were provided in this post).
The result reveals the residual stresses remaining in the specimen after this test. It is possible to see how the CF bundle affected the strain field in its vicinity. The final purpose of this experiment is to somehow come up with a method to perform DIC on the images obtained using the confocal laser microscope.
These experiments were done with the help of Damien Texier at the École de technologie supérieure, Montréal.
Lately, I have been exploring methods to extract large amounts of data from scientific publications as part of my work with Kambiz Chizari for the Colper Science project, the Mozilla Open Science Labs 2017 and some future work we intend to do with the Nalyze team.
In this post, I explore 3 packages offered by the Content Mine group: getpapers, ami and norma. These 3 packages should allow us to download large sets of papers about a certain subject, normalize the obtained data to better explore it and then start analyzing using basic tools such as word counts and regular expressions.
The first step consists in getting the scientific papers. To do so, we need to get started with getpapers. You can get getpapers in one of the ContentMine organization repositories: clone the repo, cd into the folder and use:
sudo npm install --global getpapers
to install the package. The repo describes the package as follows: “getpapers can fetch article metadata, fulltexts (PDF or XML), and supplementary materials. It’s designed for use in content mining, but you may find it useful for quickly acquiring large numbers of papers for reading, or for bibliometrics.” (from the repo)
We are going to try to investigate Polycaprolactone (PCL) and FDM 3D printing. PCL is an interesting polymer because it is biodegradable and has a very low melting temperature (60 \(^\circ\)C), which means that it can easily be remodeled by hand, simply by pouring hot water on it. The subject I am interested in here is FDM 3D printing of PCL, mostly for biomedical applications.
We are first going to try using EuropePMC, a repository of open access scientific data (books, articles, patents…).
EuropePMC is attached to PubMed, and I quickly realized that it seems to mostly host papers about medical and biomedical applications (that is why our research will be focused on biomedical applications).
The first thing we need to do is query EuropePMC to obtain a list of papers which contain the words in our query. Since we would like to have the papers about 3D printing and PCL, these are going to be the words in our query. The default API used by getpapers is EuropePMC; it is also possible to use other APIs.
The query we will try will be:
ilyass@ilyass-PC:~/Repositories/getpapers$ getpapers -q 'PCL 3D print' -n -o '3DPCL'
- -n runs the query in no-execute mode, which means that nothing will actually be downloaded, but the number of results found will be returned
- -q is the query itself
- -o is the output folder for the query
ilyass@ilyass-PC:~/Repositories/getpapers$ getpapers -q 'PCL 3D print' -n -o '3DPCL'
info: Searching using eupmc API
info: Running in no-execute mode, so nothing will be downloaded
info: Found 57 open access results
Now, we simply need to use the same command again, without the -n flag, to download the results. It is necessary to add the -x flag, which will download the full text article in the .XML structured format (we need it to analyze the data later). It is possible to add -p to the command to automatically download the PDF files for the request. When I tried it for this request, 8 papers among the 57 found had no PDF files available.
ilyass@ilyass-PC:~/Repositories/getpapers$ getpapers -q 'PCL 3D print' -o '3DPCL' -p -x
(It might take a while depending on how many papers your request yields)
Take a look at the data:
ilyass@ilyass-ThinkPad-X1:~/Repositories/getpapers/3DPCL$ tree
3DPCL/
├── eupmc_fulltext_html_urls.txt
├── eupmc_results.json
├── PMC2657346
│   ├── eupmc_result.json
│   └── fulltext.pdf
├── PMC2935622
│   ├── eupmc_result.json
│   └── fulltext.pdf
├── PMC3002806
│   └── eupmc_result.json
...
- eupmc_fulltext_html_urls.txt contains the list of URLs for all articles
- eupmc_results.json contains the result from the API; it is the best place to start exploring the data. Each paper is a JSON object, with an author, abstract, etc…
- Then, there is a folder for each paper, and each folder contains a eupmc_result.json, which is basically that paper's JSON object from the master eupmc_results.json file. It will also contain fulltext.xml if you used the -x flag.
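A quick way to check what was actually downloaded is to walk the project folder with a few lines of Python. The following is only a minimal sketch: it assumes the folder layout shown in the tree listing (one sub-folder per paper, files named fulltext.xml and fulltext.pdf), and the function name summarize_project is made up for this example.

```python
from pathlib import Path


def summarize_project(project_dir):
    """Count which files each paper folder of a getpapers output folder holds."""
    counts = {"papers": 0, "fulltext.xml": 0, "fulltext.pdf": 0}
    missing_pdf = []
    for paper_dir in sorted(Path(project_dir).iterdir()):
        # Paper folders are the sub-directories (PMC2657346, ...);
        # the top-level .json and .txt files are skipped.
        if not paper_dir.is_dir():
            continue
        counts["papers"] += 1
        for name in ("fulltext.xml", "fulltext.pdf"):
            if (paper_dir / name).is_file():
                counts[name] += 1
        if not (paper_dir / "fulltext.pdf").is_file():
            missing_pdf.append(paper_dir.name)
    return counts, missing_pdf


if __name__ == "__main__":
    project = Path("3DPCL")  # the output folder used in this post
    if project.is_dir():
        counts, missing_pdf = summarize_project(project)
        print(counts)
        print("No PDF for:", missing_pdf)
```

The list of folders without a PDF makes it easy to see which papers have to be fetched by hand.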
Now that we have the data, we need to normalize it to ScholarlyHTML. Most APIs (whether it is PubMed, EuropePMC or others) will return data in different structured formats. ScholarlyHTML is a common format designed to explore data from any of these APIs.
To convert the data we have, we need norma. In order to install it, head to the releases on the GitHub repository and install the .deb package on your Linux machine using:
sudo dpkg -i <norma.deb file>
Then use it with:
ilyass@ilyass-PC:~/Repositories/getpapers$ norma --project 3DPCL -i fulltext.xml -o scholarly.html --transform nlm2html
- -i provides the name of the input file to look for in each folder of the project folder
- -o is the desired name for the ScholarlyHTML output
Now if you take a look at your data again:
3DPCL
├── eupmc_fulltext_html_urls.txt
├── eupmc_results.json
├── PMC2657346
│   ├── eupmc_result.json
│   ├── fulltext.pdf
│   ├── fulltext.xml
│   └── scholarly.html
├── PMC2935622
│   ├── eupmc_result.json
│   ├── fulltext.pdf
│   ├── fulltext.xml
│   └── scholarly.html
├── PMC3002806
│   └── eupmc_result.json
...
There is a scholarly.html file in each folder where there is also a fulltext.xml. The scholarly.html file is a document you can open and explore with your web browser.
The next step is to analyze the data in a way that makes our future reading faster and more efficient, and helps us find the papers we should really read thoroughly among all the ones we found. To do so, we are going to try the ami package. It is a collection of plugins designed to extract specific pieces of information called facts. Currently, ami plugins appear to be optimized to extract facts about genes, proteins, agronomy, chemical species, phylogenetics and some diseases. There are no plugins for engineering or materials science (yet), so we will use two basic plugins for now to try to get some insights about our data: word frequencies and regular expressions (regex).
But first, install it by downloading the latest .deb package release here and installing it using sudo dpkg -i <.deb file>.
Now let’s run a basic word frequency plugin from ami on our data:
ilyass@ilyass-PC:~/Repositories/getpapers$ ami2-word --project 3DPCL -i scholarly.html --w.words wordFrequencies
If you look at your data again:
3DPCL/
├── eupmc_fulltext_html_urls.txt
├── eupmc_results.json
├── PMC2657346
│   ├── eupmc_result.json
│   ├── fulltext.pdf
│   ├── fulltext.xml
│   ├── results
│   │   └── word
│   │       └── frequencies
│   │           ├── results.html
│   │           └── results.xml
│   └── scholarly.html
...
Each paper folder now contains a results folder with two files:
- a results.html file, which shows the frequency of each word in the article as a word cloud (the size of a word depends on its frequency)
- a results.xml file, which lists the number of occurrences of each word:
<?xml version="1.0" encoding="UTF-8"?>
<results title="frequencies">
 <result title="frequency" word="and" count="370"/>
 <result title="frequency" word="the" count="278"/>
 <result title="frequency" word="for" count="106"/>
 <result title="frequency" word="printing" count="97"/>
 <result title="frequency" word="tissue" count="83"/>
...
The problem, as you might have noticed, is that common words (such as the) are skewing our results. We can get rid of them thanks to a stopwords list. I wrote a very simple Python script which writes the stopwords from nltk/corpora to a text file which can then be used by ami. The script can be found here, and its result, the text file containing all the stopwords, can be downloaded directly here.
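For readers who would rather build the list themselves, the idea can be sketched as follows. This is not the script linked above, only a minimal illustration: it assumes that one word per line is an acceptable format for the stopwords file, and it needs the nltk package with its stopwords corpus installed (nltk.download("stopwords")). The function names are made up for this example.

```python
from pathlib import Path


def write_stopwords(words, path):
    """Write one stopword per line (assumed format); return how many were written."""
    unique = sorted(set(w.strip() for w in words if w.strip()))
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return len(unique)


def nltk_english_stopwords():
    """Return NLTK's English stopword list (needs nltk and its 'stopwords' corpus)."""
    from nltk.corpus import stopwords  # imported lazily: optional dependency
    return stopwords.words("english")


if __name__ == "__main__":
    try:
        words = nltk_english_stopwords()
    except (ImportError, LookupError):
        # nltk or its corpus is missing; run nltk.download("stopwords") first.
        words = ["the", "and", "for"]  # tiny placeholder list
    print(write_stopwords(words, "stopwords.txt"), "stopwords written")
```

Domain-specific words that carry no information for a given search can simply be appended to the list before writing it.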
We can launch the word counter plugin again, this time with the stopwords:
ilyass@ilyass-PC:~/Repositories/getpapers$ ami2-word --project 3DPCL -i scholarly.html --w.words wordFrequencies --w.stopwords stopwords.txt
The results obtained this time are more interesting since they only contain relevant words:
<?xml version="1.0" encoding="UTF-8"?>
<results title="frequencies">
 <result title="frequency" word="tissue" count="53"/>
 <result title="frequency" word="cell" count="50"/>
 <result title="frequency" word="cells" count="46"/>
 <result title="frequency" word="ECM" count="43"/>
 <result title="frequency" word="Biomaterials" count="36"/>
 <result title="frequency" word="heart" count="33"/>
 <result title="frequency" word="mechanical" count="31"/>
...
We can now explore all the word frequency results by going through them with a data mining script written with respect to what we are looking for.
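As an illustration of what such a script could look like, here is a minimal Python sketch that sums the word counts over all papers of a project. It assumes the results.xml structure and the results/word/frequencies/ path shown above; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def read_frequencies(xml_bytes):
    """Turn the content of one results.xml file into a Counter {word: count}."""
    root = ET.fromstring(xml_bytes)
    return Counter({r.get("word"): int(r.get("count")) for r in root.iter("result")})


def project_frequencies(project_dir):
    """Sum the word counts of every paper folder found under project_dir."""
    total = Counter()
    pattern = "*/results/word/frequencies/results.xml"
    for xml_file in Path(project_dir).glob(pattern):
        total += read_frequencies(xml_file.read_bytes())
    return total


if __name__ == "__main__":
    # Print the 20 most frequent words over the whole project, if it exists.
    for word, count in project_frequencies("3DPCL").most_common(20):
        print(word, count)
```

From there, it is a short step to rank the papers themselves, for example by how often a word of interest appears in each of them.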
Let’s now try to explore the regex functionality provided by ami. To do so, we need to create an .XML file which contains the regular expressions we will use.
For this case, I will use a simple file which finds all occurrences of:
- “PCL” or “Polycaprolactone” or “polycaprolactone”
- “FDM” or “Fused Deposition Modeling” or “Fused deposition modeling” or “fused deposition modeling”
The file I will create has to respect the following format:
<compoundRegex title="3DPCL">
 <regex weight="1.0" fields="PCL">([Pp]olycaprolactone)</regex>
 <regex weight="1.0" fields="PCL">(PCL)</regex>
 <regex weight="1.0" fields="FDM">([Ff]used\s[Dd]eposition\s[Mm]odeling)</regex>
 <regex weight="1.0" fields="FDM">(FDM)</regex>
</compoundRegex>
The weight parameter influences the relative importance given to each match (I kept it at 1 for now), while the regex query itself is provided between () in each line.
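Before running ami, the four expressions can be tried out on a sentence with Python's re module. This is only a sanity check under the assumption that the patterns behave the same way in ami's regex engine; the dictionary and function names are made up for this example.

```python
import re

# The four patterns from the compoundRegex file, grouped by their "fields" name.
PATTERNS = {
    "PCL": [r"([Pp]olycaprolactone)", r"(PCL)"],
    "FDM": [r"([Ff]used\s[Dd]eposition\s[Mm]odeling)", r"(FDM)"],
}


def count_matches(text):
    """Count how many times the patterns of each field match in text."""
    return {
        field: sum(len(re.findall(p, text)) for p in patterns)
        for field, patterns in PATTERNS.items()
    }


sample = "Fused deposition modeling (FDM) of polycaprolactone (PCL) scaffolds."
print(count_matches(sample))  # {'PCL': 2, 'FDM': 2}
```

Note that the character classes accept every mix of capitalization on the first letters (for example "fused Deposition modeling"), which is slightly broader than the spellings listed above, and that the spelling "modelling" is not matched.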
This file should be saved in the folder of your project (in my case it is 3DPCL) and should be a .xml file. In my case, I named it pcl_fdm.xml. We can then use the regex plugin:
ilyass@ilyass-PC:~/Repositories/getpapers$ ami2-regex --project 3DPCL/ -i scholarly.html --r.regex 3DPCL/pcl_fdm.xml --context 40 60
The --context flag is convenient, as it provides, for each result, the 40 characters before the match and the 60 characters after it, allowing us to quickly evaluate how relevant each result is.
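To make the idea of a context window concrete, here is how the same behaviour could be emulated in Python. It is only an illustration of the principle, not how ami implements it; the function name and its before/after parameters are hypothetical.

```python
import re


def matches_with_context(pattern, text, before=40, after=60):
    """Return (pre, match, post) triples, with up to `before` and `after` characters."""
    results = []
    for m in re.finditer(pattern, text):
        pre = text[max(0, m.start() - before):m.start()]
        post = text[m.end():m.end() + after]
        results.append((pre, m.group(0), post))
    return results


text = "However, the synthesis of PCL with other fast-degradable polymers can tune degradation."
for pre, match, post in matches_with_context(r"PCL", text):
    print(repr(pre), match, repr(post))
```

A match close to the beginning or the end of the text simply gets a shorter context.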
If we take a look at the results again after running the command:
3DPCL/
├── eupmc_fulltext_html_urls.txt
├── eupmc_results.json
├── pcl_fdm.xml
├── PMC4709372
│   ├── eupmc_result.json
│   ├── fulltext.pdf
│   ├── fulltext.xml
│   ├── results
│   │   ├── regex
│   │   │   └── 3DPCL
│   │   │       └── results.xml
│   │   └── word
│   │       └── frequencies
│   │           ├── results.html
│   │           └── results.xml
...
We now have results for our regex query too. If we open a result:
<?xml version="1.0" encoding="UTF-8"?>
<results title="3DPCL">
 <result pre="" name0="PCL" value0="PCL" post="is similar to PLA and PGA but it has a much slower" xpath="/html/body/div/div/p"/>
 <result pre="ote osteoblast growth and maintain its phenotype, " name0="PCL" value0="PCL" post="scaffold has been used as a long-term implant in t" xpath="/html/body/div/div/p"/>
 <result pre="981; Rich et al. 2002). However, the synthesis of " name0="PCL" value0="PCL" post="with other fast-degradable polymers can tune degra" xpath="/html/body/div/div/p"/>
 <result pre="inting technologies [i.e. 3D printing (3DP)], and " name0="FDM" value0="FDM" post="are most widely used for the construction of tissu" xpath="/html/body/div/div/p"/>
For each result, we get the context in the pre and post fields, the name0 and value0 fields, which are related to the regex query we created, and the xpath, which is the exact position of the sentence in the HTML tree of the ScholarlyHTML file attached to this article.
This result might seem a bit rough on the eyes, but .XML is a structured format which can be explored quite easily with Python, for example.
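As a minimal sketch of that last point, the snippet below reads a regex results file with Python's standard xml.etree.ElementTree module. It assumes the attribute names shown in the result above (pre, name0, value0, post, xpath); the sample is a shortened copy of that result with a closing tag added, and read_matches is a made-up name.

```python
import xml.etree.ElementTree as ET

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<results title="3DPCL">
 <result pre="" name0="PCL" value0="PCL" post="is similar to PLA and PGA but it has a much slower" xpath="/html/body/div/div/p"/>
 <result pre="inting technologies [i.e. 3D printing (3DP)], and " name0="FDM" value0="FDM" post="are most widely used for the construction of tissu" xpath="/html/body/div/div/p"/>
</results>"""


def read_matches(xml_bytes):
    """Return one dict per <result>, keeping the match and its context."""
    root = ET.fromstring(xml_bytes)
    matches = []
    for r in root.iter("result"):
        parts = (r.get("pre", "").strip(), f"[{r.get('value0')}]", r.get("post", "").strip())
        matches.append({
            "field": r.get("name0"),
            "match": r.get("value0"),
            # Context and match glued back together, with the match in brackets.
            "sentence": " ".join(p for p in parts if p),
            "xpath": r.get("xpath"),
        })
    return matches


for m in read_matches(SAMPLE):
    print(m["field"], "|", m["sentence"])
```

To process a whole project, the same function can be applied to the content of every results/regex/3DPCL/results.xml file.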