
Lately, I have been exploring methods to extract large amounts of data from scientific publications as part of my work with Kambiz Chizari for the Colper Science project, the Mozilla Open Science Labs 2017 and some future work we intend to do with the Nalyze team.

In this post, I explore 3 packages offered by the ContentMine group: getpapers, ami and norma. These 3 packages should allow us to download large sets of papers about a certain subject, normalize the obtained data to better explore it and then start analyzing it using basic tools such as word counts and regular expressions.

The first step consists of getting the scientific papers; to do so, we need to get started with getpapers. You can find getpapers in one of the ContentMine organization repositories: clone the repo, cd into the folder and use:

sudo npm install --global getpapers

to install the package. The repository describes it as follows: “getpapers can fetch article metadata, fulltexts (PDF or XML), and supplementary materials. It’s designed for use in content mining, but you may find it useful for quickly acquiring large numbers of papers for reading, or for bibliometrics.” (from the repo)

We are going to investigate Polycaprolactone (PCL) and FDM 3D printing. PCL is an interesting polymer because it is biodegradable and has a very low melting temperature (60 °C), which means that it can easily be remodeled by hand, simply by pouring hot water on it. The subject I am interested in here is FDM 3D printing of PCL, mostly for biomedical applications. We are first going to try using EuropePMC, a repository of open access scientific data (books, articles, patents…). EuropePMC is attached to PubMed, and I quickly realized that it seems to mostly host papers about medical and biomedical applications (that is why our search will be focused on biomedical applications).

The first thing we need to do is query EuropePMC to obtain a list of papers which contain the words in our query. Since we would like to have the papers about 3D printing and PCL, these are going to be the words in our query. The default API used by getpapers is EuropePMC; it is also possible to use IEEE or ArXiv.

The query we will try will be:

ilyass@ilyass-PC:~/Repositories/getpapers$ getpapers -q 'PCL 3D print' -n -o '3DPCL'

• -n runs the query in no-execute mode, which means that nothing will actually be downloaded, but the number of results found will be returned.
• -q is the query itself
• -o is the output folder for the query

ilyass@ilyass-PC:~/Repositories/getpapers$ getpapers -q 'PCL 3D print' -n -o '3DPCL'
info: Searching using eupmc API
info: Found 57 open access results


Now, we simply need to run the same command again without the -n flag to download the results. It is necessary to add the -x flag, which downloads the full-text article in the structured .XML format (we need it to analyze the data later). It is also possible to add -p to the command to automatically download the PDF files for the request. When I tried it for this request, 8 papers among the 57 found had no PDF file available. Run it:

ilyass@ilyass-PC:~/Repositories/getpapers$ getpapers -q 'PCL 3D print' -o '3DPCL' -p -x

(It might take a while, depending on how many papers your request yields.)

Take a look at the data:

ilyass@ilyass-ThinkPad-X1:~/Repositories/getpapers/3DPCL$ tree
3DPCL/
├── eupmc_fulltext_html_urls.txt
├── eupmc_results.json
├── PMC2657346
│   ├── eupmc_result.json
│   └── fulltext.pdf
├── PMC2935622
│   ├── eupmc_result.json
│   └── fulltext.pdf
├── PMC3002806
│   └── eupmc_result.json
...

• eupmc_fulltext_html_urls.txt contains the list of URLs for all articles
• eupmc_results.json contains the result from the API; it is the best place to start exploring the data. Each paper is a JSON object, with an author, abstract, etc…
• Then, there is a folder for each paper, and each folder contains a eupmc_result.json, which is basically that paper's JSON object from the master eupmc_results.json file. It will also contain fulltext.pdf and fulltext.xml if you used the -p and -x flags.
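
To get a first overview of what was downloaded, the master JSON file can be read with a few lines of Python. This is only a minimal sketch: the field names `pmcid` and `title` are assumptions about the file's content, and the helper accepts values stored either as plain strings or as one-item lists, since I am not relying on a particular layout.

```python
import json
from pathlib import Path


def first_value(record, key, default=""):
    """Return record[key]; if the value is a list, return its first item."""
    value = record.get(key, default)
    if isinstance(value, list):
        return value[0] if value else default
    return value


def list_papers(results_path):
    """Return (id, title) pairs from a getpapers-style results file.

    "pmcid" and "title" are assumed field names; adjust them to the keys
    you actually see in your eupmc_results.json.
    """
    records = json.loads(Path(results_path).read_text(encoding="utf-8"))
    return [(first_value(r, "pmcid"), first_value(r, "title")) for r in records]
```

Calling `list_papers("3DPCL/eupmc_results.json")` and printing each pair should give a quick table of contents for the project folder.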

Now that we have the data, we need to normalize it to ScholarlyHTML. Most APIs (whether it is PubMed, ArXiv, EuropePMC or others) will return data in different structured formats. ScholarlyHTML is a common format designed to explore data from any of these APIs. To convert the data we have, we need norma. In order to install it, head to the releases on the GitHub repository and install the .deb package on your Linux machine using: sudo dpkg -i <norma.deb file>

Then use it with:

ilyass@ilyass-PC:~/Repositories/getpapers$ norma --project 3DPCL -i fulltext.xml -o scholarly.html --transform nlm2html

• -i provides the input file names in each folder in the project folder
• -o is the desired name for the Scholarly output

Now if you take a look at your data again:

3DPCL
├── eupmc_fulltext_html_urls.txt
├── eupmc_results.json
├── PMC2657346
│   ├── eupmc_result.json
│   ├── fulltext.pdf
│   ├── fulltext.xml
│   └── scholarly.html
├── PMC2935622
│   ├── eupmc_result.json
│   ├── fulltext.pdf
│   ├── fulltext.xml
│   └── scholarly.html
├── PMC3002806
│   └── eupmc_result.json
...

There is a scholarly.html file in each folder that also contains a fulltext.xml. The scholarly.html file is a document you can open and explore with your web browser.

The next step is to analyze the data we have in a way that makes your future reading faster and more efficient, and that helps you find the papers you should really read thoroughly among all the ones we found. To do so, we are going to try the ami package. It is a collection of plugins designed to extract specific pieces of information called facts. Currently, ami plugins appear to be optimized to extract facts about genes, proteins, agronomy, chemical species, phylogenetics and some diseases. There are no plugins for engineering or material sciences (yet), so we will use two basic plugins for now to try to get some insights about our data: word frequencies and regular expressions (regex).

But first, install it by downloading the latest .deb package release here and install it using sudo dpkg -i <.deb file>.

Now let’s run a basic word frequency plugin from ami on our data:

ilyass@ilyass-PC:~/Repositories/getpapers$ ami2-word --project 3DPCL -i scholarly.html --w.words wordFrequencies

If you look at your data again:

3DPCL/
├── eupmc_fulltext_html_urls.txt
├── eupmc_results.json
├── PMC2657346
│   ├── eupmc_result.json
│   ├── fulltext.pdf
│   ├── fulltext.xml
│   ├── results
│   │   └── word
│   │       └── frequencies
│   │           ├── results.html
│   │           └── results.xml
│   └── scholarly.html
...

• results.html, which shows the frequency of each word in each article as a word cloud (the size of a word depends on its frequency)
• results.xml, which lists the number of occurrences of each word:
<?xml version="1.0" encoding="UTF-8"?>
<results title="frequencies">
<result title="frequency" word="and" count="370"/>
<result title="frequency" word="the" count="278"/>
<result title="frequency" word="for" count="106"/>
<result title="frequency" word="printing" count="97"/>
<result title="frequency" word="tissue" count="83"/>
...


The problem, as you might have noticed, is that common words (such as and or the) are skewing our results. We can get rid of them thanks to a stopwords list. I wrote a very simple Python script which writes the stopwords from nltk/corpora to a text file which can then be used by ami. The script can be found here, and its result, the text file containing all the stopwords, can be downloaded directly here.
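
For readers who prefer to generate the file themselves, here is a minimal sketch of what such a script could look like. It is not the script linked above: the function names are mine, and writing one lowercase word per line is an assumption about the format ami expects. The NLTK import is done lazily so that the writer can be reused with any word list.

```python
from pathlib import Path


def write_stopwords(words, out_path):
    """Write one lowercase stopword per line, de-duplicated and sorted.

    Returns the number of words written. The one-word-per-line layout is an
    assumption, not something taken from the ami documentation.
    """
    unique = sorted({w.strip().lower() for w in words if w.strip()})
    Path(out_path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return len(unique)


def nltk_english_stopwords():
    """Return NLTK's English stopword list.

    Requires the nltk package and its "stopwords" corpus, which can be
    fetched once with nltk.download("stopwords").
    """
    from nltk.corpus import stopwords  # lazy import: only needed here
    return stopwords.words("english")
```

With NLTK installed, `write_stopwords(nltk_english_stopwords(), "stopwords.txt")` would produce the text file.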

We can run the word counter plugin again, this time with the stopwords:

ilyass@ilyass-PC:~/Repositories/getpapers$ ami2-word --project 3DPCL -i scholarly.html --w.words wordFrequencies --w.stopwords stopwords.txt

The results obtained this time are more interesting since they only contain relevant words:

<?xml version="1.0" encoding="UTF-8"?>
<results title="frequencies">
<result title="frequency" word="tissue" count="53"/>
<result title="frequency" word="cell" count="50"/>
<result title="frequency" word="cells" count="46"/>
<result title="frequency" word="ECM" count="43"/>
<result title="frequency" word="Biomaterials" count="36"/>
<result title="frequency" word="heart" count="33"/>
<result title="frequency" word="mechanical" count="31"/>
...

We can now explore all the word frequency results by going through them with a data mining script written with respect to what we are looking for.

Let’s now explore the regex functionality provided by ami. To do so, we need to create a .XML file which contains the regular expressions we will use. For this case, I will use a simple file which finds all occurrences of:

• “PCL” or “Polycaprolactone” or “polycaprolactone”
• “FDM” or “Fused Deposition Modeling” or “Fused deposition modeling” or “fused deposition modeling”

The file I will create has to respect the following format:

<compoundRegex title="3DPCL">
<regex weight="1.0" fields="PCL">([Pp]olycaprolactone)</regex>
<regex weight="1.0" fields="PCL">(PCL)</regex>
<regex weight="1.0" fields="FDM">([Ff]used\s[Dd]eposition\s[Mm]odeling)</regex>
<regex weight="1.0" fields="FDM">(FDM)</regex>
</compoundRegex>

The weight parameter influences the relative importance given to each match (I kept it at 1 for now), while the regex query itself is provided between () in each line. This file should be saved in the folder of your project (in my case it is 3DPCL) and should be a .xml file. In my case, I named it pcl_fdm.xml.
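
Before running ami on the whole project, it can be handy to check what the four patterns actually match. The sketch below tries them on a made-up sentence with Python's `re` module. It assumes that these simple patterns behave the same in Python as in ami's own regex engine, which I have not verified; the sample sentence and the function name are purely illustrative.

```python
import re

# The four patterns from pcl_fdm.xml, grouped by their "fields" attribute.
PATTERNS = {
    "PCL": [r"([Pp]olycaprolactone)", r"(PCL)"],
    "FDM": [r"([Ff]used\s[Dd]eposition\s[Mm]odeling)", r"(FDM)"],
}


def find_matches(text):
    """Return (field, matched_text) pairs, field by field, pattern by pattern."""
    hits = []
    for field, patterns in PATTERNS.items():
        for pattern in patterns:
            hits.extend((field, m.group(1)) for m in re.finditer(pattern, text))
    return hits


sample = "Polycaprolactone (PCL) parts made by fused deposition modeling (FDM)."
print(find_matches(sample))
# [('PCL', 'Polycaprolactone'), ('PCL', 'PCL'),
#  ('FDM', 'fused deposition modeling'), ('FDM', 'FDM')]
```

Note that the patterns are case-sensitive apart from the bracketed letters, so "pcl" or "fdm" in lowercase would not be reported.
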

We can then use the regex plugin:

ilyass@ilyass-PC:~/Repositories/getpapers$ ami2-regex --project 3DPCL/ -i scholarly.html --r.regex 3DPCL/pcl_fdm.xml --context 40 60

• The --context flag is convenient, as it provides us, for each result, with the 40 characters before the match and the 60 characters after it, allowing us to quickly evaluate how relevant each result is

If we take a look at the results again after running the command:

3DPCL/
├── eupmc_fulltext_html_urls.txt
├── eupmc_results.json
├── pcl_fdm.xml
├── PMC4709372
│   ├── eupmc_result.json
│   ├── fulltext.pdf
│   ├── fulltext.xml
│   ├── results
│   │   ├── regex
│   │   │   └── 3DPCL
│   │   │       └── results.xml
│   │   └── word
│   │       └── frequencies
│   │           ├── results.html
│   │           └── results.xml
...


We now have results for our regex query too. If we open a result:

<?xml version="1.0" encoding="UTF-8"?>
<results title="3DPCL">
<result pre="" name0="PCL" value0="PCL" post="is similar to PLA and PGA but it has a much slower" xpath="/html[1]/body[1]/div[2]/div[6]/p[2]"/>
<result pre="ote osteoblast growth and maintain its phenotype, " name0="PCL" value0="PCL" post="scaffold has been used as a long-term implant in t" xpath="/html[1]/body[1]/div[2]/div[6]/p[2]"/>
<result pre="981; Rich et al. 2002). However, the synthesis of " name0="PCL" value0="PCL" post="with other fast-degradable polymers can tune degra" xpath="/html[1]/body[1]/div[2]/div[6]/p[2]"/>
<result pre="inting technologies [i.e. 3D printing (3DP)], and " name0="FDM" value0="FDM" post="are most widely used for the construction of tissu" xpath="/html[1]/body[1]/div[2]/div[6]/p[2]"/>


For each result, we get the context in the pre and post fields, the value0 and name0 which are related to the regex query we created and the xpath, which is the exact position of the sentence in the HTML tree of the ScholarlyHTML file attached to this article.

This result might seem a bit rough on the eyes, but .XML is a structured format which can be explored quite easily, with Python for example.
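
As an illustration, here is a minimal sketch that walks through a project folder, counts the regex hits of each paper per `name0` value, and ranks the papers by total number of hits. It only relies on the folder layout and the attributes shown above; the function names and the ranking criterion are my own choices, not something provided by ami.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def count_hits(results_xml):
    """Count <result> elements per name0 value in one regex results.xml."""
    root = ET.parse(str(results_xml)).getroot()
    return Counter(result.get("name0") for result in root.iter("result"))


def rank_papers(project_dir, regex_title="3DPCL"):
    """Return [(paper_folder, Counter)] sorted by total hits, highest first."""
    root_dir = Path(project_dir)
    ranked = []
    # Layout shown above: <paper>/results/regex/<title>/results.xml
    for path in root_dir.glob(f"*/results/regex/{regex_title}/results.xml"):
        paper = path.relative_to(root_dir).parts[0]
        ranked.append((paper, count_hits(path)))
    # Most hits first; ties are broken alphabetically by folder name.
    ranked.sort(key=lambda item: (-sum(item[1].values()), item[0]))
    return ranked
```

Running `rank_papers("3DPCL")` gives a rough reading order: papers mentioning both PCL and FDM many times come first.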

A simple note post to remember how to concatenate 2 PDF files in an extremely easy way.

Simply run sudo apt-get install pdftk, then use it with: pdftk file1.pdf file2.pdf cat output mergedfile.pdf
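
If you need to do this from a script, the same command can be assembled for any number of input files. The sketch below is only a thin wrapper around the command line shown above; the function names are hypothetical and pdftk must already be installed.

```python
import subprocess


def pdftk_cat_command(inputs, output):
    """Build the pdftk argument list that concatenates inputs into output."""
    if len(inputs) < 2:
        raise ValueError("need at least two PDF files to merge")
    return ["pdftk", *inputs, "cat", "output", output]


def merge_pdfs(inputs, output):
    """Run pdftk; raises CalledProcessError if pdftk reports a failure."""
    subprocess.run(pdftk_cat_command(inputs, output), check=True)
```

For instance, `merge_pdfs(["file1.pdf", "file2.pdf"], "mergedfile.pdf")` runs exactly the command above.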

In a \documentclass{article} LaTeX document, \usepackage{cite} will automatically compress your citations, for example from [1, 2, 3, 4] to [1-4] (check here to learn more about it).

Now if you are using \documentclass{elsarticle} to send your manuscript to an Elsevier journal, the previous solution does not work. The reason is that the elsarticle.cls class file is designed to work with natbib, which redefines the \cite command and therefore conflicts with \usepackage{cite}. A solution to get your compressed citations to work without \usepackage{cite} is provided in the elsarticle class documentation: add the following line before the document’s end:

\biboptions{sort&compress}


PS: I tested everything with bibtex.

If you liked this post, you can share it with your followers or follow me on Twitter!

Click on the start button then type CMD.

Input the following line then press Enter:

netsh wlan show profile name="ENTER_SSID_HERE" key=clear



### Formatting

Press the up arrow to edit your last message, then press Enter.

• Bold: *text*

• Italics: _text_

• ~Strikethrough~: ~text~

• Verbatim: insert the text in-between ` marks (backticks)

### Channels

You can mute any channel to stop getting notifications by typing /mute in the channel. Use it for channels that you only check from time to time. You will still be notified if someone directly mentions you.

You can leave a channel by typing /leave in the channel.

You can invite anyone to a channel simply by typing their username in that channel. You must then click on Invite.

Use the Do Not Disturb mode if you do not want to be disturbed at all for some time. Simply type /dnd 1 hour and you will not be bothered for the next hour.

To send a notification to all users in a channel, use @channel in your message. To send a notification to a specific user use @username. Use @here to send a notification to the presently active users in a channel (anyone using DND or absent will not be notified). Use @everyone to send a notification to every member of the team.

### Set reminders

Get started with: /remind help

Typical usage: /remind me to Call Tamir next Thursday at 3PM

You will receive a reminder on Thursday at 3PM that you can either mark as completed, or ask Slackbot to remind you again some other time. You can set up reminders for anyone in the same Slack team by using their username instead of me. You will be notified by Slackbot once that person marks the reminder as complete.

Check your existing reminders by typing: /remind list

Other examples:

/remind me to drink water at 3pm every day
/remind @jessica about the interview in 3 hours


### Tiny database

It is possible to set up custom Slackbot answers. These answers are triggered by specific keywords or sentences. The most common use is for the WiFi password: if a message contains the word wifi together with either password or key, then Slackbot should provide the WiFi key.

It can be used as electronic post-its to remember things…

### Integrations or apps

A large number of apps or integrations are available for Slack. An integration either adds functionality to Slack or plugs it to other services. The most common integration to expand functionality is Giphy; once installed, it is possible to type @giphy [text] in any channel to display a gif related to the [text] written.

Slack can also be connected to GitHub and Trello, for example. In that case, any update in a GitHub repo will appear in a certain channel. Other users can immediately react and discuss the update, whether it is a GitHub update, a Trello one, Google Drive, etc…


Thanks to Patrick Diehl, I have been working at Bonn University, more precisely at the Institut für Numerische Simulation, in order to develop some new tools to better explore some of my experimental results through the Peridynamics framework (writing it in italics because I am still not sure that it is what we are actually doing, welcome to Research). During my stay, Patrick guided me through installing Peridigm on my machine. In this blog post, I will rewrite the steps to do so in such a way that they are clear even for someone who is not familiar with the Linux environment. If you are already familiar with the Linux environment, you should head toward Patrick’s blog post, which is aimed at more experienced users.

First, you need virtual machine software if you are on Windows; then grab the Fedora 23 image. Create a new virtual machine and install Fedora on it.

Once your machine is installed and working, you need to install a few packages. To do so, open a terminal and type:

sudo dnf install mpich-devel mpich netcdf-mpich-devel netcdf-mpich hdf5-mpich-devel hdf5-mpich netcdf-fortran-mpich-devel boost-mpich-devel boost-mpich blas-devel blas lapack-devel lapack gcc-c++


DNF is Fedora's equivalent of apt-get on Ubuntu.

You will also need to get a group of software packages. This is a great feature of DNF: it fetches and installs a whole set of packages needed for a certain field or usage. In this case, you will need to run:

sudo dnf  groupinstall "X Software Development"


Once you have the necessary packages, you will need to get and compile Trilinos on your machine. This guide assumes that you are using Trilinos 12.6.2; you can get it here. Unpack Trilinos on your machine, and use the terminal to enter the unpacked folder. We are going to create a build folder inside it in order to compile it properly.

mkdir build && cd build


We will then need to load the previously installed mpich package, which is necessary for multiprocessor computations. We need it to compile Trilinos:

module load mpi/mpich-x86_64


Now, let’s create an empty text file in the previously created build folder and paste in it a list of instructions for CMake, explaining how Trilinos should actually be compiled. This file should be called build.sh:

cmake \
-D CMAKE_INSTALL_PREFIX:PATH=/home/ilyass/trilinos-12.6.2-source \
-D CMAKE_CXX_FLAGS:STRING="-O2 -ansi -pedantic -ftrapv -Wall -Wno-long-long" \
-D CMAKE_BUILD_TYPE:STRING=RELEASE \
-D Trilinos_WARNINGS_AS_ERRORS_FLAGS:STRING="" \
-D Trilinos_ENABLE_ALL_PACKAGES:BOOL=OFF \
-D Trilinos_ENABLE_Teuchos:BOOL=ON \
-D Trilinos_ENABLE_Shards:BOOL=ON \
-D Trilinos_ENABLE_Epetra:BOOL=ON \
-D Trilinos_ENABLE_EpetraExt:BOOL=ON \
-D Trilinos_ENABLE_Ifpack:BOOL=ON \
-D Trilinos_ENABLE_AztecOO:BOOL=ON \
-D Trilinos_ENABLE_Amesos:BOOL=ON \
-D Trilinos_ENABLE_Anasazi:BOOL=ON \
-D Trilinos_ENABLE_Belos:BOOL=ON \
-D Trilinos_ENABLE_ML:BOOL=ON \
-D Trilinos_ENABLE_Phalanx:BOOL=ON \
-D Trilinos_ENABLE_Intrepid:BOOL=ON \
-D Trilinos_ENABLE_NOX:BOOL=ON \
-D Trilinos_ENABLE_Stratimikos:BOOL=ON \
-D Trilinos_ENABLE_Thyra:BOOL=ON \
-D Trilinos_ENABLE_Rythmos:BOOL=ON \
-D Trilinos_ENABLE_MOOCHO:BOOL=ON \
-D Trilinos_ENABLE_TriKota:BOOL=OFF \
-D Trilinos_ENABLE_Stokhos:BOOL=ON \
-D Trilinos_ENABLE_Zoltan:BOOL=ON \
-D Trilinos_ENABLE_Piro:BOOL=ON \
-D Trilinos_ENABLE_Teko:BOOL=ON \
-D Trilinos_ENABLE_SEACASIoss:BOOL=ON \
-D Trilinos_ENABLE_SEACAS:BOOL=ON \
-D Trilinos_ENABLE_SEACASBlot:BOOL=ON \
-D Trilinos_ENABLE_Pamgen:BOOL=ON \
-D Trilinos_ENABLE_EXAMPLES:BOOL=OFF \
-D Trilinos_ENABLE_TESTS:BOOL=ON \
-D TPL_ENABLE_Matio=OFF \
-D TPL_ENABLE_HDF5:BOOL=ON \
-D TPL_ENABLE_Netcdf:BOOL=ON \
-D TPL_ENABLE_MPI:BOOL=ON \
-D TPL_ENABLE_BLAS:BOOL=ON \
-D TPL_ENABLE_LAPACK:BOOL=ON \
-D TPL_ENABLE_Boost:BOOL=ON \
-D CMAKE_VERBOSE_MAKEFILE:BOOL=OFF \
-D Trilinos_VERBOSE_CONFIGURE:BOOL=OFF \
-D Netcdf_INCLUDE_DIRS=/usr/include/mpich-x86_64/ \
-D HDF5_INCLUDE_DIRS=/usr/include/mpich-x86_64/ \
..


It is important that you modify the second line of this file by changing the absolute path to the location where you unpacked Trilinos on your machine. We will then change the permissions of the build.sh file to make it executable by typing chmod a+x build.sh, and we can then configure the compilation of Trilinos by executing this file; just type: ./build.sh CMake will now configure Trilinos for compilation, which will take some time. After that, we can finally build and install Trilinos by typing:

make -j
make install


This is also going to take some time, but once it is done, Trilinos will be installed on your machine. Congrats! Next, we are going to compile Peridigm. We are going to pull the latest release of Peridigm directly from its GitHub repository. Go to your home folder by simply typing cd in the terminal, then enter the following commands one line at a time:

git clone https://github.com/peridigm/peridigm.git
git -C peridigm fetch
git -C peridigm checkout release-1.4.1


You should now have a folder called peridigm in your home folder. We are now going to create a build folder in it (similar to what we did for Trilinos) to compile Peridigm:

cd peridigm
mkdir build
cd build


The next step is also similar to what we previously did with Trilinos: you will need to create a build.sh text file in the build folder and paste the following content in it:

cmake \
-D CMAKE_BUILD_TYPE:STRING=Release \
-D Trilinos_DIR:PATH=/home/ilyass/trilinos-12.6.2-source/lib/cmake/Trilinos/ \
-D CMAKE_C_COMPILER:STRING=/usr/lib64/mpich/bin/mpicc \
-D CMAKE_CXX_COMPILER:STRING=/usr/lib64/mpich/bin/mpicxx \
-D CMAKE_CXX_FLAGS:STRING="-O2 -Wall -ansi -pedantic -Wno-long-long -ftrapv -Wno-deprecated -std=gnu++11" \
..


Again, do not forget to change the absolute path on the third line of the previous script to the location where you installed Trilinos on your machine. We can now run the CMake configuration by executing the build.sh file:

chmod a+x build.sh
./build.sh


Wait for it…. And you can then install Peridigm:

make  -j



You should be done! To execute Peridigm, you will need to type ./src/Peridigm in your terminal when you are in the peridigm folder. And that’s it!

Now if you want to be able to run Peridigm from any folder in your machine, you need to modify your .bashrc to tell it where the Peridigm executable is:

cd
nano .bashrc


And add the following line at the end of the file:

export PATH=$PATH:/home/ilyass/peridigm/src/


Restart your terminal, and you’re done! You can now type the Peridigm command anywhere in your terminal and it will launch the software.
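
If the command is not found after restarting the terminal, a quick way to see what went wrong is to check whether the folder really ended up in PATH and whether an executable with that name can be located. The helper names below are hypothetical; the sketch only uses Python's standard library.

```python
import os
import shutil


def dir_in_path(directory, path=None):
    """Return True if directory is one of the entries of PATH."""
    raw = path if path is not None else os.environ.get("PATH", "")
    target = os.path.normpath(directory)  # ignores a trailing slash
    return any(os.path.normpath(e) == target for e in raw.split(os.pathsep) if e)


def find_executable(name, path=None):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name, path=path)
```

For example, `dir_in_path("/home/ilyass/peridigm/src/")` tells you whether the export line was picked up, and `find_executable("Peridigm")` returns the location of the binary if the shell can find it.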

Many thanks to Patrick Diehl :)


Just sharing a timelapse of an 18-hour print. The part is the Bird Feeder v4, designed by Printed Nest. The print was done on a Leapfrog Creatr, with a layer height of 0.3mm and 80% infill. The Printed Nest team recommends a 100% infill but I try to avoid printing solid parts on printers that are not equipped with an active cooling system (which is the case of the Leapfrog Creatr). Otherwise it is necessary to print very slowly in order to let the previous solid layer cool down before laying the next one on it.

The filament I used is from the PLA Opaque Series. It is affordable, there are no knots in the spool, it flows great, the color is good enough and they have several safety certifications on their Material Safety Data Sheet. If you would like to get some, take a look at 3D-TRIP for even more interesting prices around the Montreal area.