Jekyll2022-12-13T15:23:26+00:00https://tinyecology.com/feed.xmlThe Rivers LabAdvancing agriculture through the application of machine learning and microbiome scienceAdam R. Rivers, PhDThoughts on the silica hypothesis as a cause of Chronic kidney disease of unknown etiology (CKDu)2022-12-05T00:00:00+00:002022-12-05T00:00:00+00:00https://tinyecology.com/posts/silica-CKDu-sugarcane<p>Chronic kidney disease of unknown etiology is a devastating form of kidney disease that strikes primarily young agricultural workers in sugarcane and rice production. For background I’d recommend this piece by <a href="https://undark.org/2022/11/16/in-el-salvador-and-beyond-an-unsolved-kidney-disease-mystery/">Fletcher Reveley in Undark magazine</a> and this article in the New England Journal of Medicine from <a href="https://doi.org/10.1056/NEJMra1813869">Johnson et al. (2019)</a>. There are several competing hypotheses for the cause of the disease, including heat stress, exposure to agrochemicals, and exposure to silica from pre-harvest burning (done to reduce the biomass that needs to be transported to the mill). Silica exposure has been linked to kidney failure observationally <a href="https://doi.org/10.3109/0886022X.2011.623496">(Vupputuri et al. 2011)</a> and amorphous silica nanoparticles have been shown to cause kidney failure in a rat model <a href="https://doi.org/10.1152/ajprenal.00021.2022">(Sasai et al. 2022)</a>. We know that sugarcane workers are exposed to high levels of amorphous silica during burning <a href="https://doi.org/10.3390%2Fijerph17165708">(Schaeffer et al. 2020)</a>.</p>
<p>My USDA-ARS research unit works with sugarcane breeders, so I wanted to offer some observations from talking with them about this issue. I also hope to be a connection that helps medical and public health researchers find sugarcane breeders and agronomists who can provide information on the biology and agronomic processes of the cropping system. I am not an expert on this at all, but I’m posting these observations in the hope that I can help connect experts.</p>
<h1 id="changes-in-breeding-and-agronomic-practice">Changes in breeding and agronomic practice</h1>
<p>One of the most perplexing things about this epidemic is its rapid rise starting in about the 1990s. During this time there was an effort to breed for increased size and lodging resistance (resistance to being blown over). There is some evidence from <a href="https://doi.org/10.1080/01904169309364685">Deren et al. (1993)</a> that leaf silica content varies among sugarcane cultivars and with the silica content of the soil. The silica content of cultivars is not something that would have been recorded in most breeding records, but there has been a lot of selection for plant morphology that could be changing the silica content of the plants over time. It is also possible that changes in the morphology of the <a href="https://www.sciencedirect.com/science/article/pii/S092666902200615X">phytoliths</a> could be affecting the size and properties of the amorphous silica nanoparticles generated by burning.</p>
<p>Another change may be increased usage of calcium silicate and silicate-containing iron slag as a silicon fertilizer, particularly on land with tropical weathered soils <a href="https://doi.org/10.1007/s12633-020-00935-y">(Camargo and Keeping 2021)</a>. Silica fertilization has been noted to improve yield, drought resistance and pest resistance. I do not have good data on the increased use of silica fertilizer in specific locations, but there seems to be increased recommendation for its use. Silicate bioavailability may also be changed secondarily by management changes that alter the pH of the soil.</p>
<h1 id="amorphous-silica-nanoparticles">Amorphous silica nanoparticles</h1>
<p>One of the criticisms of the silica hypothesis is that pulmonary silicosis does not co-occur with CKDu at high levels. There has been a large focus on crystalline silica, which causes silicosis. Some crystalline silica is present in sugarcane smoke and ash, but for CKDu a bigger potential danger is pyrogenically formed amorphous silica nanoparticles. <a href="https://doi.org/10.1038/s41578-020-0230-0">Croissant et al. (2020)</a> have a helpful review article explaining how important the formation temperature, size, shape, and impurities are to the toxicity of amorphous silica nanoparticles.</p>
<h1 id="alternatives-to--burning">Alternatives to burning</h1>
<p>Research is underway on green harvesting without burning, and this practice is required in some countries. Burning reduces the amount of leaf trash transported to the mill by 4-8 tons per acre and makes manual harvesting easier, so it is the cheapest method of harvest. One alternative would be to make the “trash” more valuable by selling refined silica as a secondary, value-added product from milling. Research is ongoing to extract high-grade silica from sugarcane bagasse that would be profitable to sell (<a href="https://doi.org/10.3390%2Fnano12132184">Seroka et al. 2022</a>).</p>
<p>If you are a researcher and want help connecting with sugarcane agronomists and breeders at USDA about this please reach out to me.</p>Adam R. Rivers, PhDChronic kidney disease of unknown etiology is a devastating form of kidney disease that strikes primarily young agricultural workers working in sugarcane and rice production. For a background on this I’d recommend this piece by Fletcher Reveley in Undark magazine and this article in the New England Journal of Medicine from Johnson et al. (2019). There are several competing hypotheses for the cause of the disease, including heat stress, exposure to agrochemicals or silica from pre-harvest burning, (done to reduce the biomass that needs to be transported to the mill). Silica exposure has been linked to kidney failure observationally (Vupputuri et al. 2011) and amorphous silica nanoparticles have been shown to cause kidney failure in a rat model (Sasai et al. 2022). We know that sugarcane workers are exposed to high levels of amorphous silica during burning (Schaeffer et al 2020).Teaming up with the UF Emerging Pathogens Institute2020-01-10T00:00:00+00:002020-01-10T00:00:00+00:00https://tinyecology.com/posts/epi<p>We arrived!</p>
<p>The University of Florida <a href="http://www.epi.ufl.edu/">Emerging Pathogens Institute</a> has graciously provided
space for our USDA group in the biomathematics wing of EPI. The Institute has a unique, integrated focus on animal, plant and human health. The move gives lab members the opportunity to interact with microbiologists and statisticians working on many agriculturally relevant projects.</p>Adam R. Rivers, PhDWe arrived!Summer research in computational biology2019-04-03T00:00:00+00:002019-04-03T00:00:00+00:00https://tinyecology.com/posts/summer-positions<p class="notice--warning">This position has been closed, it is visible for archival purposes.</p>
<h3 id="the-agricultural-microbiomes-group-has-funds-available-for-summer-research-appointments">The Agricultural microbiomes group has funds available for summer research appointments</h3>
<p>We have several potential interesting projects for students with an interest in
machine learning, agriculture or microbiome science. Applicants should have
some experience programming or scripting, preferably in Python, R, or JavaScript.</p>
<ul>
<li>
<p>Develop a model to update frost-free dates to take climate change into
account, and create a web application to help growers use the data to time the
planting of crops.</p>
</li>
<li>
<p>Develop a Vue.js/MongoDB/Flask web application to visualize sequencing quality control data.</p>
</li>
</ul>Adam R. Rivers, PhDThis position has been closed, it is visible for archival purposes.AI for Agriculture2018-06-05T00:00:00+00:002018-06-05T00:00:00+00:00https://tinyecology.com/posts/ai-for-ag<p>To understand the impact of Artificial intelligence on Agriculture it is
important to put it in the context of larger economic changes in agriculture.
Since 1948 US agricultural output has grown at 1.48% annually and total factor
productivity has risen by about 1.38% annually with minimal growth in farm
inputs (USDA ERS 2018). While total farm input growth has remained flat, the
allocation of these inputs has shifted dramatically, with 4-fold declines in
labor inputs, modest declines in land input and increases in pesticide,
fertilizer and machinery use. This has taken place during a time of farm
consolidation that has led to fewer, better capitalized producers that are able
to invest in technological improvements.</p>
<p>Artificial intelligence is a field of computer science that deals with creating
machines that can reason. Today the distinction is often made between broad AI
characteristic of human intelligence and narrow AI focused on one task like
labeling objects in pictures. Machine learning is a subset of AI that focuses on
having machines infer patterns from data that allow them to predict values or
classes. This area has seen incredible growth because in areas like computer
vision, machines have approached human performance and in other areas like
pattern recognition with high dimensional data they have exceeded it. Future
areas of AI may focus on the next layer of cognitive complexity: inferring
causality and reasoning through counterfactual arguments to deduce new ideas.</p>
<p>Artificial intelligence will disrupt and improve agriculture in many ways, but
as with many systemic technologies like steam, electricity, and rail, auto and
container transportation systems, the disruption will likely come from
efficiency gains in a wide range of processes at all levels of agriculture
rather than one breakthrough technology. These systemic changes are likely to
accelerate declines in farm employment, while increasing total farm output and
efficiency.</p>
<p>Currently AI development in agriculture is focused in a few key areas.</p>
<ul>
<li>
<p>Robotics, aided by advances in machine vision, has allowed researchers to
create robots that can weed lettuce fields (Blue River Technology) or pick
strawberries (Harvest CROO Robotics), tasks that until recently could only be
done by humans.</p>
</li>
<li>
<p>Smart tractor technologies that guide a tractor and control the application
of seed and chemicals and measure yields at specific points in the field are
collecting valuable data that will enable the holders of that data to create
tremendous value by developing predictive models for how inputs should be used.
Currently that data is in private hands but is being aggregated by farm
equipment companies. By developing free, useful software tools for the farmer,
ARS could collect enough of that data to publicly advance the field and spur
greater innovation and insight from drone and satellite data.</p>
</li>
<li>
<p>A lot of work is being done developing models to take image data from
drones and turn it into recommendations of where to spray, fertilize or plant.</p>
</li>
<li>
<p>Soil and plant diagnostics. Work is being done to diagnose plant disease by
images and correlate soil chemistry and microbial contents with disease. Some of
this work involves AI.</p>
</li>
<li>
<p>Economic models can be enhanced to predict the effects of demand, planting
choices and weather to help farmers make decisions about what to plant and
planting/harvest timing
(<a href="https://news.microsoft.com/en-in/features/ai-agriculture-icrisat-upl-india/">https://news.microsoft.com/en-in/features/ai-agriculture-icrisat-upl-india/</a>).</p>
</li>
<li>
<p>IoT – What discussion of AI would be complete without its buzzword cousin
the internet of things? Environmental sensors deployed in fields will likely
yield insights that can increase productivity and decrease input use once
large-scale platforms to integrate and assimilate the data are in place and
enough data has been gathered to train models that create valuable predictions
from that data. Productivity will also be gained from simple IoT applications
like text alerts that an irrigation pump has failed or a gate that can be
remotely operated.</p>
</li>
<li>
<p>Crop improvement. AI is beginning to be applied to crop improvement in
several ways. Image recognition is improving phenomics and allowing breeders to
scale up screens. The selection of genomic intervals for breeding or insertion
by CRISPR/Cas9 will speed up improvement. ML may be applied to complex
prediction problems like multi-trait selection.</p>
</li>
<li>
<p>Animal health – The behavior of animals (feeding, watering and movement),
tracked with RFID chips, can be used to identify sick animals, like Fitbits for pigs.</p>
</li>
</ul>
<p>In the future, AI may be focused on new areas, including:</p>
<ul>
<li>
<p>Selecting genomic regions for breeding to increase genetic diversity but
retain desirable traits in low diversity crops like citrus or grapes that are
susceptible to disease outbreaks.</p>
</li>
<li>
<p>Engineering microbial consortia to achieve specific outcomes like disease
suppression, drought tolerance or yield improvement.</p>
</li>
<li>
<p>Post-harvest processing improvements like FTIR spectroscopy and analysis of
grain to divert aflatoxin contaminated grain before it is mixed with other grain
or better supply chain control to reduce spoilage. I know Driscoll’s is using
machine vision to avoid packing bad berries.</p>
</li>
<li>
<p>The control of non-point source pollution. IoT sensors and valves can be
used to control runoff from tile drains and releases from waste lagoons on
concentrated animal feeding operations. This is particularly important for water
bodies like Lake Erie and the Gulf of Mexico. Investment in these technologies
will not likely proceed until regulations create market demand for non-point
source pollution control technologies.</p>
</li>
</ul>Adam R. Rivers, PhDTo understand the impact of Artificial intelligence on Agriculture it is important to put it in the context of larger economic changes in agriculture. Since 1948 US agricultural output has grown at 1.48% annually and total factor productivity have risen by about 1.38% annually with minimal growth in farm inputs (USDA ERS 2018). While total farm input growth has remained flat, the allocation of these inputs has shifted dramatically, with 4-fold declines in labor inputs, modest declines in land input and increases in pesticide, fertilizer and machinery use. This has taken place during a time of farm consolidation that has led to fewer, better capitalized producers that are able to invest in technological improvements.Cyberbiosecurity2017-12-22T00:00:00+00:002017-12-22T00:00:00+00:00https://tinyecology.com/posts/cyberbiosecurity<p>Our Office of National Programs at the USDA-ARS recently sent around an article
from <a href="https://doi.org/10.1016/j.tibtech.2017.10.012">Peccoud et al. (2018)</a> on
the risks to data and computer systems from security
holes in bioinformatics software. Perhaps most interestingly, it highlighted work
by <a href="https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-ney.pdf">Ney et al. (2017)</a> where DNA was synthesized to exploit a software vulnerability during the processing of
sequence data.</p>
<p>It seemed like there were a
couple of messages for developers and users of bioinformatics software in these articles.
As biologists, it’s good to always keep in mind that data could be malicious and that
some of the software we use may not be secured against malicious data. However
we need to keep the response to the threat in proportion to 1) its probability of
occurring, 2) the value of the system affected, and 3) the value of the information to
others. For most of us experimental biologists the probability is low, our systems
are not valuable targets and our raw data is mostly valuable to us. The trust in our
community isn’t inherently bad; it also engenders the sharing of information that
speeds discovery.</p>
<p>A bigger immediate threat to most biologists is probably that we download lots of
software that we don’t investigate thoroughly, including unsigned binary versions of
software that could have been compromised by a third party, even if the developer
is trusted. That software itself may contain viruses.</p>
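<p>One modest safeguard, when a developer publishes a checksum alongside a download, is to compare it against the file you actually received. The sketch below is only an illustration: the function names <code class="language-plaintext highlighter-rouge">sha256_of_file</code> and <code class="language-plaintext highlighter-rouge">matches_published_checksum</code> are made up for this post, and it assumes the project publishes a SHA-256 digest. A matching checksum shows the file was not altered in transit, not that the software itself is safe.</p>

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns empty bytes
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected_hex):
    """Compare a downloaded file's digest with the checksum published by the developer."""
    expected = expected_hex.strip().lower()
    return hmac.compare_digest(sha256_of_file(path), expected)
```

<p>You would call <code class="language-plaintext highlighter-rouge">matches_published_checksum</code> with the path to the download and the digest copied from the project's release page, and refuse to install if it returns <code class="language-plaintext highlighter-rouge">False</code>.</p>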
<p>It seemed like the take-away for bioinformatics programmers was that DNA
sequences are text strings and any software you write should “sanitize its input”,
meaning it should not allow the text to execute arbitrary commands. When using
low-level primitives that operate directly on memory, extra care should be taken. As
developers we don’t know the ultimate end user of our software. One user out of
thousands may end up using it in a critical web application or for a high-stakes forensics
application.</p>
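<p>As a rough illustration of what sanitizing sequence input can look like, the sketch below accepts a string only if it consists entirely of IUPAC nucleotide codes and is under a length limit. The name <code class="language-plaintext highlighter-rouge">clean_sequence</code>, the allowed alphabet and the length cap are my own assumptions for the example, not something taken from either paper.</p>

```python
import re

# Assumed whitelist: IUPAC nucleotide codes only. Protein or gapped
# alignments would need a different alphabet.
VALID_DNA = re.compile(r"[ACGTURYSWKMBDHVN]+")


def clean_sequence(raw, max_length=10_000_000):
    """Return an upper-cased sequence, or raise ValueError if it looks unsafe."""
    seq = raw.strip().upper()
    if len(seq) > max_length:
        raise ValueError("sequence exceeds the allowed length")
    # fullmatch requires every character to be in the whitelist, so shell
    # metacharacters, quotes and whitespace inside the string are rejected
    if not VALID_DNA.fullmatch(seq):
        raise ValueError("sequence contains characters outside the IUPAC DNA alphabet")
    return seq
```

<p>Validation like this is only one layer. If the sequence is later handed to an external program, passing it as an item in an argument list rather than pasting it into a shell command string avoids a whole class of injection problems.</p>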
<figure>
<img src="https://imgs.xkcd.com/comics/exploits_of_a_mom.png" alt="https://www.xkcd.com/327/" />
<figcaption>DNA sequences and children's names must always be sanitized. XKCD #327, Exploits of a Mom.</figcaption>
</figure>
<p>In an effort to reduce security issues and increase reliability in my code, I
recently started using an automated code review service called
<a href="https://www.codacy.com">Codacy</a>. With each commit it
audits your code for security vulnerabilities and software “anti-patterns”. It caught
my unsafe handling of YAML config files (oops). These automated tools are a
good first step, although they are not a substitute for human code
review by security experts. Perhaps in the future those security resources will be
available to researchers at universities. For right now though, the risk to the typical
biologist is pretty low.</p>
<h1 id="references">References</h1>
<p>Peccoud, J., Gallegos, J.E., Murch, R., Buchholz, W.G., Raman, S. 2018.
Cyberbiosecurity: From Naive Trust to Risk Awareness. Trends Biotechnol.
36, 4–7. <a href="https://doi.org/10.1016/j.tibtech.2017.10.012">doi:10.1016/j.tibtech.2017.10.012</a></p>
<p>Ney, P., Koscher, K., Organick, L., Ceze, L., Kohno, T. 2017. Computer
Security, Privacy, and DNA Sequencing: Compromising Computers with
Synthesized DNA, Privacy Leaks, and More, in: 26th USENIX Security
Symposium (USENIX Security 17). USENIX Association, Vancouver, BC, pp.
765–779. <a href="https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-ney.pdf">https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-ney.pdf</a></p>Adam R. Rivers, PhDOur Office of National Programs at the USDA-ARS recently sent around an article from Peccoud et al. (2018) on the risks to data and computer systems from security holes in bioinformatics software. Perhaps most interestingly, it highlighted work by Ney et al. (2017) where DNA was synthesized to exploit a software vulnerability during the processing of sequence data.Efficient random access of Fasta data with Pyfaidx and pbgzip2017-12-05T00:00:00+00:002017-12-05T00:00:00+00:00https://tinyecology.com/posts/random-access-to-fastas<p>There are often times in bioinformatics when I find myself needing to sample
large fasta files randomly. RefSeq is currently about 250GB compressed; this
is not the sort of file you want your scripts making multiple passes through.</p>
<p><a href="https://peerj.com/preprints/970/">Shirley et al.</a> published a pre-print in
2015 announcing the Python package Pyfaidx. The idea behind the package is to
index fasta files in the faidx format used by Samtools, thereby allowing for
random seek-based access to the file. They have continued to maintain the package
and I’ve found it to be fast and pretty straightforward to use.</p>
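<p>To make the idea of seek-based access concrete, here is a minimal pure-Python sketch of what a faidx-style index records and how it turns a coordinate into a byte position. It is not Pyfaidx's implementation. The functions <code class="language-plaintext highlighter-rouge">build_index</code> and <code class="language-plaintext highlighter-rouge">fetch</code> are hypothetical names, and the sketch assumes every sequence line in a record (except the last) has the same length.</p>

```python
import os
import tempfile


def build_index(path):
    """Scan a FASTA once and record, for each sequence, its length, the byte
    offset of its first base, the bases per line and the bytes per line."""
    index = {}
    name = None
    with open(path, "rb") as handle:
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                index[name] = {"length": 0, "offset": handle.tell(),
                               "linebases": 0, "linewidth": 0}
            elif name is not None:
                entry = index[name]
                bases = len(line.rstrip(b"\r\n"))
                if entry["linebases"] == 0:
                    # The first sequence line defines the layout of the record
                    entry["linebases"] = bases
                    entry["linewidth"] = len(line)
                entry["length"] += bases
    return index


def fetch(path, index, name, start, end):
    """Return bases [start, end) of one sequence (0-based) by seeking, not scanning."""
    entry = index[name]
    end = min(end, entry["length"])
    if start >= end:
        return ""

    def byte_pos(i):
        # Whole lines before base i, plus its position within its own line
        full_lines, within = divmod(i, entry["linebases"])
        return entry["offset"] + full_lines * entry["linewidth"] + within

    with open(path, "rb") as handle:
        handle.seek(byte_pos(start))
        raw = handle.read(byte_pos(end - 1) - byte_pos(start) + 1)
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()


# Small demonstration on a throwaway file
with tempfile.NamedTemporaryFile(suffix=".fasta", delete=False) as tmp:
    tmp.write(b">seq1 demo\nACGTACGT\nTTGGCCAA\nAC\n>seq2\nGGGG\n")
index = build_index(tmp.name)
fragment = fetch(tmp.name, index, "seq1", 6, 10)
print(fragment)
os.remove(tmp.name)
```

<p>Because the byte position is computed with simple arithmetic, the cost of a lookup does not depend on where in the file the record sits. The same arithmetic is why irregular line lengths within a record would break an index of this kind.</p>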
<p><strong>Pyfaidx</strong></p>
<ul class="notice--info">
<li>Github: <a href="https://github.com/mdshw5/pyfaidx">https://github.com/mdshw5/pyfaidx</a></li>
<li>PyPi: <a href="https://pypi.python.org/pypi/pyfaidx">https://pypi.python.org/pypi/pyfaidx</a></li>
<li>Anaconda: <a href="https://anaconda.org/bioconda/pyfaidx">https://anaconda.org/bioconda/pyfaidx</a></li>
</ul>
<h2 id="getting-started">Getting started</h2>
<p>In its most basic usage you create a Pyfaidx fasta object:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pyfaidx</span>
<span class="n">genes</span> <span class="o">=</span> <span class="n">pyfaidx</span><span class="p">.</span><span class="n">Fasta</span><span class="p">(</span><span class="s">'tests/data/genes.fasta'</span><span class="p">)</span>
</code></pre></div></div>
<p>The file you import needs to have sequence lines of consistent length or else
Pyfaidx cannot index it. If you need to fix your file I’d recommend
<a href="https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/reformat-guide/">Reformat from BBTools</a>.</p>
<p>By default Pyfaidx keeps the sequence name but throws away the description. You can
retain the description by adding <code class="language-plaintext highlighter-rouge">read_long_names=True</code>, but this option
only works with uncompressed input data.</p>
<h2 id="attributes">Attributes</h2>
<p>The Pyfaidx fasta object is a dictionary-like object with some caveats. I say
dictionary-like because the API is a bit different; for instance, to access a
record we would need to put in an index range <code class="language-plaintext highlighter-rouge">genes['NM_001282543.1'][:]</code>
rather than <code class="language-plaintext highlighter-rouge">genes['NM_001282543.1']</code>. I would expect to use coordinates
only for the <code class="language-plaintext highlighter-rouge">.seq</code> attribute but it is a minor thing to remember.</p>
<p>It supports:</p>
<ul>
<li>Access by key</li>
<li>Slicing of sequences</li>
<li>Filtering based on keys</li>
<li>Common operations like reverse / complement</li>
<li>Support for Fasta Variants (which I did not test)</li>
<li>A command line interface in addition to the python package</li>
</ul>
<h2 id="more-random-access">More random access</h2>
<p>By default Pyfaidx returns keys in the order they were stored in the dictionary.
While this is not in any particular order, it will return the same order with
each call. If you need pseudorandom sampling you can simply shuffle the
keys and then fetch the records.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">random</span>
<span class="kn">import</span> <span class="nn">pyfaidx</span>

<span class="k">def</span> <span class="nf">shuffle_keys</span><span class="p">(</span><span class="n">fastaobj</span><span class="p">):</span>
    <span class="s">"""Take the Pyfaidx Fasta object and return a shuffled list of its keys."""</span>
    <span class="n">keylist</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">fastaobj</span><span class="p">.</span><span class="n">keys</span><span class="p">())</span>
    <span class="c1"># random.shuffle works in place and returns None, so keep using keylist</span>
    <span class="n">random</span><span class="p">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">keylist</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">keylist</span>

<span class="n">genes</span> <span class="o">=</span> <span class="n">pyfaidx</span><span class="p">.</span><span class="n">Fasta</span><span class="p">(</span><span class="s">'tests/data/genes.fasta'</span><span class="p">)</span>
<span class="n">rand_keys</span> <span class="o">=</span> <span class="n">shuffle_keys</span><span class="p">(</span><span class="n">genes</span><span class="p">)</span>
<span class="k">for</span> <span class="n">key</span> <span class="ow">in</span> <span class="n">rand_keys</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="n">genes</span><span class="p">[</span><span class="n">key</span><span class="p">][:])</span>
</code></pre></div></div>
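<p>If you only need a subset of records, or want the same “random” draw every time a script runs, a seeded sampler is a small variation on the same idea. This is a sketch under my own assumptions: <code class="language-plaintext highlighter-rouge">sample_keys</code> is a hypothetical helper, and a plain dictionary stands in for the Pyfaidx object so the example runs without a fasta file.</p>

```python
import random


def sample_keys(keys, k, seed=None):
    """Return up to k distinct keys chosen pseudorandomly.

    A fixed seed makes the draw repeatable between runs.
    """
    rng = random.Random(seed)  # private generator, leaves the global random state alone
    keylist = list(keys)
    return rng.sample(keylist, min(k, len(keylist)))


# Stand-in for a Pyfaidx Fasta object: any mapping with .keys() works here
records = {"gene_a": "ACGT", "gene_b": "GGCC", "gene_c": "TTAA", "gene_d": "CAGT"}
subset = sample_keys(records.keys(), 2, seed=42)
for key in subset:
    print(key, records[key])
```

<p>Sampling only the keys keeps memory use small, since the sequences themselves are fetched from disk one at a time when you index into the fasta object.</p>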
<h2 id="performance">Performance</h2>
<p>Benchmarking data is available in the
<a href="https://peerj.com/preprints/970/">preprint</a>. I found it to be fast enough
with access speeds that were comparable to sequential access or Biopython SeqIO
in-memory access, but without consuming memory. If you need to use the library
for sequential access there is also an option to read ahead into a
buffer: <code class="language-plaintext highlighter-rouge">genes = pyfaidx.Fasta('tests/data/genes.fasta', read_ahead=10000)</code>.</p>
<h2 id="reading-compressed-data">Reading compressed data</h2>
<p>If you are working with fastas large enough to warrant Pyfaidx chances are you
don’t want to unzip those files, plus gzipped files can be processed more
quickly for IO bound processes. Pyfaidx can read
<a href="https://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html">blocked gzip format</a>
used by Samtools but it can take a long time to convert a large fasta to .bfgz
format. Fortunately this can be done more quickly with pbgzip.</p>
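<p>The reason a blocked gzip file allows random access, and can still be read by ordinary gzip tools, is that it is a series of small, independently compressed gzip members written one after another. The sketch below imitates that layout with the standard library only. It is not a real BGZF writer: it leaves out the extra header field that stores each block's size and the empty end-of-file block, and the 65280-byte block size and the name <code class="language-plaintext highlighter-rouge">compress_in_blocks</code> are assumptions chosen for the illustration.</p>

```python
import gzip


def compress_in_blocks(data, block_size=65280):
    """Compress each chunk as its own gzip member and concatenate the members."""
    blocks = [gzip.compress(data[i:i + block_size])
              for i in range(0, len(data), block_size)]
    return b"".join(blocks)


fasta = b">seq1\n" + b"ACGT" * 50000 + b"\n"
blocked = compress_in_blocks(fasta)

# A standard gzip reader decompresses the members back-to-back
restored = gzip.decompress(blocked)
print(len(fasta), len(blocked), restored == fasta)
```

<p>Since every block can be decompressed on its own, a reader that knows which block holds a given offset only has to decompress that block, which is what makes an index over compressed data practical.</p>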
<h2 id="parallel-blocked-format-gzip-pbgzip">Parallel blocked format gzip (pbgzip)</h2>
<p><a href="https://github.com/nh13/pbgzip">Pbgzip</a> is a multithreaded implementation of
bgzip. It compresses files in a fraction of the time required for the
non-parallel version. Using pbgzip is simple:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">gzip</span> <span class="nt">-d</span> <span class="nt">-c</span> sequences.fasta.gz | pbgzip <span class="nt">-c</span> <span class="nt">-n</span> <span class="o">[</span>number_of_threads] <span class="nt">-6</span> <span class="o">></span> sequences.fasta.bfgz
</code></pre></div></div>
<p>Be sure to save a thread for gzip when you specify ‘-n’ for pbgzip. Now you have
a file compatible with Pyfaidx. You can delete your original file too since
.bfgz can be read by gzip.</p>Adam R. Rivers, PhDThere are often times in bioinformatics when I find myself needing to sample large fasta files randomly. RefSeq is currently about 250GB compressed; this is not the sort of file you want your scripts making multiple passes through.Computational Biology postdoc position2017-11-03T00:00:00+00:002017-11-03T00:00:00+00:00https://tinyecology.com/posts/postdoc-position<p class="notice--warning">This position has been filled, it is visible for archival purposes.</p>
<h3 id="the-agricultural-microbiomes-group-has-funded-a-postdoc-position-with-dr-chris-reisch-at-the-university-of-florida">The Agricultural microbiomes group has funded a postdoc position with Dr. Chris Reisch at the University of Florida</h3>
<p><a href="https://scholar.google.com/citations?user=Q3nC2m4AAAAJ&hl=en">Dr. Chris Reisch</a> and I are looking for a postdoc in computational biology to develop software and statistical methods to analyze data for high-throughput fitness-profiling experiments on enteric bacteria. The work has the potential to elucidate new regulatory mechanisms for virulence in <em>E. coli</em> and other medically and agriculturally important proteobacteria. The position is supervised by Dr. Chris Reisch at the University of Florida, Department of Microbiology and Cell Science and funded through a collaborative agreement with the US Department of Agriculture, Agricultural Research Service, Genomics and Bioinformatics Research Unit. The postdoc will work closely with Dr. Adam Rivers at the USDA on this project and will have the opportunity to participate in microbiome work with other USDA collaborators on agriculturally relevant projects.</p>
<p>We anticipate that a successful applicant will have an interest or experience in microbiology or microbial ecology. The wet-lab component of the project will be done by scientists in Dr. Reisch’s lab. This position focuses on the development of applications for analysis of this data and the publication of software tools. A successful candidate will likely have experience in version control, test-driven development, data structures, HPC systems, DevOps practices and proficiency in Python and R. The applicant should have a strong understanding of statistics, particularly compositional data and linear modeling. We recognize that each person has a unique mix of experience and individual strengths and we encourage any applicant with a strong record of research productivity and relevant experience to apply.</p>
<p>Contact us via email or apply <a href="http://explore.jobs.ufl.edu/cw/en-us/job/505250/post-doctoral-associate-computational-biology-in-microbiology">here</a>.</p>Adam R. Rivers, PhDThis position has been filled, it is visible for archival purposes.Tips on speeding up R2017-11-01T00:00:00+00:002017-11-01T00:00:00+00:00https://tinyecology.com/posts/speed-up-r<p>An ARS scientist asked me about ways to speed up his work in R last week, so I
thought I would post my answer here.</p>
<ul>
<li>
<p>Understand your bottleneck. Is it memory, or computation? Is there a particular point that’s slow? There are packages for
profiling your code to understand the bottlenecks like <code class="language-plaintext highlighter-rouge">pryr</code> and <code class="language-plaintext highlighter-rouge">lineprof</code> that
are useful for this. Optimizing for the sake of optimization is not optimal.
Find the slowest part of the code, optimize it and repeat until the code is fast
enough to get the job done.</p>
</li>
<li>
<p>Is it a package that is slow or your code? To take advantage of packages for
speeding up R you often need to rewrite functions. This can be done for
external packages but it requires an understanding of the package you want to
modify.</p>
</li>
<li>
<p>There is a nice page on R optimization from the Chief Scientist at RStudio:
<a href="http://adv-r.had.co.nz/Performance.html">http://adv-r.had.co.nz/Performance.html</a>.</p>
</li>
<li>
<p>Stop using for loops and start using vector functions.</p>
</li>
<li>
<p>Use parallel cluster packages like Snow, Snowfall, or Parallel. But note that
these will only speed up packages designed to use them or your own code written
to take advantage of them. For instance, in your own code you can use <code class="language-plaintext highlighter-rouge">lapply</code> to
apply a function to a vector but with Parallel you need to use the function
<code class="language-plaintext highlighter-rouge">parLapply</code>.</p>
</li>
<li>
<p>Compile your program to byte code with packages like <code class="language-plaintext highlighter-rouge">JIT</code> or <code class="language-plaintext highlighter-rouge">Compiler</code>.</p>
</li>
<li>
<p>If you have one step that is slow, rewrite the function in C++ using the package <code class="language-plaintext highlighter-rouge">Rcpp</code>.</p>
</li>
<li>
<p>If you are doing something slow like Gibbs sampling or Markov chain Monte Carlo
be sure to use a package designed to do this like <code class="language-plaintext highlighter-rouge">Stan</code> or <code class="language-plaintext highlighter-rouge">Rjags</code>. They have
sampling routines coded in C++ that are much faster.</p>
</li>
<li>
<p>Is your problem embarrassingly parallel where the input data could be split into
different files and submitted as array jobs to different nodes on the cluster?
That could get you the speedup you need.</p>
</li>
<li>
<p>Under the hood R uses an open source linear algebra library called BLAS. There
is a faster proprietary linear algebra library, the Intel Math Kernel Library (MKL). The
Microsoft R Open implementation of R ships with that Intel library. This could
speed your code up if the bottleneck is in matrix operations.</p>
</li>
<li>
<p>For huge problems <code class="language-plaintext highlighter-rouge">Rspark</code> can use Spark to perform map-reduce operations but
there’s a good bit of overhead required to set up a Spark cluster and it is not
currently implemented on Ceres. It could be set up on AWS more easily.</p>
</li>
<li>
<p>There are packages to use GPUs with R (<code class="language-plaintext highlighter-rouge">gpuR</code>) but this should be the last thing
you try. There are much easier ways to obtain speedups and Ceres does not have
any GPU nodes.</p>
</li>
</ul>Adam R. Rivers, PhDAn ARS scientist asked me about ways to speed up his work in R last week, so I thought I would post my answer here.