org-R is an org-mode extension that performs numerical computations and generates graphics. Numerical output may be stored in the org buffer in org tables, and the input can also come from an org table. Rather than starting off by documenting everything systematically, I'll provide several commented examples. Towards the end there are lists of available actions and other options.
Although org-R uses R behind the scenes, you do not need to know anything about R. Common operations are provided `off the shelf' by specifying options on lines starting with #+TBLR:. Having said that, org-R also accepts raw R code (on #+TBLRR: lines). For those who don't yet know R but think they might be interested, try the showcode:t option. It displays the R code corresponding to the action you requested, and so provides a good starting point for fine-tuning your analysis. But that's getting ahead of things.
My hope is, of course, that this will be of use to people. So at this stage any comments, ideas, feedback, bug reports, etc. would be very welcome. I'd be happy to help anyone who's interested in using this, via the Org mailing list.
The code is currently here. Soon it will be in the contrib directory. The other things you need are R (Windows / OS X binaries available on the R website; widely available in Linux package repositories) and the Emacs mode Emacs Speaks Statistics (ESS). ESS installation instructions are here. Personally, under Linux, I have something like this in my Emacs configuration:
(add-to-list 'load-path "/path/to/ess/lisp")
(require 'ess-site)
org-R uses two different option lines to specify an analysis/plot: #+TBLR: and #+TBLRR:. #+TBLRR: is the one that accepts R code, so we'll ignore that for now. To make the action happen, use M-x org-R-apply with point in the #+TBLR: line. That's the only function you need, and it would make sense to bind it to some key. So, first example.
Here's a command to tabulate the values in the second column. Issue M-x org-R-apply in the following #+TBLR line.
That results in a table of counts: the values in column 2 were tabulated as requested. However, the original data got overwritten. That leads us to the question of how input data are specified.
We can specify input data for analysis/plotting in 3 different ways:
Case (3) is what happened above -- the input data came from a table immediately above the #+TBLR line. The default behaviour is to replace any such table with the output; this allows us to tweak the option line and update the analysis. However, normally we'll want to separate the data from the analysis output. So let's keep the data as a named table in the org file, and refer to it by name:
[arbitrary other content of org buffer]
which results in
Note that this time we did a different analysis: I removed the columns:2 option, so that tabulate was passed the whole table. As a result the output contains counts of joint occurrences of values in the two columns: out of the 4 possibilities, the only one we didn't observe was "B in column 1 and A in column 2". We could have achieved the same result with columns:(1 2). (But don't try to tabulate more than 2 columns: org does not do multi-dimensional tables).
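To make the difference between the two tabulations concrete, here is a minimal sketch in Python (org-R itself does this with R's table function; Python is used here only for illustration). The data in `rows` are made up, chosen so that the one unobserved combination is "B in column 1 and A in column 2", as in the text.

```python
from collections import Counter

# Hypothetical two-column data; each tuple is one table row (column 1, column 2).
rows = [("A", "A"), ("A", "B"), ("B", "B"), ("A", "B"), ("B", "B")]

# Tabulating a single column counts each distinct value in that column.
col2_counts = Counter(second for _, second in rows)

# Tabulating both columns counts each distinct (column 1, column 2) pair.
joint_counts = Counter(rows)

# Print the joint counts as a grid: one row per column-1 value,
# one cell per column-2 value.
levels = sorted({value for row in rows for value in row})
for first in levels:
    print(first, [joint_counts[(first, second)] for second in levels])
```

A pair that never occurs simply gets a count of zero, which is why the output has a cell for all four combinations even though only three were observed.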
At the risk of this starting to sound like a bad and boring undergraduate statistics textbook, the sort of plots that are appropriate depend on the sort of data. Let's divide it up as
The available off-the-shelf actions are listed here.
We're going to need some data. So let's prove that org can also speak statistics and use org-R to simulate the data. This requires some raw R code, so skip this bit if you're not interested in R.
The following #+TBLRR line simulates 10 values from a Normal distribution with mean -3, and 10 values from a Normal distribution with mean 3, and lumps them together. The point is that the numbers we get should be concentrated around two different values, and we should be able to see that in a histogram and/or density plot.
Here's what I got. Note that the title: option set the name of the table with "#+TBLNAME"; we'll use that to refer to these data.
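For readers who want to see the shape of that simulation without R, the following Python sketch draws the same kind of sample. It is only an illustration: the standard deviation of 1 and the fixed seed are assumptions, since the text only specifies the two means and the sample sizes.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable (an assumption)

# 10 draws centred on -3 and 10 draws centred on 3, lumped into one column.
# The standard deviation of 1 is assumed; the text only gives the means.
values = ([random.gauss(-3, 1) for _ in range(10)]
          + [random.gauss(3, 1) for _ in range(10)])

# Split at zero to check that the values sit around two different centres.
low = [v for v in values if v < 0]
high = [v for v in values if v >= 0]
print(len(values))
print(round(statistics.mean(low), 2), round(statistics.mean(high), 2))
```

The two printed means should land near -3 and 3, which is the two-peaked structure the histogram and density plot are meant to reveal.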
Now to plot the data. Let's have some colour as well, and this time the title: option will be used to put a title on the plot (and also to name the file link to the graphical output).
[Note that you can use multiple TBLR lines rather than cramming all the options on to one line.]
An alternative would be to produce a density plot. We don't have enough data points to justify that here, but we'll do it anyway just to show the sort of plots that are produced. This time we'll specify the output file for the png image using the output: option. (For the histogram we used output:"png". That's a special case; it doesn't create a file called "png" but instead uses org-attach to store the output in the org-attach dir for this entry. The same applies to the other available output image formats: "jpg", "jpeg", "pdf", "ps", "bmp", "tiff".)
There were a couple of new features there. Firstly, I referred to column 1 using its column label, rather than with the integer 1. Secondly, note the use of the args: option. It takes the form of a lisp property list ("p-list"), specifying extra arguments to pass to the R function (in this case density()). Here we used it to set the line thickness (lwd=4).
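A density plot is essentially a smoothed histogram: a small bump is placed on every data point and the bumps are added up. The sketch below shows that idea in Python with a Gaussian kernel. It is not what org-R runs (that is R's density(), which also chooses the bandwidth automatically); the sample values and the hand-picked bandwidth of 0.8 are assumptions for illustration.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density of `data` at a point."""
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)

    def estimate(point):
        # Sum one Gaussian bump per data point, then normalise.
        return sum(math.exp(-0.5 * ((point - d) / bandwidth) ** 2)
                   for d in data) / norm

    return estimate

# Hypothetical bimodal sample, standing in for the simulated column.
sample = [-3.4, -3.1, -2.8, -2.5, 2.6, 2.9, 3.0, 3.3]
density = gaussian_kde(sample, bandwidth=0.8)  # bandwidth chosen by hand

for x in (-3, 0, 3):
    print(x, round(density(x), 3))
```

The estimate is high near -3 and 3 and close to zero in between, which is the two-humped curve one would expect from data like these.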
The raw data, as collected by Manish, is in a table called org-variables-table, in a file called variable-popcon.org. We use the file: option to specify the org file containing the data, and the table: option to specify the name of the table within that file. [An alternative would be to give the entry containing the table a unique id with org-id-get-create, refer to it with table:, and rely on the org-id mechanism to find it.]
Now we tabulate the data. (We're not currently taking the sensible step that Manish did of checking whether the variables were given values different from their default.)
Rather than cluttering up this org file with all the count data, we'll store them in a separate org file:
We can see the top few rows of the table by using action:head
Here's a barplot of the counts. It makes it clear that over half the org variables are customised by only one or two users.
OK, let's make a bit more use of R's capabilities. We can use the org-variables data set to define distances between pairs of org users (how similar their customisations are), and distances between pairs of org variables (the extent to which people who customise one of them customise the other). Then we can use those distance matrices to cluster org users, and org variables.
First, let's create a table that's restricted to variables that were customised by more than four users. That's going to require a bit of R code:
Now let's make a table with a row for each variable, and a column for each org user, and fill it with 1s and 0s according to whether user j customised variable i. We can do that without writing any R code:
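As an illustration of the table being built here, this Python sketch turns a list of (user, variable) records into a grid of 1s and 0s. The records are invented for the example, and the code only shows the shape of the result; it is not how org-R produces it.

```python
# Hypothetical (user, variable) records: one entry per customisation.
records = [
    ("user1", "org-agenda-files"),
    ("user1", "org-log-done"),
    ("user2", "org-agenda-files"),
    ("user3", "org-log-done"),
    ("user3", "org-todo-keywords"),
]

users = sorted({user for user, _ in records})
variables = sorted({variable for _, variable in records})
customised = set(records)

# Row i is a variable, column j is a user;
# the cell is 1 if user j customised variable i, else 0.
incidence = [[1 if (user, variable) in customised else 0 for user in users]
             for variable in variables]

print("variable".ljust(20), *users)
for variable, row in zip(variables, incidence):
    print(variable.ljust(20), *row)
```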
First we'll cluster org users. We use the R function dist to compute a distance matrix from the incidence matrix, then hclust to run a hierarchical clustering algorithm, and then plot to plot the results as a dendrogram:
And to cluster org variables, we use the transpose of that incidence matrix:
Please note that my main aim here was to give some examples of using org-R, rather than to show how the org variables data should be mined for useful information! The org-variables dendrogram does seem to have made some sensible clusterings (e.g. the clusters of agenda-related commands), but I'm going to leave it to others to decide whether this exercise really served to do more than illustrate org-R. Does anyone recognise any usage affinities between the clustered org users?
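For anyone curious what the dist and hclust steps amount to, here is a small Python sketch of the same idea: Euclidean distances between rows of a 0/1 matrix, followed by agglomerative clustering with complete linkage (both are, as far as I know, the defaults of the R functions). The matrix is made up, and the function is a plain illustration rather than a reimplementation of hclust.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_linkage(rows):
    """Agglomerative clustering; returns the merges in the order they happen."""
    clusters = [(i,) for i in range(len(rows))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: the distance between two clusters is
                # the largest distance between any pair of their members.
                d = max(euclidean(rows[p], rows[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical 0/1 incidence rows, one row per item being clustered.
rows = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
merges = complete_linkage(rows)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

Each merge corresponds to one join in the dendrogram, and the merge distance is the height at which that join is drawn. Clustering the columns instead of the rows just means transposing the matrix first.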
In addition to the action: option (described here), the following options are available:

| Option | Description |
|--------+-------------|
| infile:/path/to/file.csv | input data comes from file.csv |
| infile:http://www.somewhere/file.csv | input data comes from file.csv somewhere on the web |
| infile:/path/to/file.org | input data comes from file.org; must also specify table with intable: |
| intable:table-name | input data is in table named with #+TBLNAME:table-name (in same buffer unless infile:/path/to/file.org is specified) |
| intable:table-id | input data is first table under entry with table-id as unique ID. Doesn't make sense with infile:/path/to/file.org |
| rownames:t | does first column contain row names? (default: nil). If t, other column indices are as if the first column were not present (this may change) |
| colnames:nil | does first row contain column names? (default: t) |
| columns:2 or columns:(2) | operate only on column 2 |
| columns:"wing length" or columns:("wing length") | operate only on column named "wing length" |
| columns:((1)(2 3)) | (when plotting) plot columns 2 and 3 on y-axis against column 1 on x-axis |
| columns:(("age")("wing length" "fierceness")) | (when plotting) plot columns named "wing length" and "fierceness" on y-axis against "age" on x-axis |
| action:some-action | off-the-shelf plotting action or computation (see separate list), or any R function that makes sense (e.g. head, summary) |
| lines:t | (when plotting) join points with lines (similar to action:lines) |
| args:(:xlab "\"the x axis title\"" :lwd 4) | provide extra arguments as a p-list (note the need to quote strings if they are to appear as strings in R) |
| outfile:/path/to/image.png | save image to file and insert link into org buffer (also: .pdf, .ps, .jpg, .jpeg, .bmp, .tiff) |
| outfile:png | save image to file in org-attach directory and insert link |
| outfile:/path/to/file.csv | would make sense but not implemented yet |
| height:1000 | set height of graphical output (in pixels for png, jpeg, bmp, tiff, default 480; in inches for pdf, ps, default 7) |
| width:1000 | set width of graphical output in pixels (default 480 for png) |
| title:"title of table/plot" | title to be used in plot, and as #+TBLNAME of table output, and as name of link to output |
| colour:hotpink or col:hotpink or color:hotpink | main colour for plot (i.e. the `col' argument in R; enter colors() at the R prompt for a list of available colours) |
| sort:t | with action:tabulate, sort in decreasing count order (default is alphabetical on names) |
| output-to-buffer:t | force numerical output to org buffer (shouldn't be necessary) |
| inline:t | don't name links to output (so that graphics are inline when exported to HTML) |
| showcode:t | display a buffer containing the R code that was generated to do what was requested |
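The columns: forms listed above accept either 1-based column numbers or column names, and the nested form splits the selection into an x-axis part and a y-axis part. The Python sketch below shows one way such a specification could be interpreted. The function names `resolve` and `split_axes` are hypothetical and this is not org-R's own code; it only illustrates that the numeric and named forms pick out the same columns.

```python
def resolve(column, names):
    """Map a 1-based index or a column name to a 0-based position."""
    if isinstance(column, int):
        return column - 1
    return names.index(column)

def split_axes(spec, names):
    """Interpret a spec shaped like ((x) (y1 y2 ...)) as x and y positions."""
    x_part, y_part = spec
    return ([resolve(c, names) for c in x_part],
            [resolve(c, names) for c in y_part])

# Hypothetical column names, borrowed from the examples in the table.
names = ["age", "wing length", "fierceness"]

by_number = split_axes([[1], [2, 3]], names)
by_name = split_axes([["age"], ["wing length", "fierceness"]], names)
print(by_number)
print(by_name)
```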
| Action | Description |
|--------+-------------|
| *Actions that generate numerical output* | |
| tabulate | count occurrences of distinct input values. Input data should be discrete. This is function table in R. |
| summary | summarise data in columns (minimum, 1st quartile, median, mean, 3rd quartile, max) |
| head | show first 6 rows of a larger table |
| transpose | transpose a table |
| *Actions that generate graphical output* | |
| barplot | produces 'side-by-side' bar plots if multiple columns selected |
| plot | if only 1 column selected, index is automatic: 1,2,... |
| lines | same as plot |
| points | same as plot but don't join points with lines |
| density | like a smoothed histogram (i.e. a curve) |
| *Grid of values* | |
| image | a grid image, with cells coloured according to their numerical values |
Apart from tabulate, the action: names are the same as the names of the R functions which implement them. `tabulate' is really called `table' in R.
Note that, in addition to the actions listed above, you can also use action:R-function, where "R-function" is the name of any existing R function. The function must be able to take a data frame as its first argument, and must not require any further arguments (i.e. any further arguments must have suitable default values). Any numerical output will be sent to the org buffer (use output-to-buffer:t to force this, although if that is necessary then that is a bug).
My aim with org-R is to provide a fairly general facility for using R with Org. The TBLR lines and TBLRR lines together specify an R function, which may take numerical input, and may generate graphical output, or numerical output, or both.
If any input data have been specified, then the R function receives those data as its first argument. The input data may come from an Org table, or from a csv spreadsheet file. In either case they are tabular (1- or 2-dimensional). The input data are passed to the function as an R data frame (a table-like structure in which different columns may contain different types of data: numeric, character, etc.). Inside the R function, that data frame is called 'x'. 'x' is also the return value of the R function. Therefore the numerical output of org-R is determined by the modifications to the variable x that are made inside the function (any graphical output is a side effect).
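The contract described above (the table goes in as x, x comes back out, and plots are side effects) can be pictured with a small Python stand-in. Everything here is hypothetical: `generated_function`, the column names, and the data are made up to show the flow, and the real function is of course written in R.

```python
def generated_function(x):
    """Stand-in for the R function: receives the table as x and returns x."""
    # Whatever is assigned to x here determines the numerical output.
    x = [row for row in x if row["count"] > 4]
    # Graphical output would be produced here as a side effect;
    # it is not part of the return value.
    return x

# Hypothetical input table, one dict per row.
table = [
    {"variable": "org-agenda-files", "count": 12},
    {"variable": "org-log-done", "count": 3},
]
result = generated_function(table)
print(result)
```

If the body never reassigns x, the function returns its input unchanged, so the numerical output is just the input table.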
It's worth noting that one mode of using org-R would be to write your own code in a separate file, and use the source() function on a TBLRR line to evaluate the code in that file.
Numerical output of the function should also be tabular, and may be received by the Org buffer as an Org table, or sent to file in Org table or csv format. R deals transparently with multi-dimensional arrays, but Org table and csv format do not.
Unless an output file has been specified, graphical output will be displayed on screen.
The mapping from the TBLR and TBLRR lines to the R function may benefit from further thought; currently what happens is that code corresponding to the TBLR line is generated, and then any explicit user code is appended to this. Thus the TBLRR lines have the 'last word' on the output. Since multiple, intermixed, TBLR and TBLRR lines can be given, it might make sense instead to follow the order of those lines when constructing the code.
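The two orderings discussed above can be compared with a short Python sketch. The option lines and the code strings are placeholders, and the function names are hypothetical; the point is only to show how "generated code first, user code appended" differs from "follow the order of the lines".

```python
# Hypothetical option lines, in the order they appear in the buffer.
# "TBLR" entries stand for generated code, "TBLRR" entries for user code.
lines = [
    ("TBLR", "generated step 1"),
    ("TBLRR", "user step 1"),
    ("TBLR", "generated step 2"),
    ("TBLRR", "user step 2"),
]

def current_order(lines):
    """All generated code first, then all user code appended."""
    generated = [code for kind, code in lines if kind == "TBLR"]
    user = [code for kind, code in lines if kind == "TBLRR"]
    return generated + user

def buffer_order(lines):
    """Alternative: keep the code in the order the lines were written."""
    return [code for _, code in lines]

print(current_order(lines))
print(buffer_order(lines))
```

With the first scheme the user code always runs last, whatever the order of the lines; with the second, a later TBLR line can act on the result of an earlier TBLRR line.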