getting metadata from pubmed and some comments

Discussion:

Aurélien Naldi

2007-11-06 14:59:01 UTC

Hi,

I'm a computer scientist by training, now working in bioinformatics.
As such I am dealing with tons of biology papers. I have found
referencer to be a great tool, and the metadata fetching through crossref
is nice, but I never got more than the family name of the first
author in the author field. Pubmed has much more complete metadata for
the papers I am currently dealing with, so I would like to know if
adding support for pubmed to referencer is possible.

I have just looked at how to get metadata through pubmed, here is a
quick introduction:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax=20&retstart=0&term=<!your_search_term!>

Requesting this URL gives you a list of matches in this form:

<eSearchResult>
<Count>1</Count>
<RetMax>1</RetMax>
<RetStart>0</RetStart>
<IdList>
<Id>17581588</Id>
</IdList>
<TranslationSet>
</TranslationSet>
<TranslationStack>
<TermSet>
<Term>10.1038/nature05970[All Fields]</Term>
<Field>All Fields</Field>
<Count>1</Count>
<Explode>Y</Explode>
</TermSet>
<OP>GROUP</OP>
</TranslationStack>
<QueryTranslation>10.1038/nature05970[All Fields]</QueryTranslation>
</eSearchResult>

The important part is the IdList: it gives the list of PMIDs matching
the search. To get more data on a particular entry, use this URL:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&rettype=citation&id=<!PMID!>

The result is a huge XML file with a full list of authors, the abstract,
and much more.
Some documentation (which I have not really read yet) is available at
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=coursework.section.brief
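
As a rough illustration of the two steps above (read the IdList, then ask
efetch for one entry), here is a minimal C++ sketch. The function names
extractPmids and efetchUrl are made up for this example, and the regex-based
extraction is only a shortcut that assumes the simple response shape shown
above; real code would more likely use a proper XML parser.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Collect the text of every <Id> element in an eSearchResult document.
// Assumption: IDs are plain digit strings, as in the sample response.
std::vector<std::string> extractPmids(const std::string &xml)
{
    static const std::regex idRe("<Id>\\s*([0-9]+)\\s*</Id>");
    std::vector<std::string> pmids;
    for (std::sregex_iterator it(xml.begin(), xml.end(), idRe), end;
         it != end; ++it) {
        pmids.push_back((*it)[1].str());
    }
    return pmids;
}

// Build the efetch URL for one PMID, following the URL pattern quoted above.
std::string efetchUrl(const std::string &pmid)
{
    assert(!pmid.empty());  // an empty id would fetch nothing useful
    return "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
           "?db=pubmed&retmode=xml&rettype=citation&id=" + pmid;
}
```

Feeding the sample response above to extractPmids should yield the single
PMID from its IdList, which can then be passed to efetchUrl.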

I have not seen an explicit search by doi, but searching for a doi does
work (without the "doi:" prefix).
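
A minimal sketch of that DOI lookup, assuming the DOI is simply passed as the
esearch term. percentEncode and esearchUrlForDoi are hypothetical helper
names, not part of referencer or of any NCBI library.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Percent-encode everything except the RFC 3986 "unreserved" characters.
std::string percentEncode(const std::string &s)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : s) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

// Build an esearch URL that looks a DOI up as a plain search term.
// A leading "doi:" prefix is dropped, since only the bare DOI matched.
std::string esearchUrlForDoi(std::string doi)
{
    const std::string prefix = "doi:";
    if (doi.compare(0, prefix.size(), prefix) == 0)
        doi.erase(0, prefix.size());
    assert(!doi.empty());  // nothing to search for otherwise
    return "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
           "?db=pubmed&retmax=20&retstart=0&term=" + percentEncode(doi);
}
```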

AFAIK, PDFs unfortunately don't include a PMID, but this gives much
better results than crossref for biology papers...

While I am at it, I have some (naive) questions about your XML format:
* Is it referencer-specific ?
* What are its advantages over bibtex XML (or other similar stuff) ?
It seems to deal better with "tags" (bibtexxml has keywords) and
pdffilenames (bibtexxml has only a relative path, when exported with
jabref) and to add the "manage_target" thing, that I do not use (yet).
Is there anything else ?
* I see only one "authors" field, which is way too "bibtex like" for
my taste. Having a clean separation of authors and being able to split
family name and given name looks nice to me.
Is it possible to extend the format to deal with this ?

And a final comment: some of my pdf files did not contain a doi entry,
and when adding a whole directory, I got one error dialog for each of
them. It would be much more useful to remember the list of problematic
files and to show the list at the end of the process. Giving them a
"this thing needs work" tag could be nice also, what do you think about
this ?

Thanks for your work on this nice tool!

Best regards.

--
Aurélien Naldi

John Spray

2007-11-06 20:17:29 UTC

Post by Aurélien Naldi
I'm a computer scientist by training, now working in bioinformatics.
As such I am dealing with tons of biology papers. I have found
referencer to be a great tool, and the metadata fetching through crossref
is nice, but I never got more than the family name of the first
author in the author field. Pubmed has much more complete metadata for
the papers I am currently dealing with, so I would like to know if
adding support for pubmed to referencer is possible.

More metadata sources would be a good thing, and are a key future
feature requirement. Crossref deliberately cripple their publicly
accessible OpenURL interface to provide only the author's last name, so
any future metadata code will move away from this.

Post by Aurélien Naldi
I have just looked at how to get metadata through pubmed, here is a

That's very helpful, I will refer to it if/when I'm experimenting with
pubmed support.

Post by Aurélien Naldi
* Is it referencer-specific ?

Yes, I made it up off the top of my head.

Post by Aurélien Naldi
* What are its advantages over bibtex XML (or other similar stuff) ?
It seems to deal better with "tags" (bibtexxml has keywords) and
pdffilenames (bibtexxml has only a relative path, when exported with
jabref) and to add the "manage_target" thing, that I do not use (yet).
Is there anything else ?

I don't know bibtex xml. Referencer's format isn't intended to be
bibtex-specific, so a pure xml representation of bibtex wouldn't be
suitable.

Post by Aurélien Naldi
* I see only one "authors" field, which is way too "bibtex like" for
my taste. Having a clean separation of authors and being able to split
family name and given name looks nice to me.
Is it possible to extend the format to deal with this ?

Yes, it would be. The Library of Congress MODS format implements this
for example. My main issue with this is the UI: does one then have
separate first name/last name/initials fields? It could get pretty
cluttered. I'm certainly open to suggestions in this area, since it's a
key point where bibtex-isms (curly braces {}) are necessarily exposed to
the user at present.

Post by Aurélien Naldi
And a final comment: some of my pdf files did not contain a doi entry,
and when adding a whole directory, I got one error dialog for each of
them. It would be much more useful to remember the list of problematic
files and to show the list at the end of the process. Giving them a
"this thing needs work" tag could be nice also, what do you think about
this ?

Fair point. The "this thing needs work" tag could be an option on the
"here are the files that had problems" dialog. (There isn't an error
dialog for simply not finding a DOI code, so I guess you're talking
about the error when a DOI cannot be resolved to metadata by crossref.)

Regards,
John

Aurélien Naldi

2007-11-06 20:35:36 UTC

Post by John Spray

Post by Aurélien Naldi
I have just looked at how to get metadata through pubmed, here is a

That's very helpful, I will refer to it if/when I'm experimenting with
pubmed support.

Glad to read this.
I think I can help with this...
Does referencer have (or plan to have) a plugin system or something to add
metadata fetchers easily ?

Post by John Spray

Post by Aurélien Naldi
* Is it referencer-specific ?

Yes, I made it up off the top of my head.

I don't know bibtex xml. Referencer's format isn't intended to be
bibtex-specific, so a pure xml representation of bibtex wouldn't be
suitable.

I also do not think that a "bibtex, but in XML" is the best way to go,
and I definitely do not want to have to deal with this the bibtex
way... The UI is not trivial, but maybe not that important if the
metadata fetchers are good enough ;)
I'm not sure about the "initial" field, can't it be deduced from the
"given name" one ?
One annoying thing about having separate fields is copy/pasting
the whole list of authors. Maybe keeping a large field can be convenient
for this use case ?

Post by John Spray

Oh, yes, this was before I realized I had to add username:password to
the crossref URL. I also had this with some PDFs where the doi is
split across two lines (thus referencer only found the first half of it).
But I do think that putting a special tag on files without doi/metadata
is good.

Best regards

--
Aurélien Naldi <aurelien.naldi at gmail.com>

John Spray

2007-11-06 22:10:04 UTC

Post by Aurélien Naldi
I think I can help with this...
Does referencer have (or plan to have) a plugin system or something to add
metadata fetchers easily ?

The interface for metadata-fetching code isn't well defined at present
but it shouldn't be too difficult to do, and I would appreciate
assistance with it. Here are some quick thoughts on the possibilities.

In the current situation, there are the following methods:

void BibData::guessDoi (Glib::ustring const &raw_)
void BibData::guessArxiv (Glib::ustring const &raw_)
void BibData::getCrossRef ()
void BibData::getArxiv ()

The guess methods scan the raw text for regexes that look like
identifiers. The get methods use the Transfer class to get the
necessary URLs and then use BibData::parseCrossRefXML and
BibUtils::parseBibUtils respectively to populate the BibData
with the downloaded metadata.

The guess functions are called from Document::readPDF when a PDF is
first added: this loads the raw text of a pdf and processes it
page-by-page (to avoid loading the whole thing when the identifier is on
the first page).
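
To make the guess step concrete, here is a minimal, self-contained sketch of
scanning raw text for a DOI and stopping at the first page that contains one.
It uses std::string rather than Glib::ustring so that it compiles on its own,
and the regex is only an illustrative guess at what a DOI looks like, not the
pattern referencer actually uses. guessDoiInPages is a made-up name.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Return the first DOI-looking token in the raw text, or "" if none.
// Illustrative pattern: "10.", a 4-9 digit registrant, "/", then a suffix.
std::string guessDoi(const std::string &raw)
{
    static const std::regex doiRe("10\\.[0-9]{4,9}/[^\\s\"<>]+");
    std::smatch m;
    if (!std::regex_search(raw, m, doiRe))
        return "";
    std::string doi = m.str();
    // Drop trailing punctuation that usually belongs to the sentence.
    const std::string trailing = ".,;)";
    while (!doi.empty() && trailing.find(doi.back()) != std::string::npos)
        doi.pop_back();
    assert(!doi.empty());  // the match starts with "10.NNNN/", so it survives
    return doi;
}

// Scan page by page and stop at the first hit, so later pages are never
// examined when the identifier sits on the first page.
std::string guessDoiInPages(const std::vector<std::string> &pages)
{
    for (const std::string &page : pages) {
        const std::string doi = guessDoi(page);
        if (!doi.empty())
            return doi;
    }
    return "";
}
```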

The get functions are called in Document::getMetaData, and the document
determines which kinds of metadata it can get in
Document::canGetMetaData.

Both of the existing mechanisms (arxiv and crossref) could be
implemented as descendants of a MetadataFetcher abstract class with
guess() and get() methods. However, for efficiency the guess() methods
should probably be combined into a global guessing function which uses
regexes provided by the MetadataFetcher implementations.
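
One possible shape for this, as a sketch only: every name below
(MetadataFetcher's methods, Fields, StubDoiFetcher, guessAll) is hypothetical,
Fields stands in for BibData, and the stub returns canned data instead of
doing any network transfer.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <regex>
#include <string>
#include <vector>

typedef std::map<std::string, std::string> Fields;  // stand-in for BibData

class MetadataFetcher {
public:
    virtual ~MetadataFetcher() {}
    virtual std::string name() const = 0;
    // Regex whose first capture group is the identifier this fetcher wants.
    virtual std::string idPattern() const = 0;
    // Fill 'out' for 'id'; return false on failure.
    virtual bool get(const std::string &id, Fields &out) const = 0;
};

// Stub used for illustration: recognises DOIs and echoes them back.
class StubDoiFetcher : public MetadataFetcher {
public:
    std::string name() const { return "stub-doi"; }
    std::string idPattern() const { return "(10\\.[0-9]{4,9}/[^\\s\"<>]+)"; }
    bool get(const std::string &id, Fields &out) const
    {
        if (id.empty())
            return false;
        out["doi"] = id;
        out["source"] = name();
        return true;
    }
};

// Global guessing function: runs each fetcher's regex over the raw text
// and returns a map from fetcher name to the identifier it found.
std::map<std::string, std::string> guessAll(
    const std::vector<std::unique_ptr<MetadataFetcher>> &fetchers,
    const std::string &raw)
{
    std::map<std::string, std::string> found;
    for (const auto &f : fetchers) {
        assert(f != nullptr);
        const std::regex re(f->idPattern());
        std::smatch m;
        if (std::regex_search(raw, m, re) && m.size() > 1)
            found[f->name()] = m[1].str();
    }
    return found;
}
```

Compiling each pattern on every call is wasteful; a real implementation
would presumably compile the regexes once when the fetchers are registered.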

To manage N MetadataFetchers we would need at least

* Priority information: which are our favourite fetchers? Would
probably put things like pubmed and arxiv at the top and use
general DOI stuff like crossref as a backup. Preferably provide
UI for setting this.
* Enable/disable information, and associated UI so that plugins
which are broken for a given user or just spurious and wasting
time can be disabled.
* Hooks for the fetchers to provide preferences UI.
* Some mechanism for the fetchers to share identifiers: more than
one is going to understand DOIs. So perhaps this leads to a
IdentifierScanner class (for the guess methods) and a separate
MetadataFetcher class (for the get methods). Then need a
general way of specifying which fetchers can deal with which
identifiers.
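
A rough sketch of how the priority, enable/disable and identifier-sharing
points could fit together in one table. FetcherEntry and fetchersFor are
invented names, and "lower number means tried first" is just one possible
convention.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct FetcherEntry {
    std::string name;                      // e.g. "pubmed"
    int priority;                          // lower value = tried earlier
    bool enabled;                          // user can switch a fetcher off
    std::vector<std::string> identifiers;  // identifier types it understands
};

// Return the names of enabled fetchers that understand 'idType',
// ordered from most to least preferred.
std::vector<std::string> fetchersFor(std::vector<FetcherEntry> entries,
                                     const std::string &idType)
{
    assert(!idType.empty());
    std::stable_sort(entries.begin(), entries.end(),
                     [](const FetcherEntry &a, const FetcherEntry &b) {
                         return a.priority < b.priority;
                     });
    std::vector<std::string> names;
    for (const FetcherEntry &e : entries) {
        if (!e.enabled)
            continue;
        if (std::find(e.identifiers.begin(), e.identifiers.end(), idType)
                != e.identifiers.end())
            names.push_back(e.name);
    }
    return names;
}
```

With pubmed and crossref both declaring "doi", a lookup for "doi" returns
both in priority order, so crossref naturally acts as the backup.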

Once the interface is well defined, a plugin system becomes possible.
But I think the first step is definitely to refine the interface within
the existing monolithic C++.

If you start hacking on this then feel free to grab me on google talk
for questions (jcspray attt gmail.com).

Post by Aurélien Naldi
I'm not sure about the "initial" field, can't it be deduced from the
"given name" one ?

I guess one usually either has the initials or the first names (not
both), so it's not much of an issue.

Post by Aurélien Naldi
One annoying thing about having separate fields is copy/pasting
the whole list of authors. Maybe keeping a large field can be convenient
for this use case ?

Or having in general an entry for adding authors (type "John Spray"
without changing fields) which then appear in a list automatically
parsed, such that the user can tweak them to his liking. The challenge
is doing it in a way that takes a minimal amount of space on the screen.
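
A deliberately naive sketch of the "type a name, get it parsed" idea. Author
and parseAuthor are invented names, and the rule used here (text before a
comma is the family name, otherwise the last word is) is only an assumption
that will be wrong for plenty of real names, which is why the user would
still need to be able to tweak the result.

```cpp
#include <cassert>
#include <string>

struct Author {
    std::string given;
    std::string family;
};

// Strip leading and trailing whitespace.
static std::string trim(const std::string &s)
{
    const char *ws = " \t\r\n";
    const std::string::size_type b = s.find_first_not_of(ws);
    if (b == std::string::npos)
        return "";
    const std::string::size_type e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Accepts "Given Family" or "Family, Given".
Author parseAuthor(const std::string &typed)
{
    Author a;
    const std::string s = trim(typed);
    const std::string::size_type comma = s.find(',');
    if (comma != std::string::npos) {        // "Family, Given"
        a.family = trim(s.substr(0, comma));
        a.given = trim(s.substr(comma + 1));
        return a;
    }
    const std::string::size_type sp = s.find_last_of(" \t");
    if (sp == std::string::npos) {           // single word: family name only
        a.family = s;
        return a;
    }
    assert(sp + 1 < s.size());               // s is trimmed, so a word follows
    a.given = trim(s.substr(0, sp));
    a.family = s.substr(sp + 1);
    return a;
}
```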

Cheers,
John

Aurélien Naldi

2007-11-07 08:39:28 UTC

Post by John Spray
The interface for metadata-fetching code isn't well defined at present
but it shouldn't be too difficult to do, and I would appreciate
assistance with it. Here are some quick thoughts on the possibilities.
void BibData::guessDoi (Glib::ustring const &raw_)
void BibData::guessArxiv (Glib::ustring const &raw_)
void BibData::getCrossRef ()
void BibData::getArxiv ()
The guess methods scan the raw text for regexes that look like
identifiers. The get methods use the Transfer class to get the
necessary URLs and then use BibData::parseCrossRefXML and
BibUtils::parseBibUtils respectively to populate the BibData
with the downloaded metadata.
The guess functions are called from Document::readPDF when a PDF is
first added: this loads the raw text of a pdf and processes it
page-by-page (to avoid loading the whole thing when the identifier is on
the first page).
The get functions are called in Document::getMetaData, and the document
determines which kinds of metadata it can get in
Document::canGetMetaData.
Both of the existing mechanisms (arxiv and crossref) could be
implemented as descendants of a MetadataFetcher abstract class with
guess() and get() methods. However, for efficiency the guess() methods
should probably be combined into a global guessing function which uses
regexes provided by the MetadataFetcher implementations.
To manage N MetadataFetchers we would need at least
* Priority information: which are our favourite fetchers? Would
probably put things like pubmed and arxiv at the top and use
general DOI stuff like crossref as a backup. Preferably provide
UI for setting this.
* Enable/disable information, and associated UI so that plugins
which are broken for a given user or just spurious and wasting
time can be disabled.

These two things can easily go together in the pref UI: a list with
"up" and "down" buttons and a checkbox on each row. Similar UIs exist in
other gnome applications.

Post by John Spray
* Hooks for the fetchers to provide preferences UI.

Yes, having it in the UI would be nice, but I have no idea what this
should look like (a separate dialog, or shown inside the main pref
dialog ? would each fetcher have to define its own UI or could it be
somehow generic ?). In the meantime, it may be easier to start with
gconf-only prefs. A "metadata" subdirectory with keys prefixed by the
name of the metadata fetcher looks sane to me.

Post by John Spray
* Some mechanism for the fetchers to share identifiers: more than
one is going to understand DOIs. So perhaps this leads to a
IdentifierScanner class (for the guess methods) and a separate
MetadataFetcher class (for the get methods). Then need a
general way of specifying which fetchers can deal with which
identifiers.

You mentioned a Document::canGetMetaData function, maybe it should be
the other way around: always call the enabled fetchers on the document,
and let the fetcher decide if it can do something or not, depending on
what was found by the "getWhatever" functions ?
Also, I am all for the combined ID guesser, as many (?) fetchers may
reuse the same stuff. ID guessing regexes could be shared through another
object giving the regex for a given type of ID. It should also be able to
give the web link for a given value of the ID. Then all fetchers just need
to say which one they need and the common ID guesser can use only the
regexes required by the active fetchers. I'm not sure whether a fetcher
should be able to rely on several IDs.
Oh, this could even be used to know which ID a fetcher can use and allow
calling only the relevant ones (voiding my previous statement).

Related to this, I can only see a "doi" field, where is the arxiv ID
stored ? It may be nice to have a set of "ID" associated with the
document, each with a name and a value.
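
Something like the following is what I have in mind, as a sketch only:
DocumentIds and webLink are invented names, the type strings ("doi", "arxiv",
"pmid") are just a convention picked for the example, and the link prefixes
are the public resolver addresses as I understand them.

```cpp
#include <cassert>
#include <map>
#include <string>

// A set of named identifiers attached to one document.
class DocumentIds {
public:
    void set(const std::string &type, const std::string &value)
    {
        assert(!type.empty());
        ids_[type] = value;
    }
    bool has(const std::string &type) const
    {
        return ids_.find(type) != ids_.end();
    }
    // Returns "" when the document has no identifier of that type.
    std::string get(const std::string &type) const
    {
        const std::map<std::string, std::string>::const_iterator it =
            ids_.find(type);
        return it == ids_.end() ? std::string() : it->second;
    }
private:
    std::map<std::string, std::string> ids_;
};

// Web link for a given identifier value; "" for an unknown type.
std::string webLink(const std::string &type, const std::string &value)
{
    if (type == "doi")
        return "https://doi.org/" + value;
    if (type == "arxiv")
        return "https://arxiv.org/abs/" + value;
    if (type == "pmid")
        return "https://pubmed.ncbi.nlm.nih.gov/" + value + "/";
    return "";
}
```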

Post by John Spray
Once the interface is well defined, a plugin system becomes possible.
But I think the first step is definitely to refine the interface within
the existing monolithic C++.
If you start hacking on this then feel free to grab me on google talk
for questions (jcspray attt gmail.com).

I guess one usually either has the initials or the first names (not
both), so it's not much of an issue.

Post by Aurélien Naldi
One annoying thing about having separate fields is copy/pasting
the whole list of authors. Maybe keeping a large field can be convenient
for this use case ?

If you want to gain some space, then showing only one large field is
better, but it has to be larger: some papers have MANY authors, and editing
them in a small text field is painful.
Showing the large field by default with a "detail" expander to reveal
the list of small ones may be a good compromise (the "real" data being
stored in a clean list and the large field being just an automatic
aggregation of it).
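
The aggregation itself would be trivial. A minimal sketch, where Author and
joinAuthors are invented names and the "Given Family" order and the
separator are arbitrary choices:

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Author {
    std::string given;
    std::string family;
};

// Build the text of the large field from the structured author list.
std::string joinAuthors(const std::vector<Author> &authors,
                        const std::string &separator = ", ")
{
    assert(!separator.empty());  // names would run together otherwise
    std::string out;
    for (std::vector<Author>::size_type i = 0; i < authors.size(); ++i) {
        if (i > 0)
            out += separator;
        if (!authors[i].given.empty())
            out += authors[i].given + " ";
        out += authors[i].family;
    }
    return out;
}
```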

Some more things that I do not like:
* putting the proxy stuff in the pref dialog is probably not a good
thing, why not just a button to launch the corresponding gnome capplet ?
* I'm not fond of having the "property dialog" as a dialog. I would
prefer a panel inside the main window for this (maybe as I am used to
jabref ?)

Best regards

--
Aurelien

Leonardo Fontenelle

2007-11-06 23:33:53 UTC

I'm not (yet) a regular Referencer user, but I'm very interested in
PubMed integration.

I was following the thread and noticed the "family name, given name"
part. I won't be able to say much about it, but I'd like to suggest
this reading:

http://rishida.net/blog/?p=100

Best regards!

Leonardo Fontenelle
http://leonardof.org


Michele Mattioni

2007-11-07 00:11:07 UTC

+1 to really cool pubmed integration in referencer ;)
