The National Archives Catalogue on Excel

Discussion in 'Research Material' started by Andrea D, Oct 10, 2010.

  1. gaspirator

    gaspirator Member

    I've managed to get a block of results using Lee's technique - just had to break the series down into year groups, ie run the above searches with limits of 1939-40, then repeat with 1941-44 etc, (or whatever's appropriate) bracketing down the time period if you hit more than 1000 results.

    More work, but successfully got 18,000 results from it. Problems when the series has more than 10,000 records, as you then have to search for "[ref]/10*", then "[ref]/11*" up to "[ref]/19*".

    Merged the spreadsheets, removed duplicate records, copied 'citable ref' column to right end of columns and 'split columns' using "/" as delimiter. Sorted according to the piece numbers, then simple formula copied and pasted to flag where the numerical sequence was broken to identify omissions.

    I had about 20 issues, 10 of which were due to an extra "/A" etc in the ref, about 6 results didn't come through because the number was not used, leaving me with 4 records to grab manually.

    If you've got time, it does work. I can expand on the above if anyone wants, but if Lee's going to devise a more efficient method, may not be worth it...

    By the way, always make sure you search for "FO 898/1*" as the search box can search for " FO AND 898 AND 1* ", which is not what you want. Use the 'exact word or phrase' box.

    - Pete
     
    PsyWar.Org likes this.
  2. PsyWar.Org

    PsyWar.Org Archive monkey

    Good effort Pete!

    Another way of reducing the number of results is to use "?" in the search query.

    e.g. "WO 171/1???" will only return results between WO 171/1000 to WO 171/1999;

    To get WO 171/1 to WO 1/999, you could do a search with "WO 171/*" and then sort results by "reference".

    The problem is there are likely still to be some missing references if there are items like "WO 171/1A" and WO 171/1B" or "WO 171/10/1" included. Also you'll need to filter out any sub-division titles - i.e. restrict the search just to "Piece" in the "Catalogue level" section.

    So it might be better to do searches like "WO 171/9??", etc. to get the remainder.

    Still not an ideal solution, I'm afraid.


    Lee
     
  3. PsyWar.Org

    PsyWar.Org Archive monkey

    This is so far the best solution I can come up with using Discovery's built in functions:
    (Incidentally, I since realised that sorting on "URLparams" does not always work)

    Using record series WO 166 as an example, what needs to be done is to grab the results into separate spreadsheets in the following order:

    WO 166/1 to WO 166/9;
    WO 166/10 to WO 166/99;
    WO 166/100 to WO 166/999;
    WO 166/1000 to WO 166/1999;
    WO 166/2000 to WO 166/2999;
    ... etc.


    This is how to do it using Discovery Advanced Search: http://discovery.nationalarchives.gov.uk/SearchUI/search/advanced-search

    Enter in:
    Keyword Options > all of these words = *
    Search Within > Reference = WO 166/?

    Hit Search button.

    On the search results page:
    sort by = Reference

    Hit Download button and save as Spreadsheet (CSV).

    Now need to repeat this for WO 166/?? and WO 166/???.

    Then do this again for WO 166/1???, WO 166/2???. etc until all results have been grabbed.
    However in many cases there will be more than 1000 results. A workaround for this is to filter the results by Catalogue Level
    First select Piece only (usually should have 999 results);
    Then do it again for Item only (usually should have much less than 1000 results);
    Remember to sort by reference each time.

    Then it is a case of copying and pasting all the spreadsheets into one master spreadsheet, do it in order to help keep it sorted by reference.


    This way should get everything and avoid picking up duplicate records. However, it's not ideal with record series with thousands of sub-items like AIR 1 but fine in most cases. The important thing is to check on the number of search results, if more than 1000 than using the Catalogue Level filter to get under 1000.

    I'll try to add a couple of screen shots later to make it clearer.


    Lee
     
    Aixman, CL1, Drew5233 and 2 others like this.
  4. bamboo43

    bamboo43 Very Senior Member

    Thanks for all your efforts Gents, these pathway tips are incredibly useful to us all.

    I've been on some other forums lately and there is a litter of posts complaining about recent search engine upgrades and changes, particularly for FMP and London Gazette.
     
    PsyWar.Org likes this.
  5. Drew5233

    Drew5233 #FuturePilot 1940 Obsessive

    Once I, well Andrea, can figure out how to do it I'll add more series to Excel and post them on this thread. One I'm particularly interested in doing is WO 305 series which has some inter-war unit war diaries and Falklands War unit war diaries in it.
     
  6. PsyWar.Org

    PsyWar.Org Archive monkey

    Andy, the method in post #103 will work fine with WO 305.

    Although bear in mind the series is still accuring so the spreadsheet will need updating once in awhile.
    The series had almost weekly additions for much of last year, especially with records from the 1960s and 1970s.

    Actually it's probably worth uploading as separate spreadsheets in 1000 piece ranges. Then just the last spreadsheet with need to be updated as more records are added later. What I mean is do a spreadsheet for WO 305/1 to WO 305/999, another for WO 305/1000 to WO 305/1999, etc.

    Lee
     
    Drew5233 likes this.
  7. gaspirator

    gaspirator Member

    Only just found time to use your method Lee, but it works!

    Just a couple of pitfalls I noticed on WO 199.

    I did as per your method as far as “WO 199/3???” as this retrieved WO 199/3392, which is the final file in WO 199. “WO 199/4???” returned no results, so I stopped searching. I didn’t need to resort to filtering by piece/item, as no searches were over 1,000 hits.

    The catalogue states that there are 3392 files, but my combined (and sorted) spreadsheet put WO 199/3392 on row 3385 (or 3384 if you remove the header row).
    I knocked up a formula to flag where the numerical sequence was broken and found 10 gaps, due to the problem of an extra A/B/C on the number.

    Example: searching on “WO 199/??” will include WO 199/18, which is a blank record that refers you to WO 199/18A and WO 199/18B; neither of these two latter appear in this search as the reference is longer. They do, however, appear (as you would expect) in the next search “WO 199/???”

    This is not a problem (except for sorting) until you get reference lengths that are longer than you anticipate. My searches picked up WO 199/356A and 356B as a matter of course, but the next case, WO 199/911A and 911B and higher numbers were not picked up in the original series of searches.

    I stopped at “WO 199/4???” - to locate the WO 199/911A/B files I would have had to have continued searching up to “WO 199/9???”. This brings back these and another A/B pair at WO 199/954.

    As my formula listed where the gaps were, I could target searches, so “WO 199/2????” returned 11 new results starting at WO 199/2090A.

    As a result I think WO 199 has 3408 database records (including WO 199/18 which is a blank record and doesn’t have a physical file associated with it), but even without, there seem to be 16 physical files more than the catalogue states. I assume the basic catalogue entry is simply using the highest piece number and not allowing for these extra A/B/C files.

    In a nutshell, there may be extra files lurking with longer references that you won’t pick up unless you extend your searches.
    Searching for “WO 199/???A” will return WO 199/911A and several others with a 3-digit number appended with an “A” but it only works if you know that A/B/C is being used, and not some other system.

    Using the three reference fields filled with “WO 199/???A”, “WO 199/???B” and “WO 199/???C” or perhaps “WO 199/???A” and “WO 199/????A”, speeds things up, if you’re confident that the extra character is consistent - you can then repeat with “B” and “C”.

    There must be an easier way to do this, Lee?

    - Pete
     
  8. PsyWar.Org

    PsyWar.Org Archive monkey

    Pete, those extra A and Bs do throw a spanner in the works.

    But everything will be picked up if continuing the sequence through to the bitter end, til WO 199/9??? is reached.

    Although you'd still need to do a final check up with WO 199/????? to get, in this case, the last 15 results like: WO 199/2979A, etc.

    And if the record series has more than 9,999 pieces then it gets very tedious indeed.


    It's not an ideal solution but the best work around I can come up with, without resorting to more extreme solutions that might upset Kew.
     
  9. PsyWar.Org

    PsyWar.Org Archive monkey

    Oops I was forgetting the 1000 limit again, so those A and Bs are going to throw the results over 1000 again. Back to the drawing board.
     
  10. geoff501

    geoff501 Achtung Feind hört mit

    So is it to be page scrape?
    (my third favorite pastime)
     
  11. gaspirator

    gaspirator Member

    I still find your method of immense use though Lee!

    I've now tried searching for "WO 199/*A", "WO 199/*B" and "WO 199/*C" to grab all odd records first, then search "WO 199/1???" etc, but use the search to exclude "WO 199/*A" etc.

    I've pretty much got the data I need now, but I'll throw any thoughts/queries up if I encounter any more!

    - Pete
     
  12. PsyWar.Org

    PsyWar.Org Archive monkey

    That would be the best solution Geoff. ;) Although might upset Kew if it was out in the wild.
     
  13. geoff501

    geoff501 Achtung Feind hört mit

    ( got 6.5 million pages from there before now :ph34r: )
     
  14. RAFCommands

    RAFCommands Senior Member

    Lee,

    I use a way to download..other than a datascrape... a series but it involves a different way than the excel download button and it outputs xml which I use for post processing.

    Drop me a PM for details, if you want chapter and verse.

    I'll have a play with WO 199 and see if I can use the method via the download button on Discovery later tonight and report back.

    Regards
    Ross
     
  15. PsyWar.Org

    PsyWar.Org Archive monkey

    Ross, would certainly be interested in your method.

    No problem converting the XML to a spreadsheet, it's getting around the 1000 record download limit that's the pain in the backside for me.
    So all suggestions are gladly welcomed :)

    Edit: Ross is your method using the RESTful API? This used to be controlled by IP address access but looks like it's open to all now.

    Lee
     
  16. RAFCommands

    RAFCommands Senior Member

    Yup the Beta API,

    Best to apply for permission (quickly given) as the use is monitored carefully and unknown IPs rapidly blocked.

    Tests shown NBG with the download button

    Will drop an PM with details of the workround.

    Ross
     
  17. PsyWar.Org

    PsyWar.Org Archive monkey

    With the recent update to Discovery wildcard searches using "?" and "*" have been removed.
    So now not even those convoluted instructions I posted above will work anymore to export search results over a thousand hits.
     
  18. geoff501

    geoff501 Achtung Feind hört mit

    Why do they change these things? 90% of searching algorithms is common sense, a no brainer, yet it is not often done. Have just discovered that the SWB data on Ancestry can be searched on badge number, but not regimental number, even though these numbers are indexed. At least CWGC name search is now fixed.
     
  19. PsyWar.Org

    PsyWar.Org Archive monkey

    I guess they'd cite for performance reasons as they've added lots of non-National Archives data to the search engine. Personally I'd prefer they stick to purely their own archives and make the search engine as flexible as possible. There are a couple of good design improvements though.
     
  20. Drew5233

    Drew5233 #FuturePilot 1940 Obsessive

    I thought I had added this series on Excel already:

     

    Attached Files:

Share This Page