PDF Indexer


Completely rewritten for Joomla 1.5.
Allows PDFs to be indexed and searched via the Joomla/Mambo search module. This component allows you to index PDFs on your site, and the corresponding plugin (mosbot) allows that index to be searched. Directories that contain PDFs can be set to public, registered, or do not index. Will work on servers using Linux, Windows, OS X, and FreeBSD.
Will not work in safe mode, nor will it work when popen is disabled.
Does not work with GoDaddy hosting.
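To illustrate the three directory settings mentioned above, here is a minimal sketch in Python of how search results could be filtered by directory setting. It is not the component's actual code: the function names, the settings dictionary, and the rule that unconfigured directories are treated as "do not index" are all assumptions made for the example.

```python
from typing import Dict, List, Optional

# Hypothetical labels for the three directory settings in the description.
PUBLIC = "public"
REGISTERED = "registered"
DO_NOT_INDEX = "do not index"


def directory_setting(path: str, settings: Dict[str, str]) -> str:
    """Return the setting of the deepest configured directory containing path."""
    best_dir: Optional[str] = None
    best_len = -1
    for directory in settings:
        prefix = directory.rstrip("/") + "/"
        if path.startswith(prefix) and len(prefix) > best_len:
            best_dir = directory
            best_len = len(prefix)
    # Assumption: anything outside a configured directory is not indexed.
    if best_dir is None:
        return DO_NOT_INDEX
    return settings[best_dir]


def visible_results(paths: List[str], settings: Dict[str, str],
                    is_registered: bool) -> List[str]:
    """Keep only the results the current visitor is allowed to see."""
    allowed = {PUBLIC, REGISTERED} if is_registered else {PUBLIC}
    return [p for p in paths if directory_setting(p, settings) in allowed]
```

A guest would only see files under public directories, while a logged-in user would also see files under registered ones.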
The installation was flawless and, other than a minor hiccup (perhaps caused by me), it has worked seamlessly from the first time it was installed. This plug-in is worthy of future development and financial support.
I have recommended this plug-in to other digital archivists multiple times and will continue to do so in the future. It is head and shoulders above other similar offerings.
MrDog, you said you made a small change so that the index will point to DOCman records. Could you post that somewhere?
The developer seems to have abandoned PDF Indexer after he had his child. Since PDF Indexer is free software released under GPL, maybe it would be possible to make an updated version that supports at least UTF-8 indexing out of the box?
There was an earlier comment on integration with DOCman, which is true: it does not respect DOCman restricted access. But I wrote a very small change which seems to overcome this, so if anyone is interested, contact me.
Hi MyDog,
Do you mind shooting me an email with the change you wrote? I would like to take a look and include it in my latest release.
Thanks,
Nate Maxfield
nate[at]natemaxfield.com
I encountered the problem of reading pdfs in Latin1 encoding, so database population stopped at special characters like ë and à. Because Joomla 1.5 is utf-8, you need to get the text from the pdf in utf-8 also. You can change the command line of the pdftotext command for this. In admin.file_index.php near line 350 you change
pdftotext \"$original_name\" - 2>&1
to
pdftotext -enc UTF-8 \"$original_name\" - 2>&1
Watch for both the local and the component command.
I don't know if this works for the Windows command.
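For anyone scripting the same conversion outside the component, here is a minimal sketch in Python of the UTF-8 call described above. The function names are hypothetical and not part of PDF Indexer; only the pdftotext options (`-enc UTF-8`, and `-` to write to standard output) come from the tip above. It assumes pdftotext is installed and on the PATH.

```python
import subprocess
from typing import List


def build_pdftotext_command(pdf_path: str, encoding: str = "UTF-8") -> List[str]:
    """Return the argument list for pdftotext writing text to stdout."""
    # "-enc" selects the output text encoding; the trailing "-" sends the
    # extracted text to standard output instead of a .txt file.
    return ["pdftotext", "-enc", encoding, pdf_path, "-"]


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return its output decoded as UTF-8."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # mirrors the 2>&1 in the original command
        check=False,
    )
    # Undecodable bytes are replaced rather than stopping the import.
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the arguments as a list avoids the manual quoting of the file name that the shell version needs.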
Great component!
However, if you use DOCman, be aware that these extensions - PDF Indexer & Doc Indexer - do not care about what security settings you have enabled in DOCman.
So if you have a Private / Restricted / Hidden file in DOCman it won't matter. Using the standard Joomla Search with this addon will find that file and allow ANY USER to open it.
Like I said, it's a good extension and does what it should. But just be aware that it creates a security hole if you're using DOCman.
This will enable the standard Joomla! Search to index DOCman documents. Next, follow the instructions on how to index PDFs through PDF Indexer from this page. Now everything will fall in place ...
**In fact I have disabled the standard 'DOCman Search' from my site, since it does not 'highlight' the search items.
For an organization with half our content tied up in PDF publications, this was a must-have.
Thank you, thank you, thank you.
I bought it a few months ago and was pleased with it, as I use a lot of PDF files.
It indexes well, but I did discover a big security problem.
When people use the search and find content from your PDFs, it shows exactly where the files are.
The address isn't hidden in any way.
I contacted the author asking him to take care of this.
He said, "that shouldn't be too difficult to do".
But still nothing happened.
The next excuse was that there were some changes in his private life, and I said OK, I will wait.
A long time after that, the same excuse.
Nothing is happening.
So it's very disappointing that some authors don't take complaints seriously, not even with commercial components... in this case, anyway.
First off, the change in my private life is I had my first child. Any first-time parent knows the first two months are extremely hard, and running on 4 hours of sleep isn't ideal for writing code.
Secondly, this has been addressed in the latest version of PDF indexer.
Finally, this customer received a full refund because he misunderstood what the component was meant for.
Overall, it's a good extension that I would recommend without hesitation. I have almost 2000 pdfs on my site and it had no problem with indexing them all.
The install went simply, as do most Joomla components. The initial indexing of 100 PDF files proved to be a resource hog, as it sent warning bells and whistles to my ISP. PDF Indexer used over 20% of CPU resources.
My account was immediately shut down by a Linux script and I had to call my ISP to have my account reinstated. They told me that it was most likely caused by poorly written code and that they rarely have problems with Joomla (which they advertise).
After the initial indexing the program will skip the previously indexed files and just index new files added. I would suggest that you index maybe 10 files at a time.
I informed the author and I think he may be looking into this. It would probably be easy to add some sort of timer when indexing that allows 5 files to be indexed every minute or so. If you have your own server this will not be an issue, but those of us on shared systems will get hammered by our ISPs. Like I said, this should be a simple fix, adding a couple of lines of code.
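As a rough illustration of the timer idea above, here is a minimal sketch in Python that skips files already indexed and processes the rest in small batches with a pause in between. It is not code from PDF Indexer: every name is hypothetical, and `index_one` stands in for whatever actually indexes a single file. The defaults (5 files, 60 seconds) simply mirror the numbers suggested above.

```python
import time
from typing import Callable, Iterable, Iterator, List, Set


def pending_files(all_files: Iterable[str], already_indexed: Set[str]) -> List[str]:
    """Keep only files that have not been indexed yet, preserving order."""
    return [path for path in all_files if path not in already_indexed]


def batches(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def index_throttled(
    files: List[str],
    index_one: Callable[[str], None],
    batch_size: int = 5,
    pause_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Index files in small batches, pausing between batches.

    Returns the number of files handed to `index_one`.
    """
    done = 0
    chunks = list(batches(files, batch_size))
    for number, chunk in enumerate(chunks):
        for path in chunk:
            index_one(path)
            done += 1
        # No pause is needed after the final batch.
        if number < len(chunks) - 1:
            sleep(pause_seconds)
    return done
```

The `sleep` parameter is only there so the pause can be swapped out when trying the function without waiting.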
The second issue I have is that most of my PDF files were created by Adobe Acrobat 7. This caused PDF Indexer to not work properly and gave errors on every file created with Acrobat 7 (uses AES encryption - version 1.6 PDF files).
The reason is that PDF Indexer uses the program xpdf for its PDF-to-text conversion and then dumps the converted text into your database.
It is a widely known issue that xpdf is not capable of reading version 1.6 pdf files at this point in time. Therefore there were errors with every pdf ver 1.6 file. It works fine with ver 1.5 and below.
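To find out in advance which files fall into this group, the version can be read from the header that every PDF file starts with (for example `%PDF-1.6`). Below is a minimal sketch in Python; the function names and the 1.5 threshold are assumptions taken from this review, not part of PDF Indexer or xpdf. Note that a PDF can also declare a newer version inside its document catalog, which this header check does not see.

```python
import re
from typing import Optional, Tuple

# A PDF file begins with a header such as "%PDF-1.6".
_HEADER = re.compile(rb"%PDF-(\d+)\.(\d+)")


def pdf_version(head: bytes) -> Optional[Tuple[int, int]]:
    """Parse (major, minor) from the first bytes of a PDF, or None if absent."""
    match = _HEADER.search(head[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def read_pdf_version(path: str) -> Optional[Tuple[int, int]]:
    """Read the header version from a file on disk."""
    with open(path, "rb") as handle:
        return pdf_version(handle.read(1024))


def may_need_attention(version: Optional[Tuple[int, int]],
                       limit: Tuple[int, int] = (1, 5)) -> bool:
    """True when the header version is newer than `limit` (assumed threshold)."""
    return version is not None and version > limit
```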
If I had known all this up front, I would not have purchased the program in its current state.
Once you get past these problems on the backend, the frontend appears to work well, in that the PDFs that did index showed up properly in a search.
I feel that the author is not very responsive to these issues, given that he charges for a product that only half works.
This can be a great product, but it relies on an outdated engine (xpdf). The author should let folks know this up front before they purchase.
"PDF Indexer used over 20% of cpu resources."
This is the first report of this I have seen.
"It is a widely known issue that xpdf is not capable of reading version 1.6 pdf files"
It does read and index the files. There is just a warning at the beginning of the index.
I responded to this user within 1 hour of his complaint and then never heard from him again.
If you need to index PDFs, consider trying this component.
If you want to know in which webpages I have used it, for reference, send me an email to:
jose@joseargudo.es






